apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[bug] Repeating hudi_table_changes query on the same large table gets stuck #10096

Open zyclove opened 10 months ago

zyclove commented 10 months ago

Describe the problem you faced

To Reproduce

Steps to reproduce the behavior:

1. Launch spark-sql:

   spark-sql --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.0 \
     --master yarn --driver-memory 8g --num-executors 10 \
     --conf spark.dynamicAllocation.maxExecutors=20 \
     --executor-memory 4G --executor-cores 2 \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
     --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
     --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar \
     --conf spark.sql.autoBroadcastJoinThreshold=2G \
     --conf spark.memory.storageFraction=0.5 \
     --conf spark.sql.broadcastTimeout=60000 \
     --conf spark.yarn.priority=5 \
     --conf spark.sql.broadcastTimeout=600000 \
     --conf spark.network.timeout=600000s \
     --conf spark.eventLog.enable=false \
     --conf spark.driver.maxResultSize=4g \
     --conf spark.driver.extraJavaOptions=-XX:-UseGCOverheadLimit \
     --conf spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit \
     --name zyc_test --conf spark.dynamicAllocation.enabled=false

2. SELECT count(1) FROM hudi_table_changes('bi_ods_real.ods_log_smart_datapoint_report_batch_rt', 'latest_state', '20231114033500000', '20231114040500000');

The results are returned normally.

3. Run the same query again: SELECT count(1) FROM hudi_table_changes('bi_ods_real.ods_log_smart_datapoint_report_batch_rt', 'latest_state', '20231114033500000', '20231114040500000');

The second run gets stuck and never returns; it is hard to exit even with Ctrl+C.

[screenshot attached]
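For reference, hudi_table_changes takes (table or path, changeType, beginTime [, endTime]); changeType is 'latest_state' or 'cdc', and beginTime may be a commit instant or 'earliest'. A minimal sketch of the same query without a fixed end instant, assuming the same table as above:

    -- 'earliest' reads changes starting from the first commit on the
    -- timeline; omitting endTime reads up to the latest commit.
    SELECT count(1) FROM hudi_table_changes(
      'bi_ods_real.ods_log_smart_datapoint_report_batch_rt',
      'latest_state',
      'earliest'
    );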

Expected behavior

The repeated query should return results normally, just like the first run.

Environment Description

ad1happy2go commented 10 months ago

@zyclove I tried the same scenario and it worked fine for me; the queries ran without issues. Can you try the repro below in your setup? Is this only happening for one table?

CREATE TABLE hudi_table (
    ts BIGINT,
    uuid STRING,
    rider STRING,
    driver STRING,
    fare DECIMAL(10,4),
    city STRING
) USING HUDI
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'uuid',
  preCombineField = 'ts'
)
PARTITIONED BY (city);

INSERT INTO hudi_table
VALUES
(1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',100001.0001,'san_francisco'),
(1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',100001.0001,'san_francisco');

INSERT INTO hudi_table
VALUES
(1695159649089,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',100001.0001,'san_francisco'),
(1695091554790,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',100001.0001,'san_francisco');

INSERT INTO hudi_table
VALUES
(1695159649091,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',100001.0001,'san_francisco'),
(1695091554790,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',100001.0001,'san_francisco');

SELECT count(1) FROM
hudi_table_changes('hudi_table_rt', 'latest_state', '20231114033500000', '20231116152700000');
SELECT count(1) FROM
hudi_table_changes('hudi_table_rt', 'latest_state', '20231114033500000', '20231116152700000');
SELECT count(1) FROM
hudi_table_changes('hudi_table_rt', 'latest_state', '20231114033500000', '20231116152700000');
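The begin/end instants passed above can be taken from the table's timeline, for example via the show_commits procedure (a quick sketch; adjust the table name to yours):

    -- Lists recent commits; their instant times are valid begin/end
    -- arguments for hudi_table_changes.
    CALL show_commits(table => 'hudi_table', limit => 10);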
watermelon12138 commented 10 months ago

I cannot reproduce it either.

kazdy commented 10 months ago

--executor-memory 4G --executor-cores 2

This can be too small for a large table. hudi_table_changes is just a wrapper on top of the Spark DataSource incremental query; do you see the same issue with the raw incremental query as well?
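A minimal sketch of the equivalent raw incremental read through the DataSource options (the base path below is hypothetical; substitute the table's actual storage location):

    -- Hypothetical base path; point this at the table's location.
    CREATE TEMPORARY VIEW incr_view USING hudi OPTIONS (
      path '/warehouse/bi_ods_real/ods_log_smart_datapoint_report_batch_rt',
      'hoodie.datasource.query.type' 'incremental',
      'hoodie.datasource.read.begin.instanttime' '20231114033500000',
      'hoodie.datasource.read.end.instanttime' '20231114040500000'
    );

    SELECT count(1) FROM incr_view;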