apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] The parquet files for the MOR table have been generated, but the RO table in Hive still cannot query the latest data in the parquet files. #10907

Open Toroidals opened 4 months ago

Toroidals commented 4 months ago


Describe the problem you faced

When writing to a Hudi MOR table from Flink and syncing to Hive, queries against the ro table in Hive lag significantly (6+ hours). The HDFS directory shows that .parquet files have been generated within the past few minutes, but the ro table in Hive does not return the data from those files. Only data older than roughly 6 hours is visible.

To Reproduce Steps to reproduce the behavior:

// hive sync conf
options.put(FlinkOptions.HIVE_SYNC_ENABLED.key(), "true");
options.put(FlinkOptions.HIVE_SYNC_MODE.key(), "hms");
options.put(FlinkOptions.HIVE_SYNC_DB.key(), "ods_rbs");
options.put(FlinkOptions.HIVE_SYNC_TABLE.key(), "ods_rbs_rbscmfprd_cmf_fin_acct_distributions_cdc");
options.put(FlinkOptions.HIVE_SYNC_CONF_DIR.key(), "/etc/hive/conf");
options.put(FlinkOptions.HIVE_SYNC_METASTORE_URIS.key(), connectInfo.get("hive_metastore_url"));
options.put(FlinkOptions.HIVE_SYNC_JDBC_URL.key(), connectInfo.get("conn_url"));
options.put(FlinkOptions.HIVE_SYNC_SUPPORT_TIMESTAMP.key(), "true");
options.put(FlinkOptions.HIVE_SYNC_SKIP_RO_SUFFIX.key(), "true");

// compaction conf
options.put(FlinkOptions.COMPACTION_TASKS.key(), 4);
options.put(FlinkOptions.COMPACTION_TRIGGER_STRATEGY.key(), "num_or_time");
options.put(FlinkOptions.COMPACTION_DELTA_COMMITS.key(), "5");
options.put(FlinkOptions.COMPACTION_DELTA_SECONDS.key(), "300");
options.put(FlinkOptions.COMPACTION_MAX_MEMORY.key(), "1024");
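For readers following along, here is a minimal sketch of what the num_or_time settings above are expected to mean: a compaction gets scheduled once either 5 delta commits or 300 seconds have accumulated since the last compaction. The class and method names are hypothetical and this is a simplified model, not Hudi's actual scheduling code. Note that scheduling a compaction is not the same as finishing it; a read-optimized table reads only base parquet files, so it can only reflect data up to the last completed compaction.

```java
/**
 * Simplified model of a "num_or_time" compaction trigger.
 * Hypothetical illustration only; this is not Hudi's implementation.
 */
class NumOrTimeTrigger {
    private final int deltaCommitsThreshold;
    private final long deltaSecondsThreshold;

    NumOrTimeTrigger(int deltaCommitsThreshold, long deltaSecondsThreshold) {
        this.deltaCommitsThreshold = deltaCommitsThreshold;
        this.deltaSecondsThreshold = deltaSecondsThreshold;
    }

    /** Returns true when either the commit-count or the elapsed-time condition is met. */
    boolean shouldSchedule(int deltaCommitsSinceLastCompaction, long secondsSinceLastCompaction) {
        return deltaCommitsSinceLastCompaction >= deltaCommitsThreshold
                || secondsSinceLastCompaction >= deltaSecondsThreshold;
    }

    public static void main(String[] args) {
        // Thresholds taken from the configuration in this issue: 5 commits or 300 seconds.
        NumOrTimeTrigger trigger = new NumOrTimeTrigger(5, 300);
        System.out.println(trigger.shouldSchedule(2, 120)); // neither threshold reached
        System.out.println(trigger.shouldSchedule(5, 10));  // commit threshold reached
        System.out.println(trigger.shouldSchedule(1, 600)); // time threshold reached
    }
}
```

If compactions are being scheduled at this rate but take much longer than 300 seconds each to finish, pending compaction plans can pile up, which would be consistent with the lag described here.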

select _flink_cdc_ts_ms, _flink_cdc_table, last_update_date
from ods_rbs.ods_rbs_rbscmfprd_cmf_fin_acct_distributions_cdc a
order by _flink_cdc_ts_ms desc
limit 50;

+--------------------------+-----------------------------+------------------------+
|     _flink_cdc_ts_ms     |      _flink_cdc_table       |    last_update_date    |
+--------------------------+-----------------------------+------------------------+
| 2024-03-21 21:15:02.773  | cmf_fin_acct_distributions  | 2024-03-21 18:05:02.0  |
| 2024-03-21 21:15:02.773  | cmf_fin_acct_distributions  | 2024-03-21 18:05:02.0  |
| 2024-03-21 21:15:02.772  | cmf_fin_acct_distributions  | 2024-03-21 18:05:02.0  |
| 2024-03-21 21:15:02.772  | cmf_fin_acct_distributions  | 2024-03-21 18:03:07.0  |
| 2024-03-21 21:15:02.772  | cmf_fin_acct_distributions  | 2024-03-21 18:03:43.0  |
| 2024-03-21 21:15:02.771  | cmf_fin_acct_distributions  | 2024-03-21 18:03:00.0  |
| 2024-03-21 21:15:02.771  | cmf_fin_acct_distributions  | 2024-03-21 18:02:49.0  |
| 2024-03-21 21:15:02.771  | cmf_fin_acct_distributions  | 2024-03-21 18:02:08.0  |
| 2024-03-21 21:15:02.77   | cmf_fin_acct_distributions  | 2024-03-21 18:02:08.0  |


ad1happy2go commented 4 months ago

@Toroidals Just to check, if you just restart the hive cli, do you see latest data?

Toroidals commented 4 months ago

"Just to check, if you just restart the hive cli, do you see the latest data?" I have tried restarting the Hive client, but I still cannot query the latest data. (Four screenshots attached.)

Toroidals commented 4 months ago

@ad1happy2go My table is very large, around 350 million records, and each individual Parquet file is around 400 MB. I am using index type=bucket, hoodie.index.bucket.engine=SIMPLE, and hoodie.bucket.index.num.buckets=128. Would increasing hoodie.bucket.index.num.buckets and the number of compaction tasks help mitigate this issue, or are there other good solutions available?
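As a rough back-of-the-envelope sketch of the sizing question, the snippet below estimates how many records land in each bucket for a given bucket count. It assumes, purely for illustration, that all 350 million records share one set of buckets and that keys hash uniformly; neither is confirmed in this thread. The class, the method names, and the hash function are hypothetical and are not claimed to match Hudi's bucket assignment.

```java
/**
 * Back-of-the-envelope bucket sizing.
 * Hypothetical helper; the hash below is a simplified stand-in, not Hudi's.
 */
class BucketSizing {
    /** Average records per bucket under a uniform-distribution assumption. */
    static long recordsPerBucket(long totalRecords, int numBuckets) {
        return totalRecords / numBuckets;
    }

    /** Simplified hash-modulo bucket assignment for a record key. */
    static int bucketFor(String recordKey, int numBuckets) {
        // Mask the sign bit so the result is always in [0, numBuckets).
        return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        long total = 350_000_000L; // record count reported in this issue
        for (int buckets : new int[] {128, 256, 512}) {
            System.out.println(buckets + " buckets -> about "
                    + recordsPerBucket(total, buckets) + " records per bucket");
        }
    }
}
```

Under these assumptions, doubling the bucket count roughly halves the data each compaction task has to rewrite per file group, at the cost of more files. Whether that actually reduces the lag depends on where the compaction time is being spent.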

ad1happy2go commented 4 months ago

@Toroidals Ideally you should not see this issue at all. I see a parquet file with size 0. Is that the parquet file whose data is missing in Hive?

@danny0405 Any insights here?

danny0405 commented 4 months ago

Is the compaction triggered normally?

xicm commented 3 months ago

Can you check the last_commit_time_sync in hive? And check if the compaction instant is finished. Maybe the compaction is still running.
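One way to check whether a compaction instant has finished is to list the table's .hoodie directory and look for compaction instants that have no matching completed commit. The sketch below works on a plain list of file names and assumes the 0.x timeline layout, where a compaction appears as `<instant>.compaction.requested` and `<instant>.compaction.inflight` and completes as `<instant>.commit`. The class name, method name, and instant times are hypothetical examples, not taken from this table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/**
 * Finds compaction instants that were requested or are inflight but not yet completed.
 * Assumes 0.x-style timeline file names; hypothetical helper, not a Hudi API.
 */
class TimelineCheck {
    static List<String> pendingCompactions(List<String> timelineFileNames) {
        Set<String> completed = new HashSet<>();
        TreeSet<String> started = new TreeSet<>();
        for (String name : timelineFileNames) {
            int dot = name.indexOf('.');
            if (dot < 0) {
                continue; // not an instant file
            }
            String instant = name.substring(0, dot);
            if (name.endsWith(".compaction.requested") || name.endsWith(".compaction.inflight")) {
                started.add(instant);
            } else if (name.endsWith(".commit")) {
                // A finished compaction on a MOR table shows up as a completed commit.
                completed.add(instant);
            }
        }
        started.removeAll(completed);
        return new ArrayList<>(started); // sorted by instant time
    }

    public static void main(String[] args) {
        // Made-up file names for illustration only.
        List<String> files = Arrays.asList(
                "20240321150000000.compaction.requested",
                "20240321150000000.compaction.inflight",
                "20240321150000000.commit",
                "20240321210000000.compaction.requested",
                "20240321211000000.deltacommit");
        System.out.println("Pending compactions: " + pendingCompactions(files));
    }
}
```

If the oldest pending instant is several hours old, that would line up with the ro table only showing data from before that time.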

ad1happy2go commented 3 months ago

@Toroidals Did you get a chance to check this? Were you able to identify the root cause of the issue?