Open Toroidals opened 4 months ago
@Toroidals Just to check, if you just restart the hive cli, do you see latest data?
I have tried restarting the Hive client, but still cannot query the latest data.
@ad1happy2go The amount of data in my table is very large, around 350 million records. Each individual Parquet file is around 400 MB. I am using hoodie.index.type=BUCKET, a SIMPLE bucket index engine, and hoodie.bucket.index.num.buckets=128. Can increasing hoodie.bucket.index.num.buckets and the number of compaction tasks help mitigate this issue, or are there other good solutions available?
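For reference, the bucket-index settings described above can be collected in one place. This is a hedged sketch: the key names follow Hudi's documented `hoodie.index.type` and `hoodie.bucket.index.num.buckets` options, and `hoodie.index.bucket.engine` is assumed to be the engine-type key; the values mirror this table's setup.

```java
import java.util.HashMap;
import java.util.Map;

public class BucketIndexConf {
    // Collect the bucket-index options discussed in the comment above.
    // Note: for a SIMPLE bucket index the bucket count is fixed at table
    // creation; changing it later generally requires rewriting the data.
    public static Map<String, String> bucketIndexOptions(int numBuckets) {
        Map<String, String> opts = new HashMap<>();
        opts.put("hoodie.index.type", "BUCKET");
        opts.put("hoodie.index.bucket.engine", "SIMPLE");
        opts.put("hoodie.bucket.index.num.buckets", String.valueOf(numBuckets));
        return opts;
    }
}
```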
@Toroidals Ideally you should not see this issue at all. I see a parquet file with size 0. Was that the parquet file you were missing in Hive?
@danny0405 Any insights here?
Is compaction being triggered normally? Can you check `last_commit_time_sync` in Hive, and whether the compaction instant has finished? Maybe the compaction is still running.
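One way to check for a still-running compaction is to scan the table's `.hoodie` timeline directory for pending compaction instants. This is a hedged sketch, assuming the usual Hudi timeline filename layout: pending compactions appear as `<instant>.compaction.requested` / `<instant>.compaction.inflight`, while completed ones appear as `<instant>.commit`.

```java
import java.util.ArrayList;
import java.util.List;

public class TimelineCheck {
    // Return the instant times of compactions that have been scheduled or
    // started but not yet completed, given filenames from .hoodie/.
    public static List<String> pendingCompactions(List<String> timelineFiles) {
        List<String> pending = new ArrayList<>();
        for (String f : timelineFiles) {
            if (f.endsWith(".compaction.requested") || f.endsWith(".compaction.inflight")) {
                // Strip the action suffix to recover the instant time.
                pending.add(f.substring(0, f.indexOf('.')));
            }
        }
        return pending;
    }
}
```

If this list is non-empty, the ro table will keep serving the older base files until the listed compactions finish.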
@Toroidals Did you get a chance to check it? Were you able to identify the root cause of the issue?
Tips before filing an issue
- Have you gone through our FAQs?
- Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
- If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
When writing to a Hudi MOR table from Flink and syncing to Hive, there is significant data latency (6+ hours) when querying the ro table in Hive. The HDFS directory shows that .parquet files have been generated within the past few minutes, but the ro table in Hive still does not show that recent data; only data older than roughly 6 hours is visible.
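For context on the symptom: the ro (read-optimized) view of a MOR table serves only compacted base files, so a 6-hour lag usually means the last completed compaction is about 6 hours behind the latest delta commit. A minimal sketch of measuring that gap from instant timestamps, assuming Hudi's millisecond-granularity `yyyyMMddHHmmssSSS` instant format:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class RoLatency {
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    // Staleness of the ro view is roughly the gap between the latest
    // delta commit and the last completed compaction instant.
    public static long lagHours(String lastCompactionInstant, String latestDeltaCommit) {
        LocalDateTime compaction = LocalDateTime.parse(lastCompactionInstant, FMT);
        LocalDateTime delta = LocalDateTime.parse(latestDeltaCommit, FMT);
        return Duration.between(compaction, delta).toHours();
    }
}
```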
To Reproduce Steps to reproduce the behavior:
```java
// hive sync conf
options.put(FlinkOptions.HIVE_SYNC_ENABLED.key(), "true");
options.put(FlinkOptions.HIVE_SYNC_MODE.key(), "hms");
options.put(FlinkOptions.HIVE_SYNC_DB.key(), "ods_rbs");
options.put(FlinkOptions.HIVE_SYNC_TABLE.key(), "ods_rbs_rbscmfprd_cmf_fin_acct_distributions_cdc");
options.put(FlinkOptions.HIVE_SYNC_CONF_DIR.key(), "/etc/hive/conf");
options.put(FlinkOptions.HIVE_SYNC_METASTORE_URIS.key(), connectInfo.get("hive_metastore_url"));
options.put(FlinkOptions.HIVE_SYNC_JDBC_URL.key(), connectInfo.get("conn_url"));
options.put(FlinkOptions.HIVE_SYNC_SUPPORT_TIMESTAMP.key(), "true");
options.put(FlinkOptions.HIVE_SYNC_SKIP_RO_SUFFIX.key(), "true");

// compaction conf
options.put(FlinkOptions.COMPACTION_TASKS.key(), 4);
options.put(FlinkOptions.COMPACTION_TRIGGER_STRATEGY.key(), "num_or_time");
options.put(FlinkOptions.COMPACTION_DELTA_COMMITS.key(), "5");
options.put(FlinkOptions.COMPACTION_DELTA_SECONDS.key(), "300");
options.put(FlinkOptions.COMPACTION_MAX_MEMORY.key(), "1024");
```
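A hedged sketch of what the `num_or_time` trigger strategy configured above implies: compaction is scheduled once either 5 delta commits have accumulated or 300 seconds have elapsed. (The actual scheduling lives inside Hudi's compaction planner; this only illustrates the OR semantics of the two thresholds.)

```java
public class CompactionTrigger {
    // "num_or_time": schedule compaction when either the delta-commit
    // count or the elapsed-time threshold is reached.
    public static boolean shouldTrigger(int deltaCommits, long elapsedSeconds,
                                        int commitThreshold, long secondsThreshold) {
        return deltaCommits >= commitThreshold || elapsedSeconds >= secondsThreshold;
    }
}
```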
```sql
select
  _flink_cdc_ts_ms
  ,_flink_cdc_table
  ,last_update_date
from ods_rbs.ods_rbs_rbscmfprd_cmf_fin_acct_distributions_cdc a
order by _flink_cdc_ts_ms desc
limit 50;
```

```
+--------------------------+-----------------------------+------------------------+
|     _flink_cdc_ts_ms     |      _flink_cdc_table       |    last_update_date    |
+--------------------------+-----------------------------+------------------------+
| 2024-03-21 21:15:02.773  | cmf_fin_acct_distributions  | 2024-03-21 18:05:02.0  |
| 2024-03-21 21:15:02.773  | cmf_fin_acct_distributions  | 2024-03-21 18:05:02.0  |
| 2024-03-21 21:15:02.772  | cmf_fin_acct_distributions  | 2024-03-21 18:05:02.0  |
| 2024-03-21 21:15:02.772  | cmf_fin_acct_distributions  | 2024-03-21 18:03:07.0  |
| 2024-03-21 21:15:02.772  | cmf_fin_acct_distributions  | 2024-03-21 18:03:43.0  |
| 2024-03-21 21:15:02.771  | cmf_fin_acct_distributions  | 2024-03-21 18:03:00.0  |
| 2024-03-21 21:15:02.771  | cmf_fin_acct_distributions  | 2024-03-21 18:02:49.0  |
| 2024-03-21 21:15:02.771  | cmf_fin_acct_distributions  | 2024-03-21 18:02:08.0  |
| 2024-03-21 21:15:02.77   | cmf_fin_acct_distributions  | 2024-03-21 18:02:08.0  |
```

Expected behavior
A clear and concise description of what you expected to happen.
Environment Description
Hudi version :
Spark version :
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) :
Running on Docker? (yes/no) :
Additional context
Add any other context about the problem here.
Stacktrace
Add the stacktrace of the error.