apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Wrong table path when using Hive to query xxx_rt table before the first compaction #4978

Open ghost opened 2 years ago

ghost commented 2 years ago

Describe the problem you faced

When using Hive to query the xxx_rt table, if there are no parquet files but only log files, we get a wrong table path. But once the parquet files are generated, the table path is correct and we can read the data. Is this expected behavior?

ERROR : Job failed with java.io.FileNotFoundException: File does not exist: hdfs://da-hdfs/tmp/hive/hadoop/90b7d231-0e0a-42e5-a72a-6faad6a9ac89/.hoodie
org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in path hdfs://da-hdfs/tmp/hive/hadoop/90b7d231-0e0a-42e5-a72a-6faad6a9ac89/.hoodie
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://da-hdfs/tmp/hive/hadoop/90b7d231-0e0a-42e5-a72a-6faad6a9ac89/.hoodie

Environment Description

xiarixiaoyao commented 2 years ago

@awpengfei Yes, for now this is expected behavior. Before any Hudi function is called, Hive filters out all files that start with '.', so all the log files are filtered out.
You have two ways to solve this problem:
1. Trigger compaction; after compaction, parquet files will be generated.
2. Modify the Hive source code so that it does not filter out .log files (see the sketch below).
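For context: Hadoop's default hidden-file rule drops every path whose name starts with '.' or '_', and Hudi log file names are dot-prefixed, which is why they disappear before any Hudi code runs. Below is a minimal sketch of what option 2 amounts to; the class name and the ".log." check are illustrative assumptions, not the actual Hive change:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Illustrative filter: keep the usual hidden-file behaviour, but make an
// exception for files that look like Hudi log files.
public class HudiLogAwarePathFilter implements PathFilter {

  @Override
  public boolean accept(Path path) {
    String name = path.getName();
    // Default Hadoop rule: names starting with '.' or '_' are treated as hidden.
    boolean hidden = name.startsWith(".") || name.startsWith("_");
    // Hudi log files start with '.' but contain ".log." in their name.
    return !hidden || name.contains(".log.");
  }
}
```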

CrazyBeeline commented 2 years ago

Maybe this can help you @awpengfei @xiarixiaoyao. Modify the source code like this:

org.apache.hudi.hadoop.HoodieParquetInputFormat [proposed changes were attached as screenshots, not preserved here]

org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat [screenshot not preserved]
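Since the screenshots are not preserved, the following is only a rough, hypothetical sketch of the kind of change being discussed: having listStatus() account for file slices that contain only log files, instead of returning an empty listing that makes Hive fall back to its scratch directory as the table path. The subclass name is made up and the body is a placeholder, not the patch from the screenshots:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hudi.hadoop.HoodieParquetInputFormat;

// Hypothetical subclass for illustration; the actual proposal edited
// HoodieParquetInputFormat and HoodieParquetRealtimeInputFormat directly.
public class LogAwareHoodieParquetInputFormat extends HoodieParquetInputFormat {

  @Override
  public FileStatus[] listStatus(JobConf job) throws IOException {
    // Default behaviour: only base (parquet) files of the latest file slices are
    // returned, so a file group with nothing but log files contributes no paths.
    FileStatus[] baseFiles = super.listStatus(job);
    // A real fix would also surface the log files of base-file-less file slices
    // so the realtime input format can merge them at read time.
    return baseFiles;
  }
}
```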

nsivabalan commented 2 years ago

Is this written using Flink? With Spark, we won't create log files directly: base data files are created first, and only then are log files created for the file groups.

nsivabalan commented 2 years ago

But it might be good to get it fixed irrespective of that. Just trying to gauge the use case.

nsivabalan commented 2 years ago

@xiarixiaoyao: Alexey did a revamp of all query engine code paths recently with 0.11. Do we still have this issue after 0.11? Do you have any idea? Do we have a tracking ticket for this?

xiarixiaoyao commented 2 years ago

@nsivabalan I do not think 0.11 solves this problem. @CrazyBeeline thanks for your help. Could you please raise a PR to solve this problem? Thanks very much.

nsivabalan commented 2 years ago

@CrazyBeeline: can you put up a patch with the fix you have? Happy to review and get it landed. By the way, are you using HBase or some other setup? Wondering how you ended up with a file group that has a log file but no base file.

nsivabalan commented 2 years ago

@CrazyBeeline: gentle ping.

nsivabalan commented 1 year ago

@danny0405 @xiarixiaoyao: do we know if this has been fixed at any point?

danny0405 commented 1 year ago

No, we have not fixed it. Neither Hive nor Trino can access a file group with only log files. Can we move this to a higher priority for release 0.13.0 and solve it then?

codope commented 1 year ago

@ad1happy2go to reproduce