Shubham21k opened this issue 8 months ago
@Shubham21k Can you try with 0.14.1 once? Also, async services don't work with the DataSource writer.
I tried to reproduce this but was unable to. Can you check whether you can enhance the code and reproduce it?
Code here - https://gist.github.com/ad1happy2go/364e66c4fa84229110f28994cc4a277f/edit
@Shubham21k What queries are you trying on this data? Does select * work? For point-in-time queries, this error is expected if the commit is not archived but has already been cleaned.
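For context, a point-in-time (time-travel) read in Spark looks roughly like the sketch below; the instant timestamp is a placeholder, and the table path is taken from the write snippet later in this thread. If the requested instant has been cleaned but not yet archived, the error described above is expected.

val asOfDf = spark.read.format("org.apache.hudi")
  .option("as.of.instant", "20231001120000")   // hypothetical commit time to query the table as of
  .load(hudiOutputTablePath)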
@ad1happy2go I am not able to open the code link you shared for reproducing the error.
Is there a workaround for this issue? We are facing a similar issue as well.
@Shubham21k Code link here - https://gist.github.com/ad1happy2go/364e66c4fa84229110f28994cc4a277f
Async services are meant to run with streaming workloads like Hudi Streamer, so that table services can run asynchronously and don't block the ingestion of the next micro-batch. Using them with DataSource (batch) writers doesn't make sense; inline table services will be kicked in instead.
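For illustration, a batch DataSource write relying on inline table services might look roughly like the sketch below. This is not the reporter's exact job: the table name, key/precombine fields, and path are placeholders, while the option names are standard Hudi configs.

import org.apache.spark.sql.SaveMode

val hudiOptions = Map(
  "hoodie.table.name"                        -> "my_table",          // placeholder
  "hoodie.datasource.write.recordkey.field"  -> "id",                // placeholder
  "hoodie.datasource.write.precombine.field" -> "ts",                // placeholder
  "hoodie.datasource.write.operation"        -> "insert_overwrite",
  "hoodie.metadata.enable"                   -> "true",
  "hoodie.clean.automatic"                   -> "true",              // cleaning runs inline after each commit
  "hoodie.compact.inline"                    -> "true"               // compaction (for MOR tables) runs inline
)

df.write.format("org.apache.hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save(hudiOutputTablePath)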
@shravanak Which Hudi version are you using? Are you also using insert_overwrite? Can you elaborate?
We are using insert write mode with Hudi 0.14.0. I think the file or partition it is referring to as missing might be from before we upgraded to 0.14.0, when we were on 0.12.2.
@shravanak That may well be the cause. Did you face this issue with other tables as well?
@shravanak Are you still facing this issue? Let us know if you need help here.
Hey @ad1happy2go: if this turns out to be an MDT data consistency issue, do keep me posted. Thanks.
We are incrementally writing to a Hudi table with insert_overwrite operations. Recently, we enabled the Hudi metadata table for these tables. However, after a few days we started to encounter a FileNotFoundException while reading these tables from Athena (with metadata listing enabled). Upon further investigation, we observed that the metadata contained references to older files that had been cleaned up by the cleaner and are no longer available.

Steps to reproduce the behavior:
Create a simple DataFrame and write it to a Hudi table incrementally with these properties:
df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiOutputTablePath)
After a few incremental writes, some of the base files get replaced, but the metadata table does not get updated properly and continues to retain pointers to the old files as well.
If you try reading the table using Spark or Athena (keep in mind to enable metadata listing while reading), you will get a FileNotFoundException. Upon disabling metadata listing on the read side, there is no error and reads work fine (see the read-side sketch after these steps).
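For reference, a minimal sketch of the read-side toggle described in the last step; the table path comes from the write snippet above, and hoodie.metadata.enable is the standard Hudi config.

val withMetadataListing = spark.read.format("org.apache.hudi")
  .option("hoodie.metadata.enable", "true")    // file listing via the metadata table -> FileNotFoundException
  .load(hudiOutputTablePath)

val withDirectListing = spark.read.format("org.apache.hudi")
  .option("hoodie.metadata.enable", "false")   // direct storage listing -> reads work fine
  .load(hudiOutputTablePath)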
Note: We have observed this issue only for insert_overwrite operations; for upsert operations the table's metadata gets updated correctly.
Expected behavior
It is expected that the Hudi metadata table gets updated correctly.
Environment Description
Hudi version : 0.13.1
Spark version : 3.2.1
Hive version : NA
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) :
Additional context
The timeline also contains replaceCommits for the corrupted tables (these are not present in the case of upsert tables).
Also, here is the output of the metadata validate-files command run from the hudi-cli on the corrupted table: