[SUPPORT] Incremental query not working on COW table

apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.

https://hudi.apache.org/

Apache License 2.0

[SUPPORT] Incremental query not working on COW table #10850

Open NishantBaheti opened 4 months ago

NishantBaheti commented 4 months ago

Error

Error Category: QUERY_ERROR; AnalysisException: Found duplicate column(s) in the data schema: _hoodie_commit_seqno, _hoodie_commit_time, _hoodie_file_name, _hoodie_partition_path, _hoodie_record_key

Code

hudi_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': start_time,
    'hoodie.datasource.read.end.instanttime': end_time,
}
df = spark.read \
    .format("org.apache.hudi") \
    .options(**hudi_options) \
    .load(tablePath)

danny0405 commented 4 months ago

Hi @NishantBaheti, thanks for your feedback. Could you also share the release versions of Spark and Hudi that you are using?

NishantBaheti commented 4 months ago

Hello, I am using this jar

hudi-spark3.3-bundle_2.12-0.14.1.jar
Spark 3.3
Hudi 0.14.1

ad1happy2go commented 4 months ago

@NishantBaheti I checked earlier, and incremental queries work fine with 0.14.1.

Can you paste the full reproducible script, or the table/writer properties you used to populate the table? Which writer did you use?

I used the code below to quickly try to reproduce the issue: https://gist.github.com/ad1happy2go/e7a2f8c695fde4c3db060a7113610931

NishantBaheti commented 4 months ago

doesn't work. another issue.

ad1happy2go commented 3 months ago

@NishantBaheti Were you able to get it resolved? Can you share the full stack trace? It looks like "Unable to load class" means there are some library conflicts.

NishantBaheti commented 3 months ago

@ad1happy2go I moved to a MOR table. The COW configuration felt a little unstable, and I had to rush the project to production.

ad1happy2go commented 3 months ago

@NishantBaheti Thanks for the update. It is surprising that MOR worked but COW didn't work for you.

NishantBaheti commented 3 months ago

@ad1happy2go COW tables were failing a lot: Parquet "file not found" errors at read time, incremental queries not working, or the error mentioned above. I'm not saying MOR is perfect, but I had to put something into production, so I went with a static MOR configuration with quick compaction and cleaning, so that the Athena ro tables behave like Delta tables from the Delta framework and support point queries using the record index. I hope a stable version of Hudi gets figured out soon, like Delta did.

ad1happy2go commented 3 months ago

@NishantBaheti Incremental queries can hit a FileNotFoundException if a file needed by the query has already been deleted by the cleaner. Setting hoodie.datasource.read.incr.fallback.fulltablescan.enable to true works around this issue.
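
A minimal sketch of that workaround in PySpark, reusing the option keys from the snippet at the top of this thread. The helper name incremental_read_options and its parameters are hypothetical, added only for illustration, and passing the flag as the string "true" is an assumption about how the value is supplied. This is not a verified fix for the duplicate-column error reported above.

```python
def incremental_read_options(begin_time, end_time=None, fallback_to_full_scan=True):
    """Build Hudi reader options for an incremental query.

    Hypothetical helper: it only assembles the option keys quoted in this thread.
    """
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_time,
        # Fall back to a full table scan when files for the requested
        # commit range were already removed by the cleaner.
        "hoodie.datasource.read.incr.fallback.fulltablescan.enable":
            str(fallback_to_full_scan).lower(),
    }
    if end_time is not None:
        opts["hoodie.datasource.read.end.instanttime"] = end_time
    return opts


# Usage, assuming an existing SparkSession `spark` and the `tablePath`,
# `start_time`, and `end_time` variables from the original snippet:
# df = (
#     spark.read.format("org.apache.hudi")
#     .options(**incremental_read_options(start_time, end_time))
#     .load(tablePath)
# )
```

The fallback trades speed for availability: when the cleaner has removed files in the requested range, the read falls back to a full table scan instead of failing, which can be much slower on a large table.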