apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Reading Parquet files whose names start with a dot (.) with Spark fails with an error like "Caused by: java.lang.RuntimeException: hdfs://path/.0726018d-0e03-48e2-9b88-d7e228cf1aff_0-4-0-0_20230111085451189.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [39, 78, -93, 10]" #8810

Open king5holiday opened 1 year ago

king5holiday commented 1 year ago

I want to know why some Parquet files whose names start with a dot (.) appear when I write data to Hudi. And how can I filter out these files when I read the Hudi table with Spark? Thank you very much!
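For context on the error text: the expected bytes [80, 65, 82, 49] are the ASCII codes for PAR1, the marker that a complete Parquet file ends with. The found bytes are something else, so one plausible reading is that the dot-prefixed file was never finished. Below is a minimal sketch of that tail check in Python, assuming the file has been copied to a local or mounted path (the function name is hypothetical, and it does not talk to HDFS directly):

```python
import os

# Parquet tail marker; as a byte list this is [80, 65, 82, 49],
# the "expected magic number" quoted in the error message.
PARQUET_MAGIC = b"PAR1"


def has_parquet_tail_magic(path):
    """Return True if the last four bytes of the file are the Parquet magic."""
    size = os.path.getsize(path)
    if size < len(PARQUET_MAGIC):
        # Too short to hold the marker at all.
        return False
    with open(path, "rb") as f:
        # Jump to four bytes before the end and compare.
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        return f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC
```

A file that fails this check cannot be opened by a Parquet reader, whichever engine does the read. Passing it only shows that the tail marker is present, not that the whole file is valid.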

danny0405 commented 1 year ago

Did you write the table using Flink? Flink can create some intermediate data files whose names start with a dot (.). And which version of Hudi did you use?

king5holiday commented 1 year ago

Flink

@danny0405 Thank you for your reply! Yes, I wrote the data to Hudi with Flink, and the Hudi version is 0.11.1.

danny0405 commented 1 year ago

Can you try 0.12.3 or 0.13.1? It should be fixed there.

king5holiday commented 1 year ago

@danny0405 Thank you! Should the hudi-spark-bundle and hudi-flink-bundle versions both be changed to 0.12.3 or 0.13.1 at the same time?

danny0405 commented 1 year ago

Yeah, we'd better upgrade the bundle jars together.

king5holiday commented 1 year ago

I changed the hudi-spark-bundle version from 0.11.1 to 0.13.1 and tried to read the data with Spark, but the same error was reported again. When I read with Flink, even on the old version, it works fine. So does Flink have a mechanism that filters out or removes the Parquet files whose names start with a dot (.)?

danny0405 commented 1 year ago

Did you read the table by specifying hudi as the format, or did you just read it as a raw Parquet table?

king5holiday commented 1 year ago

Hi, here is the code: spark.read.format("org.apache.hudi").load(HUDIPATH)

danny0405 commented 1 year ago

Not sure. Maybe you can delete the hidden files manually; there is no automatic fix when upgrading to 0.13.1.
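Before deleting anything by hand, it may help to list the candidates first. This is a rough Python sketch, assuming the table directory is reachable as a local or mounted path (for a table on HDFS the same filter would have to be applied through an HDFS client or CLI listing instead). The function name is hypothetical, and skipping the .hoodie metadata directory is a precaution added here, not something confirmed in this thread:

```python
import os


def find_hidden_parquet_files(table_root):
    """Return sorted paths of dot-prefixed .parquet files under table_root."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(table_root):
        # Assumption: never descend into Hudi's .hoodie metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".hoodie"]
        for name in filenames:
            # Match on both the leading dot and the .parquet suffix, so other
            # dot-prefixed files such as .hoodie_partition_metadata are skipped.
            if name.startswith(".") and name.endswith(".parquet"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

The function only reports paths and deletes nothing, so the list can be reviewed (and the table backed up) before any file is removed manually.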

king5holiday commented 1 year ago

OK, that may be the most practical workaround for now. Looking forward to a fix for this bug in a future version. Thank you!