mzheng-plaid opened this issue 1 month ago
Did you use Spark for data ingestion?
@danny0405 Sorry, yes, that's correct.
Did you enable speculative execution in Spark?
No, speculative execution is not enabled
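(For reference, a minimal sketch of confirming this from a spark-shell session; spark.speculation is the standard Spark flag, and the default of false is Spark's own:)

// Prints false unless speculative execution was enabled for the job.
println(spark.sparkContext.getConf.getBoolean("spark.speculation", defaultValue = false))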
@mzheng-plaid Did you try disabling the vectorised reader with spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")?
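For example, a minimal sketch assuming a spark-shell session and the redacted file name used later in this thread:

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
// Force a full scan so any decoding error in the non-vectorized path surfaces.
spark.read.format("parquet").load("xxx.parquet").count()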
Were you able to read this parquet file using spark.read.parquet?
Yes, the parquet file itself is corrupted. Trying to read the parquet file with pqrs segfaults:
❯ RUST_BACKTRACE=1 pqrs cat ./xxx.parquet --json | jq '.model_output' | sort | uniq -c
Trying to read with spark.read.format("parquet").load("xxx.parquet") fails as expected (regardless of spark.sql.parquet.enableVectorizedReader; I tried with it set to false):
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [foo] optional float foo at value 204757 out of 463825, 4757 out of 20000 in currentPage. repetition level: 0, definition level: 1
at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:553)
at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:439)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:229)
... 19 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
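Since the corruption is suspected in the dictionary encoding, the footer metadata (which typically still decodes even when a data page does not) may be worth inspecting. Below is a minimal Scala sketch against the long-stable parquet-java API; the file name matches the redaction above and everything else is an assumption:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

// Read only the footer and print per-row-group metadata for column foo;
// the column chunk statistics show whether the out-of-range value was
// already recorded as the max at write time.
val input = HadoopInputFile.fromPath(new Path("xxx.parquet"), new Configuration())
val reader = ParquetFileReader.open(input)
try {
  for {
    block <- reader.getFooter.getBlocks.asScala
    col <- block.getColumns.asScala
    if col.getPath.toDotString == "foo"
  } println(s"rows=${block.getRowCount} encodings=${col.getEncodings} stats=${col.getStatistics}")
} finally reader.close()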
Describe the problem you faced
(This seems related to https://github.com/apache/hudi/issues/10029#issuecomment-2253533412)
We are running into a data corruption bug with Hudi ingestion into a table, which we suspect is happening at the parquet-java layer due to some interaction with Hudi. Column foo is of float type and is an enum with valid values from 0 to 5. There seems to be a bug in the parquet dictionary encoding where somehow a value of 6 was written, which is outside the 0-5 range. This is a problem because Hudi successfully commits the transaction, and then subsequent reads of the file fail (which also blocks ingestion, since upserts touch the corrupted file).
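Since foo should only ever hold 0-5 (or null, since the column is optional), a sanity scan over newly written files would catch this class of corruption before a commit makes it visible to readers. A minimal sketch, assuming a spark-shell session; note that on the already-corrupted file the scan itself throws, so a check like this only helps if it runs before the bad file is committed:

import org.apache.spark.sql.functions.col

// Count rows whose foo value falls outside the valid enum range 0-5.
val df = spark.read.format("parquet").load("xxx.parquet")
val bad = df.filter(col("foo").isNotNull && (col("foo") < 0.0f || col("foo") > 5.0f)).count()
require(bad == 0, s"found $bad rows with out-of-range foo values")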
Is there a way to remove or replace xxx.parquet without modifying the timeline? We are ok with data loss localized to this one corrupted file.
To Reproduce
Unsure
Expected behavior
Environment Description
This is run on EMR 6.10.1
Hudi version : 0.12.2-amzn-0
Spark version : 3.3.1
Hive version : 3.1.3
Hadoop version : 3.3.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : Yes
Additional context
N/A
Stacktrace
See above