
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Two-level parquet read EOF error: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [a, array] repeated int32 array = 2 at value 4 out of 4 in current page. repetition level: -1, definition level: -1 #9497

Open gaoshihang opened 9 months ago

gaoshihang commented 9 months ago

Apache Iceberg version

1.4.3 (latest release)

Query engine

Spark

Please describe the bug 🐞

We have a two-level Parquet list; the schema is shown in the attached screenshot: [image: Parquet schema]
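
Since the screenshot is not reproduced here, below is a hedged reconstruction of the schema implied by the column path [a, array] in the stack trace: the legacy two-level LIST layout, where the repeated int32 field sits directly under the outer group. The message name spark_schema is an assumption.

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class TwoLevelListSchema {
  public static void main(String[] args) {
    // Assumed two-level LIST layout inferred from the column path [a, array]:
    // the repeated element field sits directly under the outer group, with no
    // intermediate "list"/"element" level as in the three-level layout.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message spark_schema {\n"
            + "  optional group a (LIST) {\n"
            + "    repeated int32 array;\n"
            + "  }\n"
            + "}");
    System.out.println(schema);
  }
}
```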

Now, if this array is an empty array ([]) and we use the add_files procedure to add this Parquet file to a table, then querying the table throws this exception (a repro sketch follows the stack trace below):

Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [a, array] repeated int32 array = 2 at value 4 out of 4 in current page. repetition level: -1, definition level: -1
    at org.apache.iceberg.parquet.PageIterator.handleRuntimeException(PageIterator.java:220)
    at org.apache.iceberg.parquet.PageIterator.nextInteger(PageIterator.java:141)
    at org.apache.iceberg.parquet.ColumnIterator.nextInteger(ColumnIterator.java:121)
    at org.apache.iceberg.parquet.ColumnIterator$2.next(ColumnIterator.java:41)
    at org.apache.iceberg.parquet.ColumnIterator$2.next(ColumnIterator.java:38)
    at org.apache.iceberg.parquet.ParquetValueReaders$UnboxedReader.read(ParquetValueReaders.java:246)
    at org.apache.iceberg.parquet.ParquetValueReaders$RepeatedReader.read(ParquetValueReaders.java:467)
    at org.apache.iceberg.parquet.ParquetValueReaders$OptionReader.read(ParquetValueReaders.java:419)
    at org.apache.iceberg.parquet.ParquetValueReaders$StructReader.read(ParquetValueReaders.java:745)
    at org.apache.iceberg.parquet.ParquetReader$FileIterator.next(ParquetReader.java:130)
    at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:65)
    at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:49)
    at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:129)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:93)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:130)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read int
    at org.apache.parquet.column.values.plain.PlainValuesReader$IntegerPlainValuesReader.readInteger(PlainValuesReader.java:114)
    at org.apache.iceberg.parquet.PageIterator.nextInteger(PageIterator.java:139)
    ... 32 more
Caused by: java.io.EOFException
    at org.apache.parquet.bytes.SingleBufferInputStream.read(SingleBufferInputStream.java:52)
    at org.apache.parquet.bytes.LittleEndianDataInputStream.readInt(LittleEndianDataInputStream.java:347)
    at org.apache.parquet.column.values.plain.PlainValuesReader$IntegerPlainValuesReader.readInteger(PlainValuesReader.java:112)
    ... 33 more
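
For context, here is a minimal sketch of the repro described above, using the Iceberg add_files Spark procedure. The catalog name (demo), table name, and file path are hypothetical stand-ins, not values from the report.

```java
import org.apache.spark.sql.SparkSession;

public class AddFilesRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("two-level-list-repro").getOrCreate();

    // Hypothetical catalog/table/path names; the add_files call mirrors the one described above.
    spark.sql("CREATE TABLE demo.db.t (a array<int>) USING iceberg");
    spark.sql(
        "CALL demo.system.add_files("
            + "table => 'db.t', "
            + "source_table => '`parquet`.`/path/to/user_error_parquet.parquet`')");

    // Reading the table back is what triggers the ParquetDecodingException / EOFException.
    spark.sql("SELECT * FROM demo.db.t").show();
  }
}
```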

I also read the code in iceberg-parquet, and it seems like this do-while loop will never exit: [image: screenshot of the loop]
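
Since the screenshot is not reproduced here, below is a simplified, hypothetical sketch of the repetition-level-driven read loop in question, modeled loosely on the ParquetValueReaders$RepeatedReader.read frame in the trace. The LevelColumn interface and its method names are stand-ins, not the actual Iceberg API.

```java
import java.util.ArrayList;
import java.util.List;

public class RepeatedReadSketch {
  // Minimal stand-in for the column iterator API used by the real reader (hypothetical).
  interface LevelColumn {
    int currentRepetitionLevel();
    int currentDefinitionLevel();
    int nextInteger();   // reads the next value and advances to the next triple
    void advanceNull();  // advances past a level-only (null/empty-list) entry
  }

  // The loop only terminates once currentRepetitionLevel() drops back to the list's
  // repetition level or below. If the reader's max repetition/definition levels were
  // computed one too low for a two-level list (which would match the reported
  // "repetition level: -1, definition level: -1"), the definition-level check can
  // misclassify an empty list as a real element and call nextInteger() past the end
  // of the page, producing the EOFException above.
  static List<Integer> readList(LevelColumn column, int listRepetitionLevel, int elementDefinitionLevel) {
    List<Integer> result = new ArrayList<>();
    do {
      if (column.currentDefinitionLevel() > elementDefinitionLevel) {
        result.add(column.nextInteger());  // a concrete element exists at this position
      } else {
        column.advanceNull();              // empty/null list entry: no value to read
      }
    } while (column.currentRepetitionLevel() > listRepetitionLevel);
    return result;
  }
}
```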

gaoshihang commented 9 months ago

And here is the Iceberg table metadata: v8.metadata.json

gaoshihang commented 9 months ago

And here is the Parquet file we used with add_files (you need to change the .log extension to .parquet): user_error_parquet.log

mathfool commented 9 months ago

I think this is because the repetition level (rl) and definition level (dl) set during initialization are 1 below the expected values, so I will try to submit a fix.
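
For reference, Parquet's own level computation for the (assumed) two-level schema sketched earlier gives maxRepetitionLevel = 1 and maxDefinitionLevel = 2 for the [a, array] column, so a reader initialized one below those values would line up with the "-1" levels in the error. A minimal check, assuming that schema:

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class LevelCheck {
  public static void main(String[] args) {
    // Assumed two-level schema, same as sketched earlier in the thread.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message spark_schema {\n"
            + "  optional group a (LIST) {\n"
            + "    repeated int32 array;\n"
            + "  }\n"
            + "}");

    // Parquet derives these from the schema: optional a and repeated array each add
    // one definition level; only repeated array adds a repetition level.
    System.out.println(schema.getMaxRepetitionLevel("a", "array"));  // 1
    System.out.println(schema.getMaxDefinitionLevel("a", "array"));  // 2
  }
}
```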

github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.