
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Two-level parquet read EOF error: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [a, array] repeated int32 array = 2 at value 4 out of 4 in current page. repetition level: -1, definition level: -1 #9497

Open gaoshihang opened 9 months ago

gaoshihang commented 9 months ago

Apache Iceberg version

1.4.3 (latest release)

Query engine

Spark

Please describe the bug 🐞

We have a two-level Parquet list; the schema is shown in the attached screenshot: [image: Parquet schema]
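
Since the screenshot is not reproduced here, below is a hedged reconstruction of the schema implied by the column path [a, array] in the stack trace: the legacy two-level LIST layout, where the repeated int32 field sits directly under the outer group. The message name spark_schema is an assumption.

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class TwoLevelListSchema {
  public static void main(String[] args) {
    // Assumed two-level LIST layout inferred from the column path [a, array]:
    // the repeated element field sits directly under the outer group, with no
    // intermediate "list"/"element" level as in the three-level layout.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message spark_schema {\n"
            + "  optional group a (LIST) {\n"
            + "    repeated int32 array;\n"
            + "  }\n"
            + "}");
    System.out.println(schema);
  }
}
```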

Now, if this array is an empty array ([]) and we use the add_files procedure to add this Parquet file to a table, then querying the table throws this exception (a repro sketch follows the stack trace below):

Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [a, array] repeated int32 array = 2 at value 4 out of 4 in current page. repetition level: -1, definition level: -1
    at org.apache.iceberg.parquet.PageIterator.handleRuntimeException(PageIterator.java:220)
    at org.apache.iceberg.parquet.PageIterator.nextInteger(PageIterator.java:141)
    at org.apache.iceberg.parquet.ColumnIterator.nextInteger(ColumnIterator.java:121)
    at org.apache.iceberg.parquet.ColumnIterator$2.next(ColumnIterator.java:41)
    at org.apache.iceberg.parquet.ColumnIterator$2.next(ColumnIterator.java:38)
    at org.apache.iceberg.parquet.ParquetValueReaders$UnboxedReader.read(ParquetValueReaders.java:246)
    at org.apache.iceberg.parquet.ParquetValueReaders$RepeatedReader.read(ParquetValueReaders.java:467)
    at org.apache.iceberg.parquet.ParquetValueReaders$OptionReader.read(ParquetValueReaders.java:419)
    at org.apache.iceberg.parquet.ParquetValueReaders$StructReader.read(ParquetValueReaders.java:745)
    at org.apache.iceberg.parquet.ParquetReader$FileIterator.next(ParquetReader.java:130)
    at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:65)
    at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:49)
    at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:129)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:93)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:130)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read int
    at org.apache.parquet.column.values.plain.PlainValuesReader$IntegerPlainValuesReader.readInteger(PlainValuesReader.java:114)
    at org.apache.iceberg.parquet.PageIterator.nextInteger(PageIterator.java:139)
    ... 32 more
Caused by: java.io.EOFException
    at org.apache.parquet.bytes.SingleBufferInputStream.read(SingleBufferInputStream.java:52)
    at org.apache.parquet.bytes.LittleEndianDataInputStream.readInt(LittleEndianDataInputStream.java:347)
    at org.apache.parquet.column.values.plain.PlainValuesReader$IntegerPlainValuesReader.readInteger(PlainValuesReader.java:112)
    ... 33 more
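
For context, here is a minimal sketch of the repro described above, using the Iceberg add_files Spark procedure. The catalog name (demo), table name, and file path are hypothetical stand-ins, not values from the report.

```java
import org.apache.spark.sql.SparkSession;

public class AddFilesRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("two-level-list-repro").getOrCreate();

    // Hypothetical catalog/table/path names; the add_files call mirrors the one described above.
    spark.sql("CREATE TABLE demo.db.t (a array<int>) USING iceberg");
    spark.sql(
        "CALL demo.system.add_files("
            + "table => 'db.t', "
            + "source_table => '`parquet`.`/path/to/user_error_parquet.parquet`')");

    // Reading the table back is what triggers the ParquetDecodingException / EOFException.
    spark.sql("SELECT * FROM demo.db.t").show();
  }
}
```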

I also read the code in iceberg-parquet, and it seems like this do-while loop will never exit: [image: screenshot of the loop]
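
Since the screenshot is not reproduced here, below is a simplified, hypothetical sketch of the repetition-level-driven read loop in question, modeled loosely on the ParquetValueReaders$RepeatedReader.read frame in the trace. The LevelColumn interface and its method names are stand-ins, not the actual Iceberg API.

```java
import java.util.ArrayList;
import java.util.List;

public class RepeatedReadSketch {
  // Minimal stand-in for the column iterator API used by the real reader (hypothetical).
  interface LevelColumn {
    int currentRepetitionLevel();
    int currentDefinitionLevel();
    int nextInteger();   // reads the next value and advances to the next triple
    void advanceNull();  // advances past a level-only (null/empty-list) entry
  }

  // The loop only terminates once currentRepetitionLevel() drops back to the list's
  // repetition level or below. If the reader's max repetition/definition levels were
  // computed one too low for a two-level list (which would match the reported
  // "repetition level: -1, definition level: -1"), the definition-level check can
  // misclassify an empty list as a real element and call nextInteger() past the end
  // of the page, producing the EOFException above.
  static List<Integer> readList(LevelColumn column, int listRepetitionLevel, int elementDefinitionLevel) {
    List<Integer> result = new ArrayList<>();
    do {
      if (column.currentDefinitionLevel() > elementDefinitionLevel) {
        result.add(column.nextInteger());  // a concrete element exists at this position
      } else {
        column.advanceNull();              // empty/null list entry: no value to read
      }
    } while (column.currentRepetitionLevel() > listRepetitionLevel);
    return result;
  }
}
```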

gaoshihang commented 9 months ago

And here is the Iceberg table metadata: v8.metadata.json

gaoshihang commented 9 months ago

And here is the Parquet file we used with add_files (you need to change the .log extension to .parquet): user_error_parquet.log

mathfool commented 9 months ago

I think this is because the repetition level (rl) and definition level (dl) set during initialization are 1 below the expected values, so I will try to submit a fix.
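
For reference, Parquet's own level computation for the (assumed) two-level schema sketched earlier gives maxRepetitionLevel = 1 and maxDefinitionLevel = 2 for the [a, array] column, so a reader initialized one below those values would line up with the "-1" levels in the error. A minimal check, assuming that schema:

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class LevelCheck {
  public static void main(String[] args) {
    // Assumed two-level schema, same as sketched earlier in the thread.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message spark_schema {\n"
            + "  optional group a (LIST) {\n"
            + "    repeated int32 array;\n"
            + "  }\n"
            + "}");

    // Parquet derives these from the schema: optional a and repeated array each add
    // one definition level; only repeated array adds a repetition level.
    System.out.println(schema.getMaxRepetitionLevel("a", "array"));  // 1
    System.out.println(schema.getMaxDefinitionLevel("a", "array"));  // 2
  }
}
```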

github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.