apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

ClassCastException possible in DeltaByteArrayReader after PARQUET-2431 #3013

Closed bwjoh closed 1 month ago

bwjoh commented 2 months ago

Describe the bug, including details regarding any error messages, version, and platform.

I noticed this when upgrading from 1.13.1 to 1.14.1:

java.lang.ClassCastException: class org.apache.parquet.column.values.dictionary.DictionaryValuesReader cannot be cast to class org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader (org.apache.parquet.column.values.dictionary.DictionaryValuesReader and org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader are in unnamed module of loader 'app')
    at org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader.setPreviousReader(DeltaByteArrayReader.java:92)
    at org.apache.parquet.column.impl.ColumnReaderBase.initDataReader(ColumnReaderBase.java:734)
    at org.apache.parquet.column.impl.ColumnReaderBase.readPageV2(ColumnReaderBase.java:766)
    at org.apache.parquet.column.impl.ColumnReaderBase.access$400(ColumnReaderBase.java:56)
    at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:695)
    at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:686)
    at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:232)
    at org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:686)
    at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:660)
    at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:802)
    at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
    at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:427)

This appears to be due to PARQUET-2431 - https://github.com/apache/parquet-java/pull/1274/files#diff-362b7d44b24283c1bb1f6ca3e124cb72706a33ed96d86b58bf3339f20aafb4e9R732

Looking into how my code hit this, it seems that CorruptDeltaByteArrays.requiresSequentialReads was previously doing, in effect, the dataColumn instanceof RequiresPreviousReader check: CorruptDeltaByteArrays.requiresSequentialReads can only return true when encoding == Encoding.DELTA_BYTE_ARRAY, and org.apache.parquet.column.values.RequiresPreviousReader is only implemented by the *DeltaByteArrayReader classes.

Without a check that previousReader instanceof RequiresPreviousReader, the ClassCastException above is possible.
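
To illustrate the failure mode, here is a minimal, self-contained sketch. It does not use the parquet-java classes: ValuesReader, RequiresPreviousReader, DictionaryReaderStub and DeltaReaderStub are hypothetical stand-ins named after the types in the stack trace, and initUnguarded / initGuarded are made-up helpers, not the actual ColumnReaderBase code. It only shows why checking the new page's reader alone is not enough when the previous page used a different encoding.

```java
public class PreviousReaderGuardSketch {

    // Hypothetical stand-ins for the parquet-java types named in the report.
    interface ValuesReader {}

    interface RequiresPreviousReader {
        void setPreviousReader(ValuesReader reader);
    }

    // Plays the role of a dictionary-encoded page reader.
    static class DictionaryReaderStub implements ValuesReader {}

    // Plays the role of a DELTA_BYTE_ARRAY page reader.
    static class DeltaReaderStub implements ValuesReader, RequiresPreviousReader {
        String previous = ""; // assumed state carried over between pages

        @Override
        public void setPreviousReader(ValuesReader reader) {
            if (reader != null) {
                // Unchecked downcast: throws ClassCastException when the
                // previous page was read by a different kind of reader.
                this.previous = ((DeltaReaderStub) reader).previous;
            }
        }
    }

    // Checks only the reader of the new page.
    static void initUnguarded(ValuesReader dataColumn, ValuesReader previousReader) {
        if (dataColumn instanceof RequiresPreviousReader) {
            ((RequiresPreviousReader) dataColumn).setPreviousReader(previousReader);
        }
    }

    // Also checks the reader of the previous page before handing it over.
    static void initGuarded(ValuesReader dataColumn, ValuesReader previousReader) {
        if (dataColumn instanceof RequiresPreviousReader
                && previousReader instanceof RequiresPreviousReader) {
            ((RequiresPreviousReader) dataColumn).setPreviousReader(previousReader);
        }
    }

    // Runs the action and reports whether it raised a ClassCastException.
    static boolean throwsClassCast(Runnable action) {
        try {
            action.run();
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A dictionary-encoded page followed by a delta-encoded page.
        boolean unguarded = throwsClassCast(
                () -> initUnguarded(new DeltaReaderStub(), new DictionaryReaderStub()));
        boolean guarded = throwsClassCast(
                () -> initGuarded(new DeltaReaderStub(), new DictionaryReaderStub()));
        System.out.println("unguarded throws: " + unguarded);
        System.out.println("guarded throws: " + guarded);
    }
}
```

Whether the real fix belongs in the caller (as in initGuarded) or inside setPreviousReader itself is a design choice for the maintainers; the sketch only shows that some instanceof check on the previous reader is needed.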

This is more likely to happen when using org.apache.parquet.io.ColumnIOFactory#ColumnIOFactory() to read files without createdBy. In my case, I was able to fix this by passing createdBy, since I know that all of my Parquet files were written after PARQUET-246, which prevents CorruptDeltaByteArrays.requiresSequentialReads from returning true:

val reader: ParquetFileReader = ...
val fileMetadata = reader.getFooter.getFileMetaData
val createdBy = fileMetadata.getCreatedBy
val columnIO: MessageColumnIO = new ColumnIOFactory(createdBy)...

wgtmac commented 2 months ago

Thanks for reporting the bug! Is it possible to provide a file that can reproduce this issue?

cc @gszadovszky this issue seems to be caused by a recent refactoring commit.

gszadovszky commented 2 months ago

Thanks, @bwjoh. It seems I overlooked how this part worked. The code is not very clear, unfortunately. Also, it seems we are lacking a unit test for this scenario. Would you like to contribute a fix for this one?