apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.93k stars 979 forks source link

DRILL-8458: Use correct size of definition level bytes slice when reading Parquet v2 data page #2838

Closed handmadecode closed 11 months ago

handmadecode commented 11 months ago

DRILL-8458: Use correct size of definition level bytes slice when reading Parquet data page

Description

The byte buffer slice used for the definition level bytes when reading a Parquet v2 data page is sized correctly (in the call to ByteBuffer::limit)

Documentation

Bugfix only, no documentation changes

Testing

Unit test added in new test class org.apache.drill.exec.store.parquet.TestRepeatedColumn

handmadecode commented 11 months ago

Argh, a basic buffer arithmetic bug by yours truly. I guess it's remained undetected so far because Parquet v2 is still uncommon the wild. And because of insufficient Parquet v2 test coverage.

I guess we've all caused our fair share of those bugs ;-)

Thank you very much for this fix which looks great. Would you mind seeing if you can relocate the test and its data? We've got TestParquetComplex already and also some Parquet v2 test files for which a naming pattern has been started, e.g.


exec/java-exec/src/test/resources/parquet/parquet_v2_logical_types_simple.parquet

Sure, how about renaming the test file to exec/java-exec/src/test/resources/parquet/parquet_v2_large_repetition_levels.parquet?

Would you prefer if the test was moved to an existing test class, e.g. TestParquetComplex?

jnturton commented 11 months ago

A test data file name like that would be great thank you. Yes, please may you move the test itself inside TestParquetComplex and see if you can integrate your test data file generation code into ParquetSimpleTestFileGenerator? It already takes a Parquet version number parameter IIRC (up until now, only ever set to v1). I think we may need to come back and organise our tests better for Parquet v2, but I don't want to saddle this PR with that reorganisation.

jnturton commented 11 months ago

P.S. if you find the test data generator only fits simply into ParquetSimpleTestFileGenerator as a standalone method that is unrelated to what's there already then I think that's fine. I'm pointing to things I know are there and related but not going so far as to ask "How would this look in code?"

handmadecode commented 11 months ago

PR updated with refactored test code