apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0

DRILL-8421: Parquet microsecond columns #2793

Closed handmadecode closed 1 year ago

handmadecode commented 1 year ago

DRILL-8421: Truncate parquet microsecond columns

Description

The metadata min and max values of Parquet microsecond columns are now truncated to milliseconds, the time unit expected by the initial file pruning during filtering. In addition, TIME_MICROS columns are now read as 64-bit values before being truncated to 32-bit millisecond values. Previously they were read as 32-bit values, so any value greater than Integer.MAX_VALUE was read incorrectly.

The second fix also addresses DRILL-8423.
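
To illustrate why the read width matters, here is a minimal Java sketch of the two conversion orders described above. It is not the code from this PR: the class and method names are hypothetical, and plain integer division is assumed for the truncation. A TIME_MICROS value counts microseconds since midnight, so it can reach 86,400,000,000, well above Integer.MAX_VALUE (2,147,483,647). Narrowing to 32 bits before dividing therefore corrupts any time of day later than roughly 00:35:47.

```java
// Hypothetical illustration only; these names do not come from the Drill code base.
final class MicrosTruncation {

    private static final long MICROS_PER_MILLI = 1_000L;

    // Keep the full 64-bit microsecond value, divide, then narrow.
    // Milliseconds of day never exceed 86,400,000, so the final cast is safe
    // for valid TIME_MICROS values.
    static int timeMicrosToMillis(long micros) {
        return (int) (micros / MICROS_PER_MILLI);
    }

    // Faulty order: narrowing to 32 bits first discards the high bits,
    // so large microsecond values produce a wrong result.
    static int narrowThenDivide(long micros) {
        return ((int) micros) / 1_000;
    }

    // Min/max statistics of a microsecond timestamp column stay 64-bit;
    // only the unit changes (assumption: simple truncating division).
    static long timestampMicrosToMillis(long micros) {
        return micros / MICROS_PER_MILLI;
    }

    public static void main(String[] args) {
        long noonMicros = 12L * 60 * 60 * 1_000_000; // 12:00:00 as microseconds of day
        System.out.println("64-bit read: " + timeMicrosToMillis(noonMicros));
        System.out.println("32-bit read: " + narrowThenDivide(noonMicros));
    }
}
```

For 12:00:00 the 64-bit path yields 43,200,000 ms, while the narrow-first path yields an unrelated value, which is the kind of error the description attributes to the earlier 32-bit read.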

Documentation

Bugfix only, no documentation changes

Testing

Unit tests added in new test class org.apache.drill.exec.store.parquet.TestMicrosecondColumns.

cgivre commented 1 year ago

@handmadecode Thanks for the contribution and welcome to Drill! Would you mind rebasing once DRILL-8424 is merged? There are some CI issues which will be fixed by that PR. Thanks!

handmadecode commented 1 year ago

@cgivre thanks, happy to contribute. I will rebase when 8424 is merged.

jnturton commented 1 year ago

> Thanks for the contribution and welcome to Drill! Would you mind rebasing once https://github.com/apache/drill/pull/2794 is merged?

Heh, I just came here to type exactly this. I reviewed the code changes and they look great, so really we just need the CI run after rebasing.

cgivre commented 1 year ago

@handmadecode https://github.com/apache/drill/pull/2794 has been merged.

handmadecode commented 1 year ago

@jnturton Happy to help!