apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.95k stars 979 forks source link

DRILL-8492: Read Parquet Microsecond Columns as Bigint #2907

Closed handmadecode closed 3 months ago

handmadecode commented 7 months ago

DRILL-8492: Read parquet microsecond columns as bigint

Description

Two new configuration options, store.parquet.reader.time_micros_as_int64 and store.parquet.reader.timestamp_micros_as_int64, have been added.

When reading Parquet columns of type time_micros and timestamp_micros, the returned value will be the original 64-bit integer value instead of a timestamp value truncated to milliseconds if the configuration option for the column type is true.

The implementation closely follows how the existing option store.parquet.reader.int96_as_timestamp is handled.

Both new options have the default value false to preserve existing behaviour as default.

Documentation

The new options can be set in "drill-module.conf", and also be set through the Web UI's Options page.

Testing

Unit tests have been added to org.apache.drill.exec.store.parquet.TestMicrosecondColumns.

It could be worth noting that microsecond columns can be compared to time or timestamp literals even if the option for reading the column's value as a 64-bit value is true:

SELECT * FROM file.parquet WHERE TO_TIME(time_micros_column/1000) > '09:32:58.174'; and SELECT * FROM file.parquet WHERE TO_TIMESTAMP(timestamp_micros_column/1000000) > '2024-04-26 13:17:41.421';

handmadecode commented 6 months ago

LGTM +1 Thanks for the contribution! Can you please update the documentation for the Parquet reader to include this? Otherwise looks good!

Happy to contribute!

Do you mean the documentation in the drill-site repo? (https://github.com/apache/drill-site/blob/master/_docs/en/data-sources-and-file-formats/040-parquet-format.md)

cgivre commented 6 months ago

LGTM +1 Thanks for the contribution! Can you please update the documentation for the Parquet reader to include this? Otherwise looks good!

Happy to contribute!

Do you mean the documentation in the drill-site repo? (https://github.com/apache/drill-site/blob/master/_docs/en/data-sources-and-file-formats/040-parquet-format.md)

That's the one!

handmadecode commented 6 months ago

Awesome work. I can backport this too because you've left default behaviour unchanged (and it's self contained). My only question is about ParquetReaderConfig -- you've added switches to the global config system so I don't think we also need them to appear in the storage plugin config?

Thanks.

The reason I added the configuration properties to ParquetReaderConfig is that they are needed in org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector::addColumnMetadata (to decide whether the metadata will be compared to timestamps with millisecond precision or to raw microsecond values). I couldn't find an easy way to access the global configuration values from there, but the ParquetReaderConfig instance was already available.

However, I could very well have overlooked some way to access the global configuration, and I'd be grateful for any pointers.

jnturton commented 6 months ago

However, I could very well have overlooked some way to access the global configuration, and I'd be grateful for any pointers.

We should existing find examples in the Parquet format plugin. E.g. the "old" reader is affected by the option store.parquet.reader.pagereader.async.

Follow up: I see use of the FragmentContext class for accessing config options in the old Parquet reader, perhaps it's a good a vehicle...

handmadecode commented 6 months ago

Follow up: I see use of the FragmentContext class for accessing config options in the old Parquet reader, perhaps it's a good a vehicle...

FragmentContext is used to access the new config options where an instance already was available, i.e. in ColumnReaderFactory, ParquetToDrillTypeConverter, and DrillParquetGroupConverter.

FileMetadataCollector doesn't have access to a FragmentContext, only to a ParquetReaderConfig. I can only find a FragmentContext in one of the two call paths to FileMetadataCollector::addColumnMetadata, so trying to inject a FragmentContext into FileMetadataCollector will probably have an impact on quite a few other classes.

jnturton commented 6 months ago

It's always bugged me that we don't have a global way of accessing at least one of DrillbitContext, QueryContext, FragmentContext or just OptionManager. We hardly want to have to spray these things through APIs everywhere in Drill. I'll take a look at whether something can be done...

cgivre commented 4 months ago

@jnturton @handmadecode IMHO, would it be better to avoid using global config options and just put the configuration options in the parquetConfig file? This would be more consistent with the other file types that Drill can read.

cgivre commented 4 months ago

@jnturton Can we merge this?