Closed handmadecode closed 3 months ago
LGTM +1 Thanks for the contribution! Can you please update the documentation for the Parquet reader to include this? Otherwise looks good!
Happy to contribute!
Do you mean the documentation in the drill-site
repo? (https://github.com/apache/drill-site/blob/master/_docs/en/data-sources-and-file-formats/040-parquet-format.md)
LGTM +1 Thanks for the contribution! Can you please update the documentation for the Parquet reader to include this? Otherwise looks good!
Happy to contribute!
Do you mean the documentation in the
drill-site
repo? (https://github.com/apache/drill-site/blob/master/_docs/en/data-sources-and-file-formats/040-parquet-format.md)
That's the one!
Awesome work. I can backport this too because you've left default behaviour unchanged (and it's self contained). My only question is about ParquetReaderConfig -- you've added switches to the global config system so I don't think we also need them to appear in the storage plugin config?
Thanks.
The reason I added the configuration properties to ParquetReaderConfig
is that they are needed in org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector::addColumnMetadata
(to decide whether the metadata will be compared to timestamps with millisecond precision or to raw microsecond values). I couldn't find an easy way to access the global configuration values from there, but the ParquetReaderConfig
instance was already available.
However, I could very well have overlooked some way to access the global configuration, and I'd be grateful for any pointers.
However, I could very well have overlooked some way to access the global configuration, and I'd be grateful for any pointers.
We should existing find examples in the Parquet format plugin. E.g. the "old" reader is affected by the option store.parquet.reader.pagereader.async.
Follow up: I see use of the FragmentContext class for accessing config options in the old Parquet reader, perhaps it's a good a vehicle...
Follow up: I see use of the FragmentContext class for accessing config options in the old Parquet reader, perhaps it's a good a vehicle...
FragmentContext
is used to access the new config options where an instance already was available, i.e. in ColumnReaderFactory
, ParquetToDrillTypeConverter
, and DrillParquetGroupConverter
.
FileMetadataCollector
doesn't have access to a FragmentContext
, only to a ParquetReaderConfig
.
I can only find a FragmentContext in one of the two call paths to FileMetadataCollector::addColumnMetadata
, so trying to inject a FragmentContext
into FileMetadataCollector
will probably have an impact on quite a few other classes.
It's always bugged me that we don't have a global way of accessing at least one of DrillbitContext, QueryContext, FragmentContext or just OptionManager. We hardly want to have to spray these things through APIs everywhere in Drill. I'll take a look at whether something can be done...
@jnturton @handmadecode
IMHO, would it be better to avoid using global config options and just put the configuration options in the parquetConfig
file? This would be more consistent with the other file types that Drill can read.
@jnturton Can we merge this?
DRILL-8492: Read parquet microsecond columns as bigint
Description
Two new configuration options,
store.parquet.reader.time_micros_as_int64
andstore.parquet.reader.timestamp_micros_as_int64,
have been added.When reading Parquet columns of type
time_micros
andtimestamp_micros
, the returned value will be the original 64-bit integer value instead of a timestamp value truncated to milliseconds if the configuration option for the column type istrue
.The implementation closely follows how the existing option
store.parquet.reader.int96_as_timestamp
is handled.Both new options have the default value
false
to preserve existing behaviour as default.Documentation
The new options can be set in "drill-module.conf", and also be set through the Web UI's Options page.
Testing
Unit tests have been added to
org.apache.drill.exec.store.parquet.TestMicrosecondColumns
.It could be worth noting that microsecond columns can be compared to time or timestamp literals even if the option for reading the column's value as a 64-bit value is true:
SELECT * FROM file.parquet WHERE TO_TIME(time_micros_column/1000) > '09:32:58.174';
andSELECT * FROM file.parquet WHERE TO_TIMESTAMP(timestamp_micros_column/1000000) > '2024-04-26 13:17:41.421';