Open yingsu00 opened 1 week ago
Do we need encodings here? It's better to decouple encodings from column readers and let decoders to handle encodings. So ideally it should be (logical type, physical type) -> column reader encoding -> decoder
I think when we try to read a timestamp int64 column from Parquet, we should make sure velox type fileType->type()->kind()
to be TypeKind::TIMESTAMP
, similar to timestamp int96. I'm still in the middle of figuring it out. This was found when reusing timestamp int96 unit test utilities for #11530.
In other words, we do not need to check logicalType
at all.
Do we need encodings here? It's better to decouple encodings from column readers and let decoders to handle encodings. So ideally it should be (logical type, physical type) -> column reader encoding -> decoder
Agree that (logical type, physical type) -> column reader But decoders will be better managed by the combination of the three. For example, we could be having the following decoders:
This is what I meant the ColumnReaders shall be decided by the triplets, since the ColumnReader will need to decide what decoder it uses. I intend to refactor the current Parquet decoders to make it easier to maintain in the future.
I think when we try to read a timestamp int64 column from Parquet, we should make sure velox type
fileType->type()->kind()
to beTypeKind::TIMESTAMP
, similar to timestamp int96. I'm still in the middle of figuring it out. This was found when reusing timestamp int96 unit test utilities for #11530.In other words, we do not need to check
logicalType
at all.
The fileType is of ParquetTypeWithId, and its type() function returns the type_
, which is the VeloxType. Normally VeloxType should be the same as Parquet Logical type, but it is not 100% so. See https://github.com/facebookincubator/velox/issues/9767 Note that there is no Parquet logical type for Velox Type Timestamp with the storage format INT96. In ParquetTypeWithId there are also parquetType_
(Parquet storage type) and logicalType_
(Parquet logical type) available. When the parquetType_
(Parquet storage type) is INT64 and logicalType_
(Parquet logical type) is Timestamp, we should be creating a TimestampColumnReader. So I would say the type of ColumnReader should be decided by the Parquet logical type and VeloxType, to correct the above tuple:
( Velox type, Parquet logical type) -> Parquet column reader
As to the bug you're working on, both VeloxType and Parquet logical type should be Timestamp because the storage type is INT64. If you are seeing VeloxType is not Timestamp, then there is something wrong. Let me know why that's the case and we will see if we need to fix the problem in this issue. Maybe it doesn't need change just for this case, but overall speaking we need to check both Velox type and Parquet logical type to decide what ColumnReader to create.
Agree that (logical type, physical type) -> column reader But decoders will be better managed by the combination of the three. For example, we could be having the following decoders:
- Int32ShortDecimalPlainValuesDecoder. (storage type: Int32, logical type: ShortDecimal, encoding: plain)
- Int32ShortDecimalDeltaValuesDecoder: (storage type: Int32, logical type: ShortDecimal, encoding: delta)
- FixedLenByteArrayShortDecimalPlainValuesDecoder: (storage type: FixedLenByteArray, logical type: ShortDecimal, encoding: plain)
- FixedLenByteArrayUuidPlainValuesDecoderL: (storage type: FixedLenByteArray, logical type: Uuid, encoding: plain)
This is what I meant the ColumnReaders shall be decided by the triplets, since the ColumnReader will need to decide what decoder it uses. I intend to refactor the current Parquet decoders to make it easier to maintain in the future.
Make sense, just we need to be careful that if we are going to create a lot of parquet decoders, we need to design it so that the common logic for fast/slow path is shared (see nimble decoders below for example). Otherwise it will soon be unmaintainable (especially for fast path) and we will need to rewrite the parquet decoders again.
https://github.com/facebookincubator/nimble/blob/2ba00420da5af11bed6fe4ceeb38f5b333bab8e8/dwio/nimble/encodings/Encoding.h#L366 https://github.com/facebookincubator/nimble/blob/2ba00420da5af11bed6fe4ceeb38f5b333bab8e8/dwio/nimble/encodings/Encoding.h#L411
Agree that (logical type, physical type) -> column reader But decoders will be better managed by the combination of the three. For example, we could be having the following decoders:
- Int32ShortDecimalPlainValuesDecoder. (storage type: Int32, logical type: ShortDecimal, encoding: plain)
- Int32ShortDecimalDeltaValuesDecoder: (storage type: Int32, logical type: ShortDecimal, encoding: delta)
- FixedLenByteArrayShortDecimalPlainValuesDecoder: (storage type: FixedLenByteArray, logical type: ShortDecimal, encoding: plain)
- FixedLenByteArrayUuidPlainValuesDecoderL: (storage type: FixedLenByteArray, logical type: Uuid, encoding: plain)
This is what I meant the ColumnReaders shall be decided by the triplets, since the ColumnReader will need to decide what decoder it uses. I intend to refactor the current Parquet decoders to make it easier to maintain in the future.
Make sense, just we need to be careful that if we are going to create a lot of parquet decoders, we need to design it so that the common logic for fast/slow path is shared (see nimble decoders below for example). Otherwise it will soon be unmaintainable (especially for fast path) and we will need to rewrite the parquet decoders again.
https://github.com/facebookincubator/nimble/blob/2ba00420da5af11bed6fe4ceeb38f5b333bab8e8/dwio/nimble/encodings/Encoding.h#L366 https://github.com/facebookincubator/nimble/blob/2ba00420da5af11bed6fe4ceeb38f5b333bab8e8/dwio/nimble/encodings/Encoding.h#L411
Yeah for sure. Thanks for the links!
Bug description
Timestamp is usually represented as int64 in the Parquet files, but ParquetReader creates a IntegerColumnReader instead of TimeStamp column reader even though LogicalType sets to TimeStamp. This is because ParquetColumnReader::build() only checks the file type when deciding what column reader it should build:
Note that the Parquet column reader kinds should be combinations of <logical type, file(storage) type, encodings>. The current implementation only checks file(storage) type, which makes the decoders code mixed up and not well structured. A better way is to organize all decoders and ColumnReaders based on the <logical type, file(storage) type, encodings> triplets.
System information
NA
Relevant logs