facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.54k stars 1.17k forks source link

ParquetColumnReader should be created based on <logical type, file(storage) type, encodings> #11600

Open yingsu00 opened 1 week ago

yingsu00 commented 1 week ago

Bug description

Timestamp is usually represented as int64 in the Parquet files, but ParquetReader creates a IntegerColumnReader instead of TimeStamp column reader even though LogicalType sets to TimeStamp. This is because ParquetColumnReader::build() only checks the file type when deciding what column reader it should build:

std::unique_ptr<dwio::common::SelectiveColumnReader> ParquetColumnReader::build(
    const TypePtr& requestedType,
    const std::shared_ptr<const dwio::common::TypeWithId>& fileType,
    ParquetParams& params,
    common::ScanSpec& scanSpec) {
  auto colName = scanSpec.fieldName();

  switch (fileType->type()->kind()) {.   // file type check only
    case TypeKind::INTEGER:
    case TypeKind::BIGINT:
    case TypeKind::SMALLINT:
    case TypeKind::TINYINT:
    case TypeKind::HUGEINT:
      return std::make_unique<IntegerColumnReader>(
          requestedType, fileType, params, scanSpec);

Note that the Parquet column reader kinds should be combinations of <logical type, file(storage) type, encodings>. The current implementation only checks file(storage) type, which makes the decoders code mixed up and not well structured. A better way is to organize all decoders and ColumnReaders based on the <logical type, file(storage) type, encodings> triplets.

System information

NA

Relevant logs

NA
Yuhta commented 1 week ago

Do we need encodings here? It's better to decouple encodings from column readers and let decoders to handle encodings. So ideally it should be (logical type, physical type) -> column reader encoding -> decoder

zuyu commented 1 week ago

I think when we try to read a timestamp int64 column from Parquet, we should make sure velox type fileType->type()->kind() to be TypeKind::TIMESTAMP, similar to timestamp int96. I'm still in the middle of figuring it out. This was found when reusing timestamp int96 unit test utilities for #11530.

In other words, we do not need to check logicalType at all.

yingsu00 commented 6 days ago

Do we need encodings here? It's better to decouple encodings from column readers and let decoders to handle encodings. So ideally it should be (logical type, physical type) -> column reader encoding -> decoder

Agree that (logical type, physical type) -> column reader But decoders will be better managed by the combination of the three. For example, we could be having the following decoders:

This is what I meant the ColumnReaders shall be decided by the triplets, since the ColumnReader will need to decide what decoder it uses. I intend to refactor the current Parquet decoders to make it easier to maintain in the future.

yingsu00 commented 6 days ago

I think when we try to read a timestamp int64 column from Parquet, we should make sure velox type fileType->type()->kind() to be TypeKind::TIMESTAMP, similar to timestamp int96. I'm still in the middle of figuring it out. This was found when reusing timestamp int96 unit test utilities for #11530.

In other words, we do not need to check logicalType at all.

The fileType is of ParquetTypeWithId, and its type() function returns the type_, which is the VeloxType. Normally VeloxType should be the same as Parquet Logical type, but it is not 100% so. See https://github.com/facebookincubator/velox/issues/9767 Note that there is no Parquet logical type for Velox Type Timestamp with the storage format INT96. In ParquetTypeWithId there are also parquetType_(Parquet storage type) and logicalType_(Parquet logical type) available. When the parquetType_(Parquet storage type) is INT64 and logicalType_(Parquet logical type) is Timestamp, we should be creating a TimestampColumnReader. So I would say the type of ColumnReader should be decided by the Parquet logical type and VeloxType, to correct the above tuple:

( Velox type, Parquet logical type) -> Parquet column reader

As to the bug you're working on, both VeloxType and Parquet logical type should be Timestamp because the storage type is INT64. If you are seeing VeloxType is not Timestamp, then there is something wrong. Let me know why that's the case and we will see if we need to fix the problem in this issue. Maybe it doesn't need change just for this case, but overall speaking we need to check both Velox type and Parquet logical type to decide what ColumnReader to create.

Yuhta commented 6 days ago

Agree that (logical type, physical type) -> column reader But decoders will be better managed by the combination of the three. For example, we could be having the following decoders:

  • Int32ShortDecimalPlainValuesDecoder. (storage type: Int32, logical type: ShortDecimal, encoding: plain)
  • Int32ShortDecimalDeltaValuesDecoder: (storage type: Int32, logical type: ShortDecimal, encoding: delta)
  • FixedLenByteArrayShortDecimalPlainValuesDecoder: (storage type: FixedLenByteArray, logical type: ShortDecimal, encoding: plain)
  • FixedLenByteArrayUuidPlainValuesDecoderL: (storage type: FixedLenByteArray, logical type: Uuid, encoding: plain)

This is what I meant the ColumnReaders shall be decided by the triplets, since the ColumnReader will need to decide what decoder it uses. I intend to refactor the current Parquet decoders to make it easier to maintain in the future.

Make sense, just we need to be careful that if we are going to create a lot of parquet decoders, we need to design it so that the common logic for fast/slow path is shared (see nimble decoders below for example). Otherwise it will soon be unmaintainable (especially for fast path) and we will need to rewrite the parquet decoders again.

https://github.com/facebookincubator/nimble/blob/2ba00420da5af11bed6fe4ceeb38f5b333bab8e8/dwio/nimble/encodings/Encoding.h#L366 https://github.com/facebookincubator/nimble/blob/2ba00420da5af11bed6fe4ceeb38f5b333bab8e8/dwio/nimble/encodings/Encoding.h#L411

yingsu00 commented 5 days ago

Agree that (logical type, physical type) -> column reader But decoders will be better managed by the combination of the three. For example, we could be having the following decoders:

  • Int32ShortDecimalPlainValuesDecoder. (storage type: Int32, logical type: ShortDecimal, encoding: plain)
  • Int32ShortDecimalDeltaValuesDecoder: (storage type: Int32, logical type: ShortDecimal, encoding: delta)
  • FixedLenByteArrayShortDecimalPlainValuesDecoder: (storage type: FixedLenByteArray, logical type: ShortDecimal, encoding: plain)
  • FixedLenByteArrayUuidPlainValuesDecoderL: (storage type: FixedLenByteArray, logical type: Uuid, encoding: plain)

This is what I meant the ColumnReaders shall be decided by the triplets, since the ColumnReader will need to decide what decoder it uses. I intend to refactor the current Parquet decoders to make it easier to maintain in the future.

Make sense, just we need to be careful that if we are going to create a lot of parquet decoders, we need to design it so that the common logic for fast/slow path is shared (see nimble decoders below for example). Otherwise it will soon be unmaintainable (especially for fast path) and we will need to rewrite the parquet decoders again.

https://github.com/facebookincubator/nimble/blob/2ba00420da5af11bed6fe4ceeb38f5b333bab8e8/dwio/nimble/encodings/Encoding.h#L366 https://github.com/facebookincubator/nimble/blob/2ba00420da5af11bed6fe4ceeb38f5b333bab8e8/dwio/nimble/encodings/Encoding.h#L411

Yeah for sure. Thanks for the links!