Problem Description

When using Velox to read Parquet data for Spark, we've observed that certain data types are not represented identically between Spark and Parquet files. This discrepancy causes a runtime error when the data returned by the Parquet reader differs from what Spark expects. We've identified the following problematic mappings (Parquet type -> type Spark expects):
u8 -> i16
u16 -> i32
u32 -> i64
u64 -> decimal(20, 0)
DateType ignores the rebaseMode configuration
TimestampType ignores the rebaseMode configuration
For instance, while reading columns through Velox, Gluten creates a Velox scan node based on the format expected by Spark. However, due to the incompatible data representation, an error arises as exemplified by the following log:
This error states that a BIGINT type was returned for the field n0_0 at position 0, while a DECIMAL(20,0) was expected.
The actual Parquet type is not known at plan time, so this conversion must be done at runtime. The conversions Spark performs in ParquetVectorUpdaterFactory.java need to be honored.
Proposed Solution
As part of this enhancement, I have identified three changes required to add unsigned type support to Velox when reading from Gluten / Spark:
1. Add Parquet's LogicalType information to Velox's ParquetTypeWithId.
2. Recursively add outputType to the ScanSpec.
3. Based on Parquet's logical type and Spark's requested output type, use the appropriate conversions when creating flat vectors in the Parquet column readers.
Task List
[x] Make thrift::LogicalType a field of facebook::velox::parquet::ParquetTypeWithId. #5787