facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0

Parquet Reader Decoder Support Status #9767

Open yingsu00 opened 6 months ago

yingsu00 commented 6 months ago

Description

Normal Data Page Types

| Velox Type | Parquet LogicalType | Parquet ConvertedType | Parquet Storage Type | Supported? |
|---|---|---|---|---|
| BOOLEAN | | | BOOLEAN (1 bit) | Partial |
| Tinyint | INT(8, true) | INT_8 = 15 (deprecated) | INT32 | Y |
| Smallint | INT(16, true) | INT_16 = 16 (deprecated) | INT32 | Y |
| Integer | INT(32, true) | INT_32 = 17 (deprecated) | INT32 | Y |
| Bigint | INT(64, true) | INT_64 = 18 (deprecated) | INT64 | Y |
| Tinyint | INT(8, false) | UINT_8 = 11 (deprecated) | INT32 | Y |
| Smallint | INT(16, false) | UINT_16 = 12 (deprecated) | INT32 | Y |
| Integer | INT(32, false) | UINT_32 = 13 (deprecated) | INT32 | Y |
| Bigint | INT(64, false) | UINT_64 = 14 (deprecated) | INT64 | Y |
| Hugeint | | | FIXED_LEN_BYTE_ARRAY (len = 16) | Y |
| ShortDecimal | Decimal, 1 <= precision <= 9 | DECIMAL = 5 | INT32 | Y |
| ShortDecimal | Decimal, 1 <= precision <= 18 | DECIMAL = 5 | INT64 | Y |
| Short/LongDecimal | Decimal, precision limited by len | DECIMAL = 5 | FIXED_LEN_BYTE_ARRAY | Y |
| Short/LongDecimal | Decimal, precision unlimited | DECIMAL = 5 | BYTE_ARRAY | N |
| Real | FLOAT16 | | FIXED_LEN_BYTE_ARRAY (len = 2) | N |
| Real | | | FLOAT | Y |
| Double | | | DOUBLE | Y |
| DateType | DATE | DATE = 6 | INT32 | Y |
| | TIME(isAdjustedToUTC=True/False, unit=MILLIS) | TIME_MILLIS = 7 (deprecated) | INT32 | N |
| | TIME(isAdjustedToUTC=True/False, unit=MICROS) | TIME_MICROS = 8 (deprecated) | INT64 | N |
| | TIME(isAdjustedToUTC=True/False, unit=NANOS) | | INT64 | N |
| Timestamp | TIMESTAMP(isAdjustedToUTC=True/False, unit=MILLIS) | TIMESTAMP_MILLIS = 9 (deprecated) | INT64 | https://github.com/facebookincubator/velox/pull/8325 |
| Timestamp | TIMESTAMP(isAdjustedToUTC=True/False, unit=MICROS) | TIMESTAMP_MICROS = 10 (deprecated) | | https://github.com/facebookincubator/velox/pull/8325 |
| Timestamp | TIMESTAMP(isAdjustedToUTC=True/False, unit=NANOS) | | | https://github.com/facebookincubator/velox/pull/8325 |
| Timestamp | | | INT96 (deprecated) | N |
| CustomType::TimeStampWithTimeZone | TIMESTAMP(isAdjustedToUTC=False) | | INT64 | N |
| IntervalDayTimeType | INTERVAL | INTERVAL = 21 | FIXED_LEN_BYTE_ARRAY (len=12) | N |
| IntervalYearMonthType | INTERVAL | INTERVAL = 21 | FIXED_LEN_BYTE_ARRAY (len=12) | N |
| VARCHAR | STRING | UTF8 = 0 | BYTE_ARRAY | Y |
| VARCHAR | ENUM | ENUM = 4 | BYTE_ARRAY | Y |
| VARCHAR | UUID | | FIXED_LEN_BYTE_ARRAY (len=16) | N |
| VARBINARY | STRING | | BYTE_ARRAY | N |
| CustomType::JSON | JSON | JSON = 19 | BYTE_ARRAY | N |
| | BSON | BSON = 20 | BYTE_ARRAY | N |
| Array | LIST | LIST = 3 | | Y |
| Row | LIST | LIST = 3 | | Y |
| Map | MAP | MAP_KEY_VALUE = 2 | | Y |
| Map | MAP | MAP = 1 | | Y |
| UnknownType | UNKNOWN (always null) | | | |
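
The decimal rows above tie precision to the storage type: up to 9 digits fit INT32, up to 18 fit INT64, and a FIXED_LEN_BYTE_ARRAY is "limited by len". A minimal sketch of that arithmetic follows, assuming the usual signed two's-complement bound, where n bytes hold every value of precision floor(log10(2^(8n-1) - 1)). The function names are hypothetical and are not part of Velox or any Parquet library.

```python
def max_decimal_precision(num_bytes: int) -> int:
    """Largest precision p such that every p-digit value fits in num_bytes signed bytes."""
    if num_bytes < 1:
        raise ValueError("num_bytes must be positive")
    largest = 2 ** (8 * num_bytes - 1) - 1  # largest signed two's-complement value
    # floor(log10(largest)) equals the digit count minus one for any positive integer.
    return len(str(largest)) - 1


def smallest_storage_type(precision: int) -> str:
    """Smallest storage type from the table that can hold the given precision."""
    if precision < 1:
        raise ValueError("precision must be at least 1")
    if precision <= max_decimal_precision(4):   # INT32 is 4 bytes
        return "INT32"
    if precision <= max_decimal_precision(8):   # INT64 is 8 bytes
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, "bytes ->", max_decimal_precision(n), "digits")
```

Under this bound, a 16-byte FIXED_LEN_BYTE_ARRAY tops out at 38 digits. A writer is free to pick a wider storage type than the smallest one, so a reader should not rely on this mapping.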

Normal Data Page Encodings

| Parquet Storage Type | Parquet Encoding | Version | Supported |
|---|---|---|---|
| BOOLEAN (1 bit) | Plain (0) | 1 | Y |
| BOOLEAN (1 bit) | RLE/BP (3) | 1 | N |
| INT32 | Plain (0) | 1 | Y |
| INT32 | DELTA_BINARY_PACKED (5) | 2 | Y |
| INT64 | DELTA_BINARY_PACKED (5) | 2 | Y |
| FLOAT | Plain (0) | 1 | Y |
| FLOAT | BYTE_STREAM_SPLIT (9) | 2 | N |
| DOUBLE | Plain (0) | 1 | Y |
| DOUBLE | BYTE_STREAM_SPLIT (9) | 2 | N |
| FIXED_LEN_BYTE_ARRAY | Plain (0) | 1 | Partial for certain types |
| FIXED_LEN_BYTE_ARRAY | DELTA_BYTE_ARRAY (7) | 2 | N |
| BYTE_ARRAY | Plain (0) | 1 | Y |
| BYTE_ARRAY | DELTA_BYTE_ARRAY (7) | 2 | N |
| BYTE_ARRAY | DELTA_LENGTH_BYTE_ARRAY (6) | 2 | N |
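
For readers unfamiliar with BYTE_STREAM_SPLIT, which the table marks as unsupported for FLOAT and DOUBLE, here is a minimal sketch of the transform as the Parquet format describes it: the k-th byte of every value goes into the k-th stream, and the streams are concatenated. It is an illustration in pure Python, not Velox code; the function names are hypothetical and little-endian IEEE values are assumed.

```python
import struct


def byte_stream_split_encode(values, fmt="<f"):
    """Scatter byte k of each value into stream k, then concatenate the streams."""
    width = struct.calcsize(fmt)  # 4 for "<f" (FLOAT), 8 for "<d" (DOUBLE)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data, fmt="<f"):
    """Gather byte k of value i from position k * n + i and rebuild the values."""
    width = struct.calcsize(fmt)
    n = len(data) // width
    raw = bytes(data[k * n + i] for i in range(n) for k in range(width))
    return [struct.unpack_from(fmt, raw, i * width)[0] for i in range(n)]


if __name__ == "__main__":
    encoded = byte_stream_split_encode([1.0, 2.0])
    print(encoded.hex())
    print(byte_stream_split_decode(encoded))
```

The encoding does not shrink the data by itself. It only groups similar bytes together so that the page compressor applied afterwards works better.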

Dictionary Page Encodings

| Parquet Type | Parquet Encoding | Supported |
|---|---|---|
| BOOLEAN | Plain (0) | Y |
| INT32 | Plain (0) | Y |
| INT64 | Plain (0) | Y |
| INT96 (deprecated) | Plain (0) | N |
| FLOAT | Plain (0) | Y |
| DOUBLE | Plain (0) | Y |
| BYTE_ARRAY | Plain (0) | Y |
| FIXED_LEN_BYTE_ARRAY | Plain (0) | Y |

Repetition/Definition Levels

| Parquet Type | Parquet Encoding | Supported |
|---|---|---|
| INT32 | RLE/BP (3) | Y (needs to be updated) |
| INT32 | BIT_PACKED (4) (deprecated) | N |
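
As background for the RLE/BP (3) row, the following is a minimal sketch of decoding the RLE/bit-packing hybrid layout from the Parquet format: each run starts with a ULEB128 header whose lowest bit selects a bit-packed run (groups of 8 values) or a repeated-value run. It is not Velox's decoder, the function names are hypothetical, and it assumes any length prefix on the level data has already been stripped.

```python
def read_uleb128(buf, pos):
    """Read an unsigned LEB128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7


def decode_rle_bp_hybrid(buf, bit_width, num_values):
    """Decode num_values integers of bit_width bits from an RLE/bit-packed hybrid buffer."""
    values = []
    pos = 0
    value_bytes = (bit_width + 7) // 8
    mask = (1 << bit_width) - 1
    while len(values) < num_values:
        header, pos = read_uleb128(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values, packed LSB first.
            groups = header >> 1
            nbytes = groups * bit_width
            packed = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                values.append((packed >> (i * bit_width)) & mask)
        else:
            # RLE run: one value repeated (header >> 1) times.
            run_length = header >> 1
            value = int.from_bytes(buf[pos:pos + value_bytes], "little")
            pos += value_bytes
            values.extend([value] * run_length)
    # A bit-packed run is padded to a multiple of 8 values, so trim the tail.
    return values[:num_values]


if __name__ == "__main__":
    # One RLE run (eight 4s) followed by one bit-packed group holding 0..7, bit width 3.
    data = bytes([0x10, 0x04, 0x03, 0x88, 0xC6, 0xFA])
    print(decode_rle_bp_hybrid(data, 3, 16))
```
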

jkhaliqi commented 2 months ago

Parquet Version 2 Data Types

Parquet file created from Presto Java

The two encodings that did not show up when creating Parquet files from Presto Java were BYTE_STREAM_SPLIT (FLOAT, DOUBLE) and DELTA_LENGTH_BYTE_ARRAY (varchar, string, binary).

We could not get Spark to write Parquet files with these encodings either. With Apache Arrow, we were able to create a Parquet file that uses them by adjusting the WriterProperties, as described in this doc: https://arrow.apache.org/docs/cpp/parquet.html#writer-properties

The following table lists the Presto type, Parquet type, and Parquet encoding for a V2 Parquet table created from Presto Java:

| Presto Type | Parquet Type | Parquet Encodings |
|---|---|---|
| Boolean | BOOLEAN | RLE |
| TinyInt | INT32 | DELTA_BINARY_PACKED |
| Smallint | INT32 | DELTA_BINARY_PACKED |
| Integer | INT32 | DELTA_BINARY_PACKED |
| Bigint | INT64 | DELTA_BINARY_PACKED |
| REAL | FLOAT | PLAIN |
| DOUBLE | DOUBLE | PLAIN |
| DECIMAL | FIXED_LEN_BYTE_ARRAY | RLE_DICTIONARY |
| VARCHAR | BYTE_ARRAY | DELTA_BYTE_ARRAY |
| Char | BYTE_ARRAY | DELTA_BYTE_ARRAY |
| VarBinary | BYTE_ARRAY | DELTA_BYTE_ARRAY |
| JSON | create table tmp(json json); Query 20240829_223247_00157_hpkyz failed: No default Hive type provided for unsupported Hive type: json | From the Arrow Parquet docs: Unsupported logical types: JSON, BSON, UUID. If such a type is encountered when reading a Parquet file, the default physical type mapping is used (for example, a Parquet JSON column may be read as Arrow Binary or FixedSizeBinary). https://arrow.apache.org/docs/cpp/parquet.html#logical-types |
| Date | INT32 | DELTA_BINARY_PACKED |
| Time | create table tmp(time time); Query 20240829_223030_00153_hpkyz failed: No default Hive type provided for unsupported Hive type: time | |
| Time With Time Zone | | |
| Timestamp | INT64 | DELTA_BINARY_PACKED |
| Timestamp with timezone | | |
| Interval year to month | create table tmp(iym interval year to month); Query 20240829_223424_00163_hpkyz failed: No default Hive type provided for unsupported Hive type: interval year to month | |
| Interval day to second | create table tmp(iym interval); Query 20240829_223452_00165_hpkyz failed: line 1:18: Unknown type 'interval' for column 'iym' create table tmp(iym interval) | |
| array(integer) | INT32 | DELTA_BINARY_PACKED |
| array(boolean) | BOOLEAN | RLE |
| map(integer, integer) | INT32 | DELTA_BINARY_PACKED |
| row("f0" varbinary, "f1" timestamp) | Broke it down to what was inside -> | Broke it down to what was inside -> {"PathInSchema":["P0","F0"],"Type":"BYTE_ARRAY","Encodings":["RLE_DICTIONARY"],"CompressedSize":186422,"UncompressedSize":223747,"NumValues":1000,"CompressionCodec":"GZIP"},{"PathInSchema":["P0","F1"],"Type":"INT64","Encodings":["RLE_DICTIONARY"],"CompressedSize":2527,"UncompressedSize":2500,"NumValues":1000,"NullCount":234,"MaxValue":9197623049880936755,"MinValue":58472672228734950,"CompressionCodec":"GZIP"} |
| IPADDRESS | create table tmp(ipaddress ipaddress); Query 20240829_222932_00151_hpkyz failed: No default Hive type provided for unsupported Hive type: ipaddress | |
| IPPREFIX | create table tmp(ip ipprefix); Query 20240829_223604_00167_hpkyz failed: No default Hive type provided for unsupported Hive type: ipprefix | |
| UUID | create table tmp(u uuid); Query 20240829_223636_00168_hpkyz failed: No default Hive type provided for unsupported Hive type: uuid | From the Arrow Parquet docs: Unsupported logical types: JSON, BSON, UUID. If such a type is encountered when reading a Parquet file, the default physical type mapping is used (for example, a Parquet JSON column may be read as Arrow Binary or FixedSizeBinary). https://arrow.apache.org/docs/cpp/parquet.html#logical-types |
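
To see which of the encodings Presto Java writes fall on unsupported rows of the "Normal Data Page Encodings" table, the two tables in this issue can be cross-referenced. The sketch below transcribes both by hand. It assumes the "RLE" that Presto reports for booleans is the same encoding as "RLE/BP (3)" in the reader table, and the names `READER_SUPPORT`, `PRESTO_V2_OUTPUT`, and `reader_gaps` are hypothetical.

```python
# (Parquet storage type, encoding) -> status, copied from "Normal Data Page Encodings".
READER_SUPPORT = {
    ("BOOLEAN", "PLAIN"): "Y",
    ("BOOLEAN", "RLE"): "N",  # assumption: listed as "RLE/BP (3)" in the table
    ("INT32", "PLAIN"): "Y",
    ("INT32", "DELTA_BINARY_PACKED"): "Y",
    ("INT64", "DELTA_BINARY_PACKED"): "Y",
    ("FLOAT", "PLAIN"): "Y",
    ("FLOAT", "BYTE_STREAM_SPLIT"): "N",
    ("DOUBLE", "PLAIN"): "Y",
    ("DOUBLE", "BYTE_STREAM_SPLIT"): "N",
    ("FIXED_LEN_BYTE_ARRAY", "PLAIN"): "Partial",
    ("FIXED_LEN_BYTE_ARRAY", "DELTA_BYTE_ARRAY"): "N",
    ("BYTE_ARRAY", "PLAIN"): "Y",
    ("BYTE_ARRAY", "DELTA_BYTE_ARRAY"): "N",
    ("BYTE_ARRAY", "DELTA_LENGTH_BYTE_ARRAY"): "N",
}

# Presto type -> (Parquet type, encoding), copied from the Presto Java V2 table.
PRESTO_V2_OUTPUT = {
    "boolean": ("BOOLEAN", "RLE"),
    "tinyint": ("INT32", "DELTA_BINARY_PACKED"),
    "smallint": ("INT32", "DELTA_BINARY_PACKED"),
    "integer": ("INT32", "DELTA_BINARY_PACKED"),
    "bigint": ("INT64", "DELTA_BINARY_PACKED"),
    "real": ("FLOAT", "PLAIN"),
    "double": ("DOUBLE", "PLAIN"),
    "decimal": ("FIXED_LEN_BYTE_ARRAY", "RLE_DICTIONARY"),
    "varchar": ("BYTE_ARRAY", "DELTA_BYTE_ARRAY"),
    "char": ("BYTE_ARRAY", "DELTA_BYTE_ARRAY"),
    "varbinary": ("BYTE_ARRAY", "DELTA_BYTE_ARRAY"),
    "date": ("INT32", "DELTA_BINARY_PACKED"),
    "timestamp": ("INT64", "DELTA_BINARY_PACKED"),
}


def reader_gaps(written, support):
    """Return the Presto types whose written encoding is not marked 'Y' in the reader table."""
    gaps = {}
    for presto_type, key in written.items():
        status = support.get(key, "not listed")
        if status != "Y":
            gaps[presto_type] = (key, status)
    return gaps


if __name__ == "__main__":
    for presto_type, (key, status) in reader_gaps(PRESTO_V2_OUTPUT, READER_SUPPORT).items():
        print(f"{presto_type}: {key[0]} / {key[1]} -> {status}")
```

With the tables as transcribed, the gaps are boolean (RLE) and the three BYTE_ARRAY types (DELTA_BYTE_ARRAY). Decimal comes back as "not listed" because RLE_DICTIONARY pages are covered by the dictionary page table, not by the normal data page table.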