Open yingsu00 opened 6 months ago
Parquet file created from Presto Java
The two encodings that did not show up when creating Parquet files from Presto Java were BYTE_STREAM_SPLIT (Float, Double) and DELTA_LENGTH_BYTE_ARRAY (varchar, string, binary).
We were also unable to get Spark to write Parquet files that use these encodings. With Apache Arrow, however, we could create Parquet files that use them by adjusting the WriterProperties, as described in this doc: https://arrow.apache.org/docs/cpp/parquet.html#writer-properties
The following table lists the Presto type, Parquet type, and Parquet encodings for a V2 Parquet table created from Presto Java:
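For readers unfamiliar with the two missing encodings, here is a minimal pure-Python sketch of the data layout each one produces, based on my reading of the Parquet format spec. It is not how Presto, Spark, or Arrow implement them. The function names are hypothetical, and `delta_length_split` only shows the logical "lengths first, bytes after" split; the real encoding additionally packs the lengths with DELTA_BINARY_PACKED.

```python
import struct

def byte_stream_split_encode(values, fmt="<f"):
    """Scatter byte i of every value into stream i, then concatenate the streams."""
    width = struct.calcsize(fmt)  # 4 for "<f" (FLOAT), 8 for "<d" (DOUBLE)
    raw = [struct.pack(fmt, v) for v in values]
    return b"".join(bytes(r[i] for r in raw) for i in range(width))

def byte_stream_split_decode(data, fmt="<f"):
    """Inverse of the encoder: gather byte j of each stream back into value j."""
    width = struct.calcsize(fmt)
    count = len(data) // width
    return [
        struct.unpack(fmt, bytes(data[i * count + j] for i in range(width)))[0]
        for j in range(count)
    ]

def delta_length_split(values):
    """Logical layout only: all lengths first, then the concatenated bytes.
    (Assumption: the real format also delta-binary-packs the lengths.)"""
    return [len(v) for v in values], b"".join(values)

encoded = byte_stream_split_encode([1.0, 2.0])
lengths, payload = delta_length_split([b"Hello", b"World", b"Foobar"])
print(encoded.hex(), lengths, payload)
```

Grouping the same byte position of every value together tends to help a general-purpose compressor such as GZIP on floating-point data, which is why the absence of BYTE_STREAM_SPLIT for REAL and DOUBLE columns is worth noting.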
Presto Type | Parquet Type | Parquet Encodings |
---|---|---|
Boolean | BOOLEAN | RLE |
TinyInt | INT32 | DELTA_BINARY_PACKED |
smallint | INT32 | DELTA_BINARY_PACKED |
Integer | INT32 | DELTA_BINARY_PACKED |
Bigint | INT64 | DELTA_BINARY_PACKED |
REAL | FLOAT | PLAIN |
DOUBLE | DOUBLE | PLAIN |
DECIMAL | FIXED_LEN_BYTE_ARRAY | RLE_DICTIONARY |
VARCHAR | BYTE_ARRAY | DELTA_BYTE_ARRAY |
Char | BYTE_ARRAY | DELTA_BYTE_ARRAY |
VarBinary | BYTE_ARRAY | DELTA_BYTE_ARRAY |
JSON | create table tmp(json json); Query 20240829_223247_00157_hpkyz failed: No default Hive type provided for unsupported Hive type: json | From the Arrow Parquet docs: Unsupported logical types: JSON, BSON, UUID. If such a type is encountered when reading a Parquet file, the default physical type mapping is used (for example, a Parquet JSON column may be read as Arrow Binary or FixedSizeBinary). https://arrow.apache.org/docs/cpp/parquet.html#logical-types |
Date | INT32 | DELTA_BINARY_PACKED |
Time | create table tmp(time time); Query 20240829_223030_00153_hpkyz failed: No default Hive type provided for unsupported Hive type: time | |
Time With Time Zone | ||
Timestamp | INT64 | DELTA_BINARY_PACKED |
Timestamp with timezone | ||
Interval year to month | create table tmp(iym interval year to month); Query 20240829_223424_00163_hpkyz failed: No default Hive type provided for unsupported Hive type: interval year to month | |
Interval day to second | create table tmp(iym interval); Query 20240829_223452_00165_hpkyz failed: line 1:18: Unknown type 'interval' for column 'iym' create table tmp(iym interval) | |
array(integer) | INT32 | DELTA_BINARY_PACKED |
array(boolean) | BOOLEAN | RLE |
map(integer, integer) | INT32 | DELTA_BINARY_PACKED |
row("f0" varbinary, "f1" timestamp) | Broken down by field: f0 is BYTE_ARRAY, f1 is INT64 | Broken down by field: {"PathInSchema":["P0","F0"],"Type":"BYTE_ARRAY","Encodings":["RLE_DICTIONARY"],"CompressedSize":186422,"UncompressedSize":223747,"NumValues":1000,"CompressionCodec":"GZIP"},{"PathInSchema":["P0","F1"],"Type":"INT64","Encodings":["RLE_DICTIONARY"],"CompressedSize":2527,"UncompressedSize":2500,"NumValues":1000,"NullCount":234,"MaxValue":9197623049880936755,"MinValue":58472672228734950,"CompressionCodec":"GZIP"} |
IPADDRESS | create table tmp(ipaddress ipaddress); Query 20240829_222932_00151_hpkyz failed: No default Hive type provided for unsupported Hive type: ipaddress | |
IPPREFIX | create table tmp(ip ipprefix); Query 20240829_223604_00167_hpkyz failed: No default Hive type provided for unsupported Hive type: ipprefix | |
UUID | create table tmp(u uuid); Query 20240829_223636_00168_hpkyz failed: No default Hive type provided for unsupported Hive type: uuid | From the Arrow Parquet docs: Unsupported logical types: JSON, BSON, UUID. If such a type is encountered when reading a Parquet file, the default physical type mapping is used (for example, a Parquet JSON column may be read as Arrow Binary or FixedSizeBinary). https://arrow.apache.org/docs/cpp/parquet.html#logical-types |
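
The per-column breakdown in the `row(...)` entry above can be tabulated mechanically. The following is a rough sketch that assumes the column-chunk metadata is available as a JSON array of objects with the same field names as in that entry (`PathInSchema`, `Type`, `Encodings`). The function name `summarize_column_chunks` and the trimmed `sample` input are illustrative only; which tool produced the original dump is not stated in this issue.

```python
import json

def summarize_column_chunks(metadata_json):
    """Map each dotted column path to its (physical type, encodings) pair."""
    chunks = json.loads(metadata_json)
    return {
        ".".join(chunk["PathInSchema"]): (chunk["Type"], chunk["Encodings"])
        for chunk in chunks
    }

# Trimmed copy of the two column chunks reported for the row type above,
# wrapped in a JSON array so it parses as a single document.
sample = """[
  {"PathInSchema": ["P0", "F0"], "Type": "BYTE_ARRAY", "Encodings": ["RLE_DICTIONARY"]},
  {"PathInSchema": ["P0", "F1"], "Type": "INT64", "Encodings": ["RLE_DICTIONARY"]}
]"""

for path, (ptype, encodings) in summarize_column_chunks(sample).items():
    print(path, ptype, ",".join(encodings))
```

Both nested fields report RLE_DICTIONARY here, whereas the top-level varbinary and timestamp columns in the table report DELTA_BYTE_ARRAY and DELTA_BINARY_PACKED.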
Description
Normal Data Page Types
Normal Data Page Encodings
Dictionary Page Encodings
Repetition/Definition Levels