Open alexandros-kyriakides opened 9 months ago
To see the encoding of a parquet file using duckdb:
D SELECT encodings FROM parquet_metadata('example_file_plain_encoding.parquet');
┌────────────────────────┐
│ encodings │
│ varchar │
├────────────────────────┤
│ RLE, BIT_PACKED, PLAIN │
└────────────────────────┘
D SELECT encodings FROM parquet_metadata('example_file_delta_binary_packed_encoding.parquet');
┌──────────────────────────────────────┐
│ encodings │
│ varchar │
├──────────────────────────────────────┤
│ RLE, BIT_PACKED, DELTA_BINARY_PACKED │
└──────────────────────────────────────┘
D
Some extra info:
It is the same for Encoding = 6 (DELTA_LENGTH_BYTE_ARRAY) as well.
Version: 3.2.4-613f0b5
We're encountering this when storing larger text data. From the docs:
DELTA_LENGTH_BYTE_ARRAY = 6: This encoding is always preferred over PLAIN for byte array columns.
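For readers unfamiliar with this encoding, here is a rough sketch of the idea behind DELTA_LENGTH_BYTE_ARRAY: the lengths of all values are stored first, followed by the value bytes concatenated back to back. This is a simplification and not the actual on-disk format. In a real Parquet file the lengths are themselves written with DELTA_BINARY_PACKED, which this sketch skips. The function names are made up for illustration.

```python
# Simplified illustration of the DELTA_LENGTH_BYTE_ARRAY idea.
# Assumption: lengths are kept as a plain Python list here; a real
# Parquet writer would encode them with DELTA_BINARY_PACKED.

def encode_delta_length(values):
    """Split byte strings into (lengths, concatenated data)."""
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data


def decode_delta_length(lengths, data):
    """Rebuild the original byte strings from lengths and data."""
    out = []
    pos = 0
    for n in lengths:
        out.append(data[pos:pos + n])
        pos += n
    return out


lengths, data = encode_delta_length([b"Hello", b"World", b"Foobar", b"ABCDEF"])
restored = decode_delta_length(lengths, data)
```

A reader that does not implement this layout cannot find where each value starts, which is why such a reader fails on the column rather than falling back to PLAIN.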
@derekperkins @alexandros-kyriakides What version are you using?
I used v3.3-rc2
I am on StarRocks version:
version info
Version: 3.3.0
Git: 19a3f66
Build Info: StarRocks@localhost (Ubuntu 22.04.3 LTS)
Build Time: 2024-06-21 11:48:40
And here is my encoding:
D SELECT encodings FROM parquet_metadata('C:\projs\go-generators\data\parquet\obt-1b\transactions\5US3cJJlmVsLVkwx2r0ZVw==.parquet') limit 10;
┌─────────────────────────┐
│ encodings │
│ varchar │
├─────────────────────────┤
│ DELTA_LENGTH_BYTE_ARRAY │
│ DELTA_LENGTH_BYTE_ARRAY │
│ DELTA_LENGTH_BYTE_ARRAY │
│ DELTA_LENGTH_BYTE_ARRAY │
│ PLAIN │
│ PLAIN │
│ PLAIN │
│ PLAIN │
│ DELTA_LENGTH_BYTE_ARRAY │
│ DELTA_LENGTH_BYTE_ARRAY │
├─────────────────────────┤
│ 10 rows │
└─────────────────────────┘
D
And I get the same error:
type:LOAD_RUN_FAIL; msg:IOError: Not yet implemented: Unsupported encoding.. filename: s3://dump/obt-1b/transactions/-dwQvzQDsxUdY3LhX2-Yhg==.parquet
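As a possible pre-check before loading, the encodings string that parquet_metadata reports could be scanned for the encodings mentioned in this thread. The sketch below is only an illustration: the helper name is hypothetical, and the set of unsupported encodings is assumed from the reports above (DELTA_BINARY_PACKED and DELTA_LENGTH_BYTE_ARRAY), so it may not be complete for any given StarRocks version.

```python
# Encodings reported as failing in this thread.
# Assumption: this list comes only from the reports above and may be incomplete.
UNSUPPORTED = {"DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY"}


def unsupported_encodings(encodings):
    """Return the reported-unsupported encodings found in a string such as
    'RLE, BIT_PACKED, PLAIN', as shown by DuckDB's parquet_metadata."""
    names = {name.strip() for name in encodings.split(",")}
    return sorted(names & UNSUPPORTED)


plain_file = unsupported_encodings("RLE, BIT_PACKED, PLAIN")
delta_file = unsupported_encodings("RLE, BIT_PACKED, DELTA_BINARY_PACKED")
```

An empty result means none of the encodings from this thread appear in that column chunk. A non-empty result names the ones that would likely trigger the "Unsupported encoding" error on the affected versions.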
@derekperkins @alexandros-kyriakides What version are you using?
The version I had used was 3.1.5-5d8438a.
This will be supported in 3.3.1: https://github.com/StarRocks/starrocks/pull/47407
Steps to reproduce the behavior
UseDeltaBinaryPackedEncoding option by setting it to false and true, respectively:
The two files created can be found here: example_files.tar.gz
Ran SELECT on the file with PLAIN encoding:
Ran SELECT on the file with DELTA_BINARY_PACKED encoding:
Expected behavior
It was expected that both SELECT statements would run without an error and return the number of records in each file.
Real behavior
For the file using PLAIN encoding, there was no error (expected behavior).
For the file using DELTA_BINARY_PACKED encoding, there was an error (unexpected behavior).
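To show why the two files differ for a reader, here is a minimal sketch of the delta idea behind DELTA_BINARY_PACKED: the first value is stored as-is and every later value is stored as the difference from its predecessor. This is a simplification with hypothetical function names. The real encoding also groups deltas into blocks, subtracts the minimum delta per block, and bit-packs the result, none of which is shown here.

```python
# Simplified illustration of delta encoding for integers.
# Assumption: the input list is non-empty; block layout and bit-packing
# used by the real DELTA_BINARY_PACKED format are left out.

def delta_encode(values):
    """Return (first value, list of successive differences)."""
    if not values:
        raise ValueError("values must be non-empty")
    first = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    return first, deltas


def delta_decode(first, deltas):
    """Rebuild the original values by accumulating the deltas."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


first, deltas = delta_encode([7, 5, 3, 1, 2, 3, 4, 5])
restored = delta_decode(first, deltas)
```

With PLAIN, each value can be read directly. With the delta form, the reader has to implement the accumulation step (plus the unpacking omitted here), which is the part the affected versions report as not yet implemented.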
StarRocks version