StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0

Error when reading parquet file using DELTA_BINARY_PACKED encoding (IOError: Not yet implemented: Unsupported encoding.) #37169

Open alexandros-kyriakides opened 9 months ago

alexandros-kyriakides commented 9 months ago

Steps to reproduce the behavior

  1. Created two parquet files and saved them to S3. One file was created using PLAIN encoding, the other file using DELTA_BINARY_PACKED encoding. The parquet-dotnet library was used to create the files. The encoding was selected using the UseDeltaBinaryPackedEncoding option by setting it to false and true, respectively:
        // The type argument is inferred from `results`.
        var parquetSchema =
            await ParquetSerializer.SerializeAsync(results, outputPath, new ParquetSerializerOptions()
            {
                CompressionMethod = CompressionMethod.Snappy, // This is the Default Compression Method
                ParquetOptions = new ParquetOptions()
                {
                    UseDeltaBinaryPackedEncoding = false // false: PLAIN, true: DELTA_BINARY_PACKED
                }
            });

The two files created can be found here: example_files.tar.gz

  2. Ran SELECT on the file with PLAIN encoding:

    mysql> SELECT count() FROM FILES("path" = "s3://*****/example_file_plain_encoding.parquet","format" = "parquet","aws.s3.access_key" = "*****","aws.s3
  3. Ran SELECT on the file with DELTA_BINARY_PACKED encoding:

    mysql> SELECT count() FROM FILES("path" = "s3://******/example_file_delta_binary_packed_encoding.parquet","format" = "parquet","aws.s3.access_key" =
    "*****","aws.s3.secret_key" = "*****","aws.s3.region" = "us-east-1" );

Expected behavior

It was expected that both SELECT statements would run without an error and return the number of records in each file.

Real behavior

  1. For the file using PLAIN encoding, there was no error (expected behavior).

    +---------+
    | count() |
    +---------+
    |  250089 |
    +---------+
    1 row in set (1,60 sec)
  2. For the file using DELTA_BINARY_PACKED encoding, there was an error (unexpected behavior).

ERROR 1064 (HY000): IOError: Not yet implemented: Unsupported encoding.. filename: s3://*****/example_file_delta_binary_packed_encoding.parquet

StarRocks version

mysql> select current_version();
+-------------------+
| current_version() |
+-------------------+
| 3.1.5-5d8438a     |
+-------------------+
1 row in set (0,12 sec)

alexandros-kyriakides commented 9 months ago

To see the encoding of a parquet file using duckdb:

D SELECT encodings FROM parquet_metadata('example_file_plain_encoding.parquet');
┌────────────────────────┐
│       encodings        │
│        varchar         │
├────────────────────────┤
│ RLE, BIT_PACKED, PLAIN │
└────────────────────────┘
D SELECT encodings FROM parquet_metadata('example_file_delta_binary_packed_encoding.parquet');
┌──────────────────────────────────────┐
│              encodings               │
│               varchar                │
├──────────────────────────────────────┤
│ RLE, BIT_PACKED, DELTA_BINARY_PACKED │
└──────────────────────────────────────┘
D 
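The `encodings` cells above can also be checked programmatically. The following is a minimal Python sketch, not part of StarRocks or DuckDB: the function name `failing_encodings` and the set `REPORTED_FAILING` are hypothetical, and the set only lists encodings that commenters in this thread report as failing, not an official compatibility list.

```python
# Encodings reported as failing in this thread (an assumption taken from the
# comments here, not an official StarRocks compatibility list).
REPORTED_FAILING = {"DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY"}


def failing_encodings(encodings_cell):
    """Return the names in one `encodings` cell that the thread reports as failing."""
    # DuckDB prints the encodings of a column chunk as a comma-separated string.
    names = [name.strip() for name in encodings_cell.split(",")]
    return sorted(name for name in names if name in REPORTED_FAILING)


print(failing_encodings("RLE, BIT_PACKED, PLAIN"))                # []
print(failing_encodings("RLE, BIT_PACKED, DELTA_BINARY_PACKED"))  # ['DELTA_BINARY_PACKED']
```

An empty result for every column chunk suggests the file avoids the encodings discussed here; it does not guarantee the file loads.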
alexandros-kyriakides commented 9 months ago

Some extra info:

jlj-incom commented 5 months ago

It is the same for Encoding = 6 (DELTA_LENGTH_BYTE_ARRAY) as well.

Version: 3.2.4-613f0b5
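For readers who only see the numeric encoding ID in an error or in file metadata, the lookup below may help. It is a small sketch that mirrors the `Encoding` enum values from the Parquet format definition; the class name `ParquetEncoding` is just an illustrative choice.

```python
from enum import IntEnum


class ParquetEncoding(IntEnum):
    """Numeric IDs of the Parquet `Encoding` enum (ID 1 is unused)."""
    PLAIN = 0
    PLAIN_DICTIONARY = 2
    RLE = 3
    BIT_PACKED = 4
    DELTA_BINARY_PACKED = 5
    DELTA_LENGTH_BYTE_ARRAY = 6
    DELTA_BYTE_ARRAY = 7
    RLE_DICTIONARY = 8
    BYTE_STREAM_SPLIT = 9


print(ParquetEncoding(6).name)  # DELTA_LENGTH_BYTE_ARRAY
print(ParquetEncoding(5).name)  # DELTA_BINARY_PACKED
```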

derekperkins commented 3 months ago

We're encountering this when storing larger text data. From the docs:

DELTA_LENGTH_BYTE_ARRAY = 6: This encoding is always preferred over PLAIN for byte array columns.

https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6
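To illustrate what this encoding does, here is a simplified Python sketch of the idea described in the linked docs: the lengths of all byte arrays are stored first, followed by the concatenated bytes. The function names are hypothetical, and the sketch keeps the lengths as a plain list, whereas the real format stores them with DELTA_BINARY_PACKED, so this is not a reader for actual Parquet pages.

```python
def encode_delta_length(values):
    """Split byte strings into (lengths, concatenated data)."""
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data


def decode_delta_length(lengths, data):
    """Rebuild the byte strings by slicing `data` according to `lengths`."""
    out, pos = [], 0
    for n in lengths:
        out.append(data[pos:pos + n])
        pos += n
    return out


values = [b"Hello", b"World", b"Foobar", b"ABCDEF"]
lengths, data = encode_delta_length(values)
print(lengths)  # [5, 5, 6, 6]
print(data)     # b'HelloWorldFoobarABCDEF'
assert decode_delta_length(lengths, data) == values
```

Keeping the lengths apart from the data is what lets a real writer compress the length sequence separately from the raw bytes.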

jaogoy commented 3 months ago

@derekperkins @alexandros-kyriakides Which version are you using?

derekperkins commented 3 months ago

I used v3.3-rc2

mustafa-qamaruddin commented 2 months ago

I am on StarRocks version:

version info
Version: 3.3.0
Git: 19a3f66
Build Info: StarRocks@localhost (Ubuntu 22.04.3 LTS)
Build Time: 2024-06-21 11:48:40

And here is my encoding:

D SELECT encodings FROM parquet_metadata('C:\projs\go-generators\data\parquet\obt-1b\transactions\5US3cJJlmVsLVkwx2r0ZVw==.parquet') limit 10; 
┌─────────────────────────┐
│        encodings        │
│         varchar         │
├─────────────────────────┤
│ DELTA_LENGTH_BYTE_ARRAY │
│ DELTA_LENGTH_BYTE_ARRAY │
│ DELTA_LENGTH_BYTE_ARRAY │
│ DELTA_LENGTH_BYTE_ARRAY │
│ PLAIN                   │
│ PLAIN                   │
│ PLAIN                   │
│ PLAIN                   │
│ DELTA_LENGTH_BYTE_ARRAY │
│ DELTA_LENGTH_BYTE_ARRAY │
├─────────────────────────┤
│         10 rows         │
└─────────────────────────┘
D

And I get the same error: type:LOAD_RUN_FAIL; msg:IOError: Not yet implemented: Unsupported encoding.. filename: s3://dump/obt-1b/transactions/-dwQvzQDsxUdY3LhX2-Yhg==.parquet

alexandros-kyriakides commented 2 months ago

@derekperkins @alexandros-kyriakides Which version are you using?

The version I had used was 3.1.5-5d8438a.

wyb commented 2 months ago

This will be supported in 3.3.1: https://github.com/StarRocks/starrocks/pull/47407
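A quick way to compare a reported version string against that release is sketched below. This is a hypothetical helper, not a StarRocks API: `FIX_VERSION` is taken from the comment above and has not been verified independently, and suffixes such as build hashes or `-rc2` are ignored, so a pre-release of the fix version itself would be treated as including the fix.

```python
import re

# Taken from the comment above; an assumption, not independently verified.
FIX_VERSION = (3, 3, 1)


def parse_version(text):
    """Extract (major, minor, patch) from strings such as '3.1.5-5d8438a' or 'v3.3-rc2'."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    parts = [int(p) for p in match.group(1).split(".")]
    parts += [0] * (3 - len(parts))  # pad '3.3' to (3, 3, 0)
    return tuple(parts[:3])


def has_fix(text):
    """True if the version is at or after FIX_VERSION (suffixes are ignored)."""
    return parse_version(text) >= FIX_VERSION


for version in ["3.1.5-5d8438a", "3.2.4-613f0b5", "v3.3-rc2", "3.3.0", "3.3.1"]:
    print(version, has_fix(version))
```

With the versions reported in this thread, only `3.3.1` evaluates to `True`.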