StarRocks / starrocks


File external table - s3://parquet - fails to parse but insert into from files works #46838

Open x2ocoder opened 5 months ago

x2ocoder commented 5 months ago

SUCCEEDS: Loading from s3://parquet using "INSERT INTO .. FROM FILES ( "format" = "parquet", "aws.s3....")

SUCCEEDS: SELECT * FROM FILES ( "path" = "s3://starrocks-examples/user_behavior_ten_million_rows.parquet",

FAILS: Querying the exact same s3://parquet file using CREATE EXTERNAL TABLE

Since the parquet file loads properly using INSERT INTO FROM FILES, how can the same parquet file have an invalid format for CREATE EXTERNAL TABLE?

### Steps to reproduce the behavior (Required)

CREATE EXTERNAL TABLE dummy
(
    Method VARCHAR(50) NULL,
    Path VARCHAR(50) NULL
)
ENGINE=file
PROPERTIES 
(       
        "format" = "parquet",
        "enable_recursive_listing" = "true",
        "path" = "s3://x2orocks-apne1/ginlogs/2024/06/11/dummy.parquet",
)

StarRocks > select * from dummy limit 10;
ERROR 1064 (HY000): FileReader::get_next failed. reason = Corruption: Failed to decode parquet page header, page header's size is out of range.  allowed_page_size=0, max_page_size=16777216, offset=39, finish_offset=39
be/src/formats/parquet/column_chunk_reader.cpp:104 _page_reader->next_header()
be/src/formats/parquet/column_chunk_reader.cpp:63 _parse_page_header()
be/src/formats/parquet/stored_column_reader.cpp:466 _reader->load_header()
be/src/formats/parquet/stored_column_reader.cpp:331 _next_page()

DuckDB has no problem reading the same parquet file.
The parquet file was created using parquet-go v0.22.0.

Sample file that works with INSERT INTO but fails with CREATE EXTERNAL TABLE + SELECT: dummy.parquet.zip

Created with gist: https://gist.github.com/x2ocoder/724016a1c3b7c2635be4079a4c11cd43

Contents:
```
│ method  │ path    │
│ varchar │ varchar │
│ GET     │ /home   │
```

Could be related to: 
https://github.com/StarRocks/starrocks/blob/main/be/src/formats/parquet/page_reader.cpp#L55-L60
size_t allowed_page_size = kDefaultPageHeaderSize;
size_t remaining = _finish_offset - _offset;
uint32_t header_length = 0;

RETURN_IF_ERROR(_stream->seek(_offset));

do {
    allowed_page_size = std::min(std::min(allowed_page_size, remaining), kMaxPageHeaderSize);


Should the do {} loop be skipped if remaining == 0 instead of crashing?
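
To make the question concrete, here is a minimal standalone sketch of such a guard. It is not the StarRocks code: `next_header_window` and `HeaderStatus` are hypothetical names, the value of `kDefaultPageHeaderSize` is an assumption, and only the `std::min` expression and the 16777216 limit come from the snippet and error message above. Whether "end of chunk" is the right answer when `remaining == 0` depends on why the reader asked for another page in the first place, which this issue does not establish.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration only, not the StarRocks implementation.
constexpr std::size_t kDefaultPageHeaderSize = 16 * 1024;     // assumed value
constexpr std::size_t kMaxPageHeaderSize = 16 * 1024 * 1024;  // 16777216, as in the error message

enum class HeaderStatus { kOk, kEndOfChunk, kCorruption };

// Computes how many bytes may be read for the next page header.
// When the reader already sits at finish_offset, it reports kEndOfChunk
// so the caller can stop cleanly instead of treating a zero-sized
// window as a corrupt file.
HeaderStatus next_header_window(std::size_t offset, std::size_t finish_offset,
                                std::size_t* allowed_page_size) {
    assert(allowed_page_size != nullptr);
    if (offset > finish_offset) {
        return HeaderStatus::kCorruption;  // reader ran past the column chunk
    }
    const std::size_t remaining = finish_offset - offset;
    if (remaining == 0) {
        *allowed_page_size = 0;
        return HeaderStatus::kEndOfChunk;  // nothing left to decode
    }
    // Same clamping expression as in the quoted snippet.
    *allowed_page_size =
            std::min(std::min(kDefaultPageHeaderSize, remaining), kMaxPageHeaderSize);
    return HeaderStatus::kOk;
}
```

With the values from the first error message (offset=39, finish_offset=39), this sketch returns `HeaderStatus::kEndOfChunk` with a window of 0 bytes, where the current reader reports Corruption.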

NOTE: sample3.parquet, which was created by DuckDB, does not have this problem:
https://filesampleshub.com/download/code/parquet/sample3.parquet

### Expected behavior (Required)

Querying the file external table should behave exactly the same as INSERT INTO ... FROM FILES on the same file.

### Real behavior (Required)

ERROR 1064 (HY000): FileReader::get_next failed. reason = Corruption: Failed to decode parquet page header, page header's size is out of range.  allowed_page_size=0, max_page_size=16777216, offset=396308, finish_offset=396308

### StarRocks version (Required)
Same problem with both:
docker pull starrocks/allin1-ubuntu:3.1-latest
docker pull starrocks/allin1-ubuntu:3.3.0-rc02

chaoyli commented 3 months ago

FILE EXTERNAL TABLE is deprecated. You can always use SELECT * FROM FILES instead. @x2ocoder