SUCCEEDS: Loading the Parquet file from S3 using `INSERT INTO .. FROM FILES ( "format" = "parquet", "aws.s3...." )`
SUCCEEDS: `SELECT * FROM FILES ( "path" = "s3://starrocks-examples/user_behavior_ten_million_rows.parquet",`
FAILS: Querying the exact same S3 Parquet file using `CREATE EXTERNAL TABLE`
Since the Parquet file loads properly using `INSERT INTO .. FROM FILES`, how can the same Parquet file have an invalid format for `CREATE EXTERNAL TABLE`?
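For contrast, the working load path looks roughly like this sketch (the target table name and the region value are illustrative placeholders, not the reporter's actual values):

```sql
-- Hedged sketch: loading the same Parquet file via FILES() succeeds.
-- "user_behavior_inferred" and the aws.s3.* values are placeholders.
INSERT INTO user_behavior_inferred
SELECT * FROM FILES
(
    "path" = "s3://starrocks-examples/user_behavior_ten_million_rows.parquet",
    "format" = "parquet",
    "aws.s3.region" = "us-east-1"
);
```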
### Steps to reproduce the behavior (Required)
```sql
CREATE EXTERNAL TABLE dummy
(
    Method VARCHAR(50) NULL,
    Path VARCHAR(50) NULL
)
ENGINE=file
PROPERTIES
(
    "format" = "parquet",
    "enable_recursive_listing" = "true",
    "path" = "s3://x2orocks-apne1/ginlogs/2024/06/11/dummy.parquet"
);
```
```
StarRocks > select * from dummy limit 10;
ERROR 1064 (HY000): FileReader::get_next failed. reason = Corruption: Failed to decode parquet page header, page header's size is out of range. allowed_page_size=0, max_page_size=16777216, offset=39, finish_offset=39
be/src/formats/parquet/column_chunk_reader.cpp:104 _page_reader->next_header()
be/src/formats/parquet/column_chunk_reader.cpp:63 _parse_page_header()
be/src/formats/parquet/stored_column_reader.cpp:466 _reader->load_header()
be/src/formats/parquet/stored_column_reader.cpp:331 _next_page()
```
DuckDB has no problem reading the same Parquet file.
The Parquet file was created using parquet-go v0.22.0.
Should the `do {}` loop be skipped when `remaining == 0` instead of crashing?
NOTE: a DuckDB-created sample3.parquet does not have the problem:
https://filesampleshub.com/download/code/parquet/sample3.parquet
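For reference, the DuckDB cross-check above amounts to something like the following (the local file path is illustrative):

```sql
-- DuckDB reads the same file without error ('dummy.parquet' is a placeholder path).
SELECT * FROM read_parquet('dummy.parquet') LIMIT 10;
```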
### Expected behavior (Required)
Querying the external table should return the same rows as `INSERT INTO .. FROM FILES` and `SELECT * FROM FILES(...)` do for the same file.
### Real behavior (Required)
```
ERROR 1064 (HY000): FileReader::get_next failed. reason = Corruption: Failed to decode parquet page header, page header's size is out of range. allowed_page_size=0, max_page_size=16777216, offset=396308, finish_offset=396308
```
### StarRocks version (Required)
Same problem with both:

```
docker pull starrocks/allin1-ubuntu:3.1-latest
docker pull starrocks/allin1-ubuntu:3.3.0-rc02
```
Sample file that works with `INSERT INTO` but fails with `CREATE EXTERNAL TABLE` + `SELECT`: dummy.parquet.zip. Created with this gist: https://gist.github.com/x2ocoder/724016a1c3b7c2635be4079a4c11cd43
Contents:

```
│ method  │ path    │
│ varchar │ varchar │
│ GET     │ /home   │
```