StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0
8.98k stars 1.8k forks source link

query error from HUDI external table with parquet file format and lzo compression #4559

Closed tiannan-sr closed 2 years ago

tiannan-sr commented 2 years ago

Steps to reproduce the behavior (Required)

  1. create hudi table with parquet file format and lzo compression, and insert data:

    HUDI_CONF:
    hoodie.table.base.file.format=PARQUET
    hoodie.parquet.compression.codec=LZO
    
    create table hudi_parquet_lzo(
    uuid int,
    col_boolean boolean,
    col_int int,
    col_long long,
    col_float float,
    col_double double,
    col_decimal decimal(38,18),
    col_date date,
    col_timestamp timestamp,
    col_string string,
    col_binary binary,
    col_array array<int>,
    col_struct struct<a:string,b:int>,
    col_map map<string,int>
    ) USING hudi 
    TBLPROPERTIES(
    type = 'cow',
    primarykey = 'uuid');
    
    insert into hudi_parquet_lzo select 1,true,1,1.01,1.001,1.0001,10000000000.0000000001,cast('2020-01-01' as date),cast('2020-01-01 00:00:01' as timestamp),'Top 10 Unsolved Mysteries of Paleontological Dinosaurs, Did You Know?',cast('1110001010101011001001' as binary),array(1,10,100),null,null;
    insert into hudi_parquet_lzo select 2,false,2,2.02,2.002,2.0002,20000000000.0000000002,cast('2020-02-01' as date),cast('2020-02-01 00:00:02' as timestamp),'Xi Jinping, general secretary of the Communist Party of China Central Committee, called on Tuesday for unrelenting efforts in exercising full and rigorous governance over the Party, saying that the CPC will continue to show zero tolerance for corruption.',cast('1110001010101011001001' as binary),array(2,20,200),null,null;
    insert into hudi_parquet_lzo select 3,true,3,3.03,3.003,3.0003,30000000000.0000000003,cast('2020-03-01' as date),cast('2020-03-01 00:00:04' as timestamp),'Nation raises caution on overseas packages',cast('1110001010101011001001' as binary),array(3,30,300),null,null;
    insert into hudi_parquet_lzo select 4,false,4,4.04,4.004,4.0004,40000000000.0000000004,cast('2020-04-01' as date),cast('2020-04-01 00:00:04' as timestamp),'Shanghai, Shenzhen register most newly listed firms in 2021',cast('1110001010101011001001' as binary),array(4,40,400),null,null;
    insert into hudi_parquet_lzo select 5,true,5,5.05,5.005,5.0005,50000000000.0000000005,cast('2020-05-01' as date),cast('2020-05-01 00:00:05' as timestamp),'A total of 4,685 companies have been listed on the A-share market in China as of Dec 31, 2021, with 46 percent of them based in Beijing, Shanghai, Shenzhen, Hangzhou, Suzhou, Guangzhou, Ningbo, Nanjing, Wuxi, and Chengdu, said a report of National Business Daily on Wednesday.',cast('1110001010101011001001' as binary),array(5,50,500),null,null;
    insert into hudi_parquet_lzo select 6,false,6,6.06,6.006,6.0006,60000000000.0000000006,cast('2020-06-01' as date),cast('2020-06-01 00:00:06' as timestamp),'The development of enterprises, as well as the push of local governments, has boosted the listings, the report said. For example, in early 2018, Central China''s Hubei province launched a plan to double its listed companies by including the listing of enterprises into the government''s annual performance assessment.',cast('1110001010101011001001' as binary),array(6,60,600),null,null;
    insert into hudi_parquet_lzo select 7,true,7,7.07,7.007,7.0007,70000000000.0000000007,cast('2020-07-01' as date),cast('2020-07-01 00:00:07' as timestamp),'Most listed companies in the top 10 cities come from industries such as information technology, electronics, mechanical equipment, medical biology, and electric power equipment, with information technology bearing most listed companies.',cast('1110001010101011001001' as binary),array(7,70,700),null,null;
    insert into hudi_parquet_lzo select 8,false,8,8.08,8.008,8.0008,80000000000.0000000008,cast('2020-08-01' as date),cast('2020-08-01 00:00:08' as timestamp),'Citi report finds MNC mood in China buoyant',cast('1110001010101011001001' as binary),array(8,80,800),null,null;
    insert into hudi_parquet_lzo select 9,true,9,9.09,9.009,9.0009,90000000000.0000000009,cast('2020-09-01' as date),cast('2020-09-01 00:00:09' as timestamp),'Xi: China, Russia major champions of multilateralism, global justice',cast('1110001010101011001001' as binary),array(9,90,900),null,null;
    insert into hudi_parquet_lzo select 10,false,10,10.10,10.010,10.00010,100000000000.00000000010,cast('2020-10-01' as date),cast('2020-10-01 00:00:10' as timestamp),'Volkswagen deliveries hit by chip shortages',cast('1110001010101011001001' as binary),array(10,100,1000),null,null;
    insert into hudi_parquet_lzo select 11,true,11,11.11,11.011,11.00011,110000000000.00000000011,cast('2020-11-01' as date),cast('2020-11-01 00:00:11' as timestamp),'Light installations featuring the tiger, the Chinese zodiac animal for the upcoming lunar year, will be set up in the main plaza.',cast('1110001010101011001001' as binary),array(11,110,1100),null,null;
    insert into hudi_parquet_lzo select 12,false,12,12.12,12.012,12.00012,120000000000.00000000012,cast('2020-12-01' as date),cast('2020-10-01 00:00:10' as timestamp),'The annual lantern show at Yuyuan Garden, a historic tourist site in Shanghai, will kick off on Tuesday and run through Feb 28.',cast('1110001010101011001001' as binary),array(12,120,1200),null,null;
    insert into hudi_parquet_lzo select 13,null,null,null,null,null,null,null,null,null,null,null,null,null;
    insert into hudi_parquet_lzo select 14,null,null,null,null,null,12345678901234567890.123456789012345678,null,null,'',null,null,null,null;
  2. create hudi external table:

    create external table ex_hudi_tbl0 (
    uuid int
    ,col_boolean boolean
    ,col_int int
    ,col_long bigint
    ,col_float float
    ,col_double double
    ,col_decimal decimal(38,18)
    ,col_date date
    ,col_string string
    ,col_binary varchar(200))
    ENGINE=hudi
    properties (
    "resource" = "hudi_emr_tn",
    "table" = "hudi_parquet_lzo",
    "database" = "hudi_db");
  3. select from hudi external table:

    select col_boolean,col_int,col_long,col_float,col_double,col_decimal,col_date,col_string,col_binary from ex_hudi_tbl0 order by col_int limit 5
    execute sql error and messages: (1064, 'unknown compression type(11)'), and time is: 2022-03-29 19:52:41

Expected behavior (Required)

lzo compression surported

Real behavior (Required)

unknown compression type(11)

StarRocks version (Required)

miomiocat commented 2 years ago

i fill fix it

miomiocat commented 2 years ago

currently, the lzo compression type is not supported by parquet format starrocks external table so hudi is, but we can add this feature to the future plan

github-actions[bot] commented 2 years ago

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!