StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0
8.98k stars 1.8k forks source link

result error from hudi external table with orc file system #4753

Closed tiannan-sr closed 2 years ago

tiannan-sr commented 2 years ago

Steps to reproduce the behavior (Required)

  1. create hudi table, and insert data:

    create table hudi_par_orc_lzo(
    uuid int,
    col_boolean boolean,
    col_int int,
    col_long long,
    col_float float,
    col_double double,
    col_decimal decimal(38,18),
    col_date date,
    col_timestamp timestamp,
    col_string string,
    col_binary binary,
    col_array array<int>,
    col_struct struct<a:string,b:int>,
    col_map map<string,int>
    ) USING hudi 
    TBLPROPERTIES(
    type = "cow",
    primarykey = 'uuid')
    partitioned by(col_date,col_int);
    
    insert into hudi_par_orc_lzo partition(col_date = '2020-01-01', col_int='1') select 1,true,1.01,1.001,1.0001,10000000000.0000000001,cast('2020-01-01 00:00:01' as timestamp),'Top 10 Unsolved Mysteries of Paleontological Dinosaurs, Did You Know?',cast('1110001010101011001001' as binary),array(1,10,100),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-01-01', col_int='2') select 1,true,1.01,1.001,1.0001,10000000000.0000000001,cast('2020-01-01 00:00:01' as timestamp),'Top 10 Unsolved Mysteries of Paleontological Dinosaurs, Did You Know?',cast('1110001010101011001001' as binary),array(1,10,100),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-02-01', col_int='2') select 2,false,2.02,2.002,2.0002,20000000000.0000000002,cast('2020-02-01 00:00:02' as timestamp),'Xi Jinping, general secretary of the Communist Party of China Central Committee, called on Tuesday for unrelenting efforts in exercising full and rigorous governance over the Party, saying that the CPC will continue to show zero tolerance for corruption.',cast('1110001010101011001001' as binary),array(2,20,200),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-02-01', col_int='3') select 2,false,2.02,2.002,2.0002,20000000000.0000000002,cast('2020-02-01 00:00:02' as timestamp),'Xi Jinping, general secretary of the Communist Party of China Central Committee, called on Tuesday for unrelenting efforts in exercising full and rigorous governance over the Party, saying that the CPC will continue to show zero tolerance for corruption.',cast('1110001010101011001001' as binary),array(2,20,200),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-03-01', col_int='3') select 3,true,3.03,3.003,3.0003,30000000000.0000000003,cast('2020-03-01 00:00:04' as timestamp),'Nation raises caution on overseas packages',cast('1110001010101011001001' as binary),array(3,30,300),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-03-01', col_int='4') select 3,true,3.03,3.003,3.0003,30000000000.0000000003,cast('2020-03-01 00:00:04' as timestamp),'Nation raises caution on overseas packages',cast('1110001010101011001001' as binary),array(3,30,300),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-04-01', col_int='4') select 4,false,4.04,4.004,4.0004,40000000000.0000000004,cast('2020-04-01 00:00:04' as timestamp),'Shanghai, Shenzhen register most newly listed firms in 2021',cast('1110001010101011001001' as binary),array(4,40,400),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-04-01', col_int='5') select 4,false,4.04,4.004,4.0004,40000000000.0000000004,cast('2020-04-01 00:00:04' as timestamp),'Shanghai, Shenzhen register most newly listed firms in 2021',cast('1110001010101011001001' as binary),array(4,40,400),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-05-01', col_int='5') select 5,true,5.05,5.005,5.0005,50000000000.0000000005,cast('2020-05-01 00:00:05' as timestamp),'A total of 4,685 companies have been listed on the A-share market in China as of Dec 31, 2021, with 46 percent of them based in Beijing, Shanghai, Shenzhen, Hangzhou, Suzhou, Guangzhou, Ningbo, Nanjing, Wuxi, and Chengdu, said a report of National Business Daily on Wednesday.',cast('1110001010101011001001' as binary),array(5,50,500),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-05-01', col_int='6') select 5,true,5.05,5.005,5.0005,50000000000.0000000005,cast('2020-05-01 00:00:05' as timestamp),'A total of 4,685 companies have been listed on the A-share market in China as of Dec 31, 2021, with 46 percent of them based in Beijing, Shanghai, Shenzhen, Hangzhou, Suzhou, Guangzhou, Ningbo, Nanjing, Wuxi, and Chengdu, said a report of National Business Daily on Wednesday.',cast('1110001010101011001001' as binary),array(5,50,500),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-06-01', col_int='6') select 6,false,6.06,6.006,6.0006,60000000000.0000000006,cast('2020-06-01 00:00:06' as timestamp),'The development of enterprises, as well as the push of local governments, has boosted the listings, the report said. For example, in early 2018, Central China''s Hubei province launched a plan to double its listed companies by including the listing of enterprises into the government''s annual performance assessment.',cast('1110001010101011001001' as binary),array(6,60,600),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-06-01', col_int='7') select 6,false,6.06,6.006,6.0006,60000000000.0000000006,cast('2020-06-01 00:00:06' as timestamp),'The development of enterprises, as well as the push of local governments, has boosted the listings, the report said. For example, in early 2018, Central China''s Hubei province launched a plan to double its listed companies by including the listing of enterprises into the government''s annual performance assessment.',cast('1110001010101011001001' as binary),array(6,60,600),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-07-01', col_int='7') select 7,true,7.07,7.007,7.0007,70000000000.0000000007,cast('2020-07-01 00:00:07' as timestamp),'Most listed companies in the top 10 cities come from industries such as information technology, electronics, mechanical equipment, medical biology, and electric power equipment, with information technology bearing most listed companies.',cast('1110001010101011001001' as binary),array(7,70,700),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-07-01', col_int='8') select 7,true,7.07,7.007,7.0007,70000000000.0000000007,cast('2020-07-01 00:00:07' as timestamp),'Most listed companies in the top 10 cities come from industries such as information technology, electronics, mechanical equipment, medical biology, and electric power equipment, with information technology bearing most listed companies.',cast('1110001010101011001001' as binary),array(7,70,700),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-08-01', col_int='8') select 8,false,8.08,8.008,8.0008,80000000000.0000000008,cast('2020-08-01 00:00:08' as timestamp),'Citi report finds MNC mood in China buoyant',cast('1110001010101011001001' as binary),array(8,80,800),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-08-01', col_int='9') select 8,false,8.08,8.008,8.0008,80000000000.0000000008,cast('2020-08-01 00:00:08' as timestamp),'Citi report finds MNC mood in China buoyant',cast('1110001010101011001001' as binary),array(8,80,800),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-09-01', col_int='9') select 9,true,9.09,9.009,9.0009,90000000000.0000000009,cast('2020-09-01 00:00:09' as timestamp),'Xi: China, Russia major champions of multilateralism, global justice',cast('1110001010101011001001' as binary),array(9,90,900),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-09-01', col_int='10') select 9,true,9.09,9.009,9.0009,90000000000.0000000009,cast('2020-09-01 00:00:09' as timestamp),'Xi: China, Russia major champions of multilateralism, global justice',cast('1110001010101011001001' as binary),array(9,90,900),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-10-01', col_int='10') select 10,false,10.10,10.010,10.00010,100000000000.00000000010,cast('2020-10-01 00:00:10' as timestamp),'Volkswagen deliveries hit by chip shortages',cast('1110001010101011001001' as binary),array(10,100,1000),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-10-01', col_int='11') select 10,false,10.10,10.010,10.00010,100000000000.00000000010,cast('2020-10-01 00:00:10' as timestamp),'Volkswagen deliveries hit by chip shortages',cast('1110001010101011001001' as binary),array(10,100,1000),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-11-01', col_int='11') select 11,true,11.11,11.011,11.00011,110000000000.00000000011,cast('2020-11-01 00:00:11' as timestamp),'Light installations featuring the tiger, the Chinese zodiac animal for the upcoming lunar year, will be set up in the main plaza.',cast('1110001010101011001001' as binary),array(11,110,1100),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-11-01', col_int='12') select 11,true,11.11,11.011,11.00011,110000000000.00000000011,cast('2020-11-01 00:00:11' as timestamp),'Light installations featuring the tiger, the Chinese zodiac animal for the upcoming lunar year, will be set up in the main plaza.',cast('1110001010101011001001' as binary),array(11,110,1100),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-12-01', col_int='12') select 12,false,12.12,12.012,12.00012,120000000000.00000000012,cast('2020-10-01 00:00:10' as timestamp),'The annual lantern show at Yuyuan Garden, a historic tourist site in Shanghai, will kick off on Tuesday and run through Feb 28.',cast('1110001010101011001001' as binary),array(12,120,1200),null,null;
    insert into hudi_par_orc_lzo partition(col_date = '2020-12-01', col_int='13') select 12,false,12.12,12.012,12.00012,120000000000.00000000012,cast('2020-10-01 00:00:10' as timestamp),'The annual lantern show at Yuyuan Garden, a historic tourist site in Shanghai, will kick off on Tuesday and run through Feb 28.',cast('1110001010101011001001' as binary),array(12,120,1200),null,null;
    insert into hudi_par_orc_lzo partition(col_date = null, col_int=null) select 13,null,null,null,null,null,null,null,null,null,null,null;
    insert into hudi_par_orc_lzo partition(col_date = null, col_int=null) select 14,null,null,null,null,12345678901234567890.123456789012345678,null,'',null,null,null,null;
  2. create hudi external table:

    create external table ex_hudi_tbl1_orc_int (
    col_int int null comment "column int"
    ,col_date date
    ,col_string varchar(5)
    ,col_decimal decimal(38, 18)
    ,col_boolean boolean null comment "column boolean")
    ENGINE=hudi
    properties (
    "resource" = "hudi_emr_tn",
    "table" = "hudi_par_orc_lzo",
    "database" = "hudi_db");
  3. query result error:

    
    mysql> select col_int from ex_hudi_tbl1_orc_int limit 2;
    Empty set (0.03 sec)

mysql> select col_int,col_date from ex_hudi_tbl1_orc_int limit 2; Empty set (0.04 sec)

mysql> select col_int,col_date,col_string from ex_hudi_tbl1_orc_int limit 2; +---------+------------+-----------------------------------------------------------------------------------------------------------------------------------+ | col_int | col_date | col_string | +---------+------------+-----------------------------------------------------------------------------------------------------------------------------------+ | 12 | 2020-11-01 | Light installations featuring the tiger, the Chinese zodiac animal for the upcoming lunar year, will be set up in the main plaza. | | 8 | 2020-08-01 | Citi report finds MNC mood in China buoyant | +---------+------------+-----------------------------------------------------------------------------------------------------------------------------------+ 2 rows in set (0.02 sec)


### Expected behavior (Required)
returned right results

### Real behavior (Required)
result error

### StarRocks version (Required)
 - You can get the StarRocks version by executing SQL `select current_version()`

mysql> select current_version(); +------------------------+ | current_version() | +------------------------+ | QA_TEST_MASTER 35e3df8 | +------------------------+ 1 row in set (0.00 sec)

miomiocat commented 2 years ago

i will fix it

tiannan-sr commented 2 years ago
create external table ex_hudi_tbl8 (
                    uuid int
                    ,col_boolean boolean
                    ,col_int int
                    ,col_long bigint
                    ,col_float float
                    ,col_double double
                    ,col_decimal decimal(38,18)
                    ,col_date date
                    ,col_string string
                    ,col_binary varchar(200))
                ENGINE=hudi
                properties (
                    "resource" = "hudi_resource_1648890105966",
                    "table" = "hudi_par_orc_lzo",
                    "database" = "hudi_db")
select count(*) from ex_hudi_tbl8 where col_int >= 10
((0,),)
select count(*) from ex_hudi_tbl8 where col_date not in('2020-12-01','2020-11-01')
((0,),)

in these tow cases, when col_int and col_date as where condition, no rows returned. and it should be return the rows.