StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0
8.41k stars 1.7k forks source link

ORC reader will return wrong result when data type is tinyint #44395

Open caneGuy opened 3 months ago

caneGuy commented 3 months ago

Steps to reproduce the behavior (Required)

  1. CREATE TABLE '...'

    CREATE TABLE `test.app_sh01_cube_exp_post_did_analysis_new_user`(
    `gid` string,
    `appid` string,
    `exp_start_date` string, 
    `exp_end_date` string,
    `first_enter_exp_ds` string ,
    `exp_group` string,
    `is_new` string ,
    `platform` string ,
    `ds` string,
    `cube_source_type` string ,
    `base_user_cnt` int ,
    `active_user_cnt` int ,
    `online_time_s` int ,
    `attempt_cnt` int ,
    `success_attempt_cnt` int ,
    `use_gold_coin_resource_num` int ,
    `permananet_item_consume_num` int ,
    `plus5_consume_num` int ,
    `get_pre_stage_limit_time_item_resource_num` int ,
    `get_energy_resource_num` int ,
    `pay_users_cnt` int ,
    `pay_amount_amt` double ,
    `get_pre_stage_limit_item_resource_num` bigint ,
    `use_pre_stage_limit_item_resource_num` bigint ,
    `use_stage_end_add_fivestep_resource_num` bigint ,
    `use_limited_add_fivestep_resource_num` bigint ,
    `get_gold_coin_resource_num` bigint ,
    `get_item_resource_num` bigint ,
    `get_limited_add_fivestep_resource_num` bigint ,
    `gold_coin_resource_stock_num` bigint ,
    `item_resource_stock_num` bigint ,
    `last7dasy_use_gold_coin_resource_num` bigint ,
    `last7dasy_use_item_resource_num` bigint ,
    `stage_fuuu_num` bigint ,
    `play_fail_or_use_addstep_success_num` bigint )
    PARTITIONED BY (
    `report_id` string)
    ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
    STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
    LOCATION
    'xxx'
  2. data file offline @GavinMar

  3. SELECT  '....'
    SELECT DATEDIFF(DATE(ds),DATE(first_enter_exp_ds)) observe_window,
       gid,
          exp_group,
          '0' sql_label,
              active_user_cnt
    FROM test.app_sh01_cube_exp_post_did_analysis_new_user
    WHERE report_id='4865547759250313217'
     AND ds>='2024-03-28'
     AND ds<='2024-07-03'
     AND ds>=first_enter_exp_ds
     and gid = '23876242'
     AND cube_source_type IN ('intervene')
     limit 200

    image

SELECT DATEDIFF(DATE(ds),DATE(first_enter_exp_ds)) observe_window,
       gid,
          exp_group,
          '0' sql_label,
              active_user_cnt
   FROM hive_default_catalog.test.app_sh01_cube_exp_post_did_analysis_new_user
   WHERE report_id='4865547759250313217'
     AND ds>='2024-03-28'
     AND ds<='2024-07-03'
     AND ds>=first_enter_exp_ds
     and gid = '23876242'
     AND cube_source_type IN ('intervene')
     limit 200

image

Expected behavior (Required)

Real behavior (Required)

StarRocks version (Required)

letian-jiang commented 3 months ago

By default, columns in ORC files are accessed by their ordinal position in the Hive table definition. However, the table definition and orc file schema do not match in this case.

letian-jiang commented 3 months ago
image
letian-jiang commented 3 months ago

After set orc_use_column_names = true;, the result is correct.

image
letian-jiang commented 3 months ago

related PR https://github.com/StarRocks/starrocks/pull/42777