StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0
8.89k stars 1.78k forks source link

ORC reader will return wrong result when data type is tinyint #44395

Open caneGuy opened 6 months ago

caneGuy commented 6 months ago

Steps to reproduce the behavior (Required)

  1. CREATE TABLE '...'

    CREATE TABLE `test.app_sh01_cube_exp_post_did_analysis_new_user`(
    `gid` string,
    `appid` string,
    `exp_start_date` string, 
    `exp_end_date` string,
    `first_enter_exp_ds` string ,
    `exp_group` string,
    `is_new` string ,
    `platform` string ,
    `ds` string,
    `cube_source_type` string ,
    `base_user_cnt` int ,
    `active_user_cnt` int ,
    `online_time_s` int ,
    `attempt_cnt` int ,
    `success_attempt_cnt` int ,
    `use_gold_coin_resource_num` int ,
    `permananet_item_consume_num` int ,
    `plus5_consume_num` int ,
    `get_pre_stage_limit_time_item_resource_num` int ,
    `get_energy_resource_num` int ,
    `pay_users_cnt` int ,
    `pay_amount_amt` double ,
    `get_pre_stage_limit_item_resource_num` bigint ,
    `use_pre_stage_limit_item_resource_num` bigint ,
    `use_stage_end_add_fivestep_resource_num` bigint ,
    `use_limited_add_fivestep_resource_num` bigint ,
    `get_gold_coin_resource_num` bigint ,
    `get_item_resource_num` bigint ,
    `get_limited_add_fivestep_resource_num` bigint ,
    `gold_coin_resource_stock_num` bigint ,
    `item_resource_stock_num` bigint ,
    `last7dasy_use_gold_coin_resource_num` bigint ,
    `last7dasy_use_item_resource_num` bigint ,
    `stage_fuuu_num` bigint ,
    `play_fail_or_use_addstep_success_num` bigint )
    PARTITIONED BY (
    `report_id` string)
    ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
    STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
    LOCATION
    'xxx'
  2. data file offline @GavinMar

  3. SELECT  '....'
    SELECT DATEDIFF(DATE(ds),DATE(first_enter_exp_ds)) observe_window,
       gid,
          exp_group,
          '0' sql_label,
              active_user_cnt
    FROM test.app_sh01_cube_exp_post_did_analysis_new_user
    WHERE report_id='4865547759250313217'
     AND ds>='2024-03-28'
     AND ds<='2024-07-03'
     AND ds>=first_enter_exp_ds
     and gid = '23876242'
     AND cube_source_type IN ('intervene')
     limit 200

    image

SELECT DATEDIFF(DATE(ds),DATE(first_enter_exp_ds)) observe_window,
       gid,
          exp_group,
          '0' sql_label,
              active_user_cnt
   FROM hive_default_catalog.test.app_sh01_cube_exp_post_did_analysis_new_user
   WHERE report_id='4865547759250313217'
     AND ds>='2024-03-28'
     AND ds<='2024-07-03'
     AND ds>=first_enter_exp_ds
     and gid = '23876242'
     AND cube_source_type IN ('intervene')
     limit 200

image

Expected behavior (Required)

Real behavior (Required)

StarRocks version (Required)

letian-jiang commented 6 months ago

By default, columns in ORC files are accessed by their ordinal position in the Hive table definition. However, the table definition and orc file schema do not match in this case.

letian-jiang commented 6 months ago
image
letian-jiang commented 6 months ago

After set orc_use_column_names = true;, the result is correct.

image
letian-jiang commented 6 months ago

related PR https://github.com/StarRocks/starrocks/pull/42777

github-actions[bot] commented 1 week ago

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!