ORC reader will return wrong result when data type is tinyint

caneGuy commented 6 months ago

Steps to reproduce the behavior (Required)

CREATE TABLE '...'

CREATE TABLE `test.app_sh01_cube_exp_post_did_analysis_new_user`(
`gid` string,
`appid` string,
`exp_start_date` string, 
`exp_end_date` string,
`first_enter_exp_ds` string ,
`exp_group` string,
`is_new` string ,
`platform` string ,
`ds` string,
`cube_source_type` string ,
`base_user_cnt` int ,
`active_user_cnt` int ,
`online_time_s` int ,
`attempt_cnt` int ,
`success_attempt_cnt` int ,
`use_gold_coin_resource_num` int ,
`permananet_item_consume_num` int ,
`plus5_consume_num` int ,
`get_pre_stage_limit_time_item_resource_num` int ,
`get_energy_resource_num` int ,
`pay_users_cnt` int ,
`pay_amount_amt` double ,
`get_pre_stage_limit_item_resource_num` bigint ,
`use_pre_stage_limit_item_resource_num` bigint ,
`use_stage_end_add_fivestep_resource_num` bigint ,
`use_limited_add_fivestep_resource_num` bigint ,
`get_gold_coin_resource_num` bigint ,
`get_item_resource_num` bigint ,
`get_limited_add_fivestep_resource_num` bigint ,
`gold_coin_resource_stock_num` bigint ,
`item_resource_stock_num` bigint ,
`last7dasy_use_gold_coin_resource_num` bigint ,
`last7dasy_use_item_resource_num` bigint ,
`stage_fuuu_num` bigint ,
`play_fail_or_use_addstep_success_num` bigint )
PARTITIONED BY (
`report_id` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'xxx'

data file offline @GavinMar

SELECT  '....'
SELECT DATEDIFF(DATE(ds),DATE(first_enter_exp_ds)) observe_window,
   gid,
      exp_group,
      '0' sql_label,
          active_user_cnt
FROM test.app_sh01_cube_exp_post_did_analysis_new_user
WHERE report_id='4865547759250313217'
 AND ds>='2024-03-28'
 AND ds<='2024-07-03'
 AND ds>=first_enter_exp_ds
 and gid = '23876242'
 AND cube_source_type IN ('intervene')
 limit 200

SELECT DATEDIFF(DATE(ds),DATE(first_enter_exp_ds)) observe_window,
       gid,
          exp_group,
          '0' sql_label,
              active_user_cnt
   FROM hive_default_catalog.test.app_sh01_cube_exp_post_did_analysis_new_user
   WHERE report_id='4865547759250313217'
     AND ds>='2024-03-28'
     AND ds<='2024-07-03'
     AND ds>=first_enter_exp_ds
     and gid = '23876242'
     AND cube_source_type IN ('intervene')
     limit 200

Expected behavior (Required)

Real behavior (Required)

StarRocks version (Required)

You can get the StarRocks version by executing SQL select current_version() 3.1.5

letian-jiang commented 6 months ago

By default, columns in ORC files are accessed by their ordinal position in the Hive table definition. However, the table definition and orc file schema do not match in this case.

letian-jiang commented 6 months ago

After set orc_use_column_names = true;, the result is correct.

letian-jiang commented 6 months ago

github-actions[bot] commented 1 week ago

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!

StarRocks / starrocks