Open NEUpanning opened 1 month ago
cc @kecookier
Unfortunately, I can't reproduce it with a new Hive table. I created a new table containing the rows of the original table and submitted the same SQL to Spark; even the physical plan is the same as before, but the Gluten result matches vanilla Spark.
would you log RowVector in TableScan#getOutput to check whether this issue is caused by the ORC scan?
> Unfortunately, I can't reproduce it with a new Hive table.
would you check the ORC file schema of the old Hive table? Maybe it is a problem with the ORC file itself. (hive --orcfiledump)
@Yohahaha This table contains 10365356 rows. It's tricky to log RowVector.
@Z1Wu It looks like the table schemas are the same. (DESCRIBE FORMATTED) old table:
```
col_name        data_type  comment
user_id         string
user_type       bigint
partition_date  string

# Detailed Table Information
Table               old_table
Type                MANAGED
Provider            hive
Serde Library       org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat         org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties  [serialization.format=1]
Partition Provider  Catalog
```
new table:
```
col_name   data_type  comment
user_type  bigint
user_id    string

# Detailed Table Information
Table               panning_if_expression
Owner               hive
Type                MANAGED
Provider            hive
Serde Library       org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat         org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties  [serialization.format=1]
Partition Provider  Catalog
```
> @Yohahaha This table contains 10365356 rows. It's tricky to log RowVector.
you can set spark.gluten.sql.debug=true, then find the JSON plan and JSON scan split of one task and use GenericBenchmark.cc to run the single Velox task.
> @Z1Wu It looks like the table schemas are the same. (DESCRIBE FORMATTED)
A Hive ORC table has a table schema, and its ORC data files should contain a schema too. However, ORC data files written by some old engines (like Hive 1.x) contain an incomplete schema that lacks column names.
Take a Hive ORC table created by:
```sql
CREATE TABLE `test_orc_table_hive_gluten`(
  `id` int,
  `name` string)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
```
You can get the ORC data file schema using this command:
```shell
# hive --orcfiledump <your orc data file>
hive --orcfiledump hdfs:///data/warehouse2/test_orc_table_hive_gluten/dt=20240728/000000_0
```
Malformed ORC schema output looks like below. An ORC file with a schema like `Type: struct<_col0:int,_col1:string>` can't be read correctly by Gluten; the result will always be NULL. The expected ORC file schema is `Type: struct<id:int,name:string>`.
```
File Version: 0.12 with HIVE_8732
24/08/01 15:28:01 INFO orc.ReaderImpl: Reading ORC rows from hdfs://data/warehouse2/test_orc_table_hive_gluten/dt=20240728/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 2
Compression: SNAPPY
Compression size: 262144
Type: struct<_col0:int,_col1:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 2 hasNull: false
    Column 1: count: 2 hasNull: false min: 1 max: 2 sum: 3
    Column 2: count: 2 hasNull: false min: a max: b sum: 2
```
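The placeholder pattern above can be spotted programmatically from the `Type:` line of the orcfiledump output. A minimal sketch in Python (the helper name and the regex are my own, not part of Gluten or Hive; it assumes a flat struct with no nested types):

```python
import re

def has_placeholder_columns(type_line):
    """Return True if an orcfiledump 'Type:' line uses the generic
    _col0, _col1, ... names written by old Hive ORC writers."""
    m = re.search(r"struct<(.+)>", type_line)
    if not m:
        return False
    # Split "name:type" fields and keep only the names.
    names = [field.split(":", 1)[0] for field in m.group(1).split(",")]
    return all(re.fullmatch(r"_col\d+", name) for name in names)

# Schema from an old-engine file: real column names are missing.
print(has_placeholder_columns("Type: struct<_col0:int,_col1:string>"))  # True
# Schema from a healthy file: real column names are present.
print(has_placeholder_columns("Type: struct<id:int,name:string>"))      # False
```

Running this over each file's dump output would tell you which partitions were written by an old engine and will misbehave under Gluten.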
@Z1Wu Thanks for your clarification. It looks like the old table's ORC files lack column names. The old table's ORC file schema is
Type: struct<_col0:bigint,_col1:string>
The new table's ORC file schema is
Type: struct<user_type:bigint,user_id:string>
```sql
SELECT if(user_type <> -1, user_id, null) AS a FROM table WHERE partition_date='2024-07-01' ORDER BY a DESC LIMIT 10;
```
Gluten result:
NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
Vanilla result:
dp_765265243 dp_71942892 dp_71942892 dp_71942892 dp_71942892 dp_71942892 dp_3779112707 dp_3778736486 dp_3778655687 dp_3778588244
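For reference, the expected semantics of the if expression in that query can be sketched in plain Python (the sample rows below are made up for illustration, not the real data):

```python
def if_expr(user_type, user_id):
    # Mirrors Spark's if(user_type <> -1, user_id, null):
    # keep user_id when user_type is not -1, otherwise NULL.
    return user_id if user_type != -1 else None

# Hypothetical (user_type, user_id) rows.
rows = [(5, "dp_765265243"), (-1, "dp_71942892"), (7, "dp_3779112707")]
print([if_expr(t, u) for t, u in rows])
# ['dp_765265243', None, 'dp_3779112707']
```

Only rows with user_type = -1 should yield NULL, which is why an all-NULL Gluten result points at the scan reading user_type/user_id incorrectly rather than at the expression itself.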
If the old table's ORC files lack column names, it may be the same problem as https://github.com/apache/incubator-gluten/issues/5638
You can set spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true to fall back to the vanilla ORC scan operator as a workaround, but it may cause performance degradation.
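The workaround can be wired up when launching the job. A minimal sketch (the helper below is my own, just rendering the flag for spark-submit; the config key is the one quoted above):

```python
# Config for falling back from Velox's native ORC scan to the vanilla one.
WORKAROUND_CONF = {
    "spark.gluten.sql.columnar.backend.velox.orc.scan.enabled": "true",
}

def spark_submit_conf_args(conf):
    """Render a config dict as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args

print(" ".join(spark_submit_conf_args(WORKAROUND_CONF)))
# --conf spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true
```

The same key can equally be set via SparkSession.builder.config or spark-defaults.conf; only the key/value pair matters.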
@Z1Wu @Yohahaha Thanks for your help. I will try falling back the ORC scan to determine whether this issue is caused by the native ORC scan.
Backend
VL (Velox)
Bug description
SparkSQL:
Gluten Result:
Vanilla Result:
Physical Plan:
Spark version
None
Spark configurations
No response
System information
v1.2.0 rc1
Relevant logs
No response