apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Results mismatch when scanning a low-version ORC file #6673

Open NEUpanning opened 1 month ago

NEUpanning commented 1 month ago

Backend

VL (Velox)

Bug description

SparkSQL:

SELECT if(user_type <> -1 ,user_id ,null) as a
from table
where partition_date='2024-07-01' order by a desc limit 10; 

Gluten Result:

gluten
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL

Vanilla Result

vanilla
dp_765265243
dp_71942892
dp_71942892
dp_71942892
dp_71942892
dp_71942892
dp_3779112707
dp_3778736486
dp_3778655687
dp_3778588244

Physical Plan:

== Physical Plan ==
VeloxColumnarToRowExec
+- TakeOrderedAndProjectExecTransformer (limit=10, orderBy=[a#0 DESC NULLS LAST], output=[a#0])
   +- ^(1) ProjectExecTransformer [if (NOT (user_type#6L = -1)) user_id#1 else null AS a#0]
      +- ^(1) NativeFileScan orc table[user_id#1,user_type#6L,partition_date#18] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex(1 paths)[viewfs://******, PartitionFilters: [isnotnull(partition_date#18), (partition_date#18 = 2024-07-01)], PushedFilters: [], ReadSchema: struct<user_id:string,user_type:bigint>

Unfortunately, I can't reproduce it with a new Hive table. I created a new table containing the rows from the original table and submitted the same SQL to Spark; the physical plan is the same as before, but the Gluten result matches vanilla Spark.

Spark version

None

Spark configurations

No response

System information

v1.2.0 rc1

Relevant logs

No response

NEUpanning commented 1 month ago

cc @kecookier

Yohahaha commented 1 month ago

Unfortunately, I can't reproduce it with a new Hive table. I created a new table containing the rows from the original table and submitted the same SQL to Spark; the physical plan is the same as before, but the Gluten result matches vanilla Spark.

Would you log the RowVector in TableScan#getOutput to check whether this issue is caused by the ORC scan?

Z1Wu commented 1 month ago

Unfortunately, I can't reproduce it with a new Hive table.

Would you check the ORC file schema of the old Hive table? It may be a problem with the ORC file itself (hive --orcfiledump). Related ORC scan issue: https://github.com/apache/incubator-gluten/issues/5638.
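
The ORC file schema can also be printed from a spark-shell session via the ORC Java reader API bundled with Spark; this is a minimal sketch, and the file path below is only a placeholder for a real data file of the old table:

import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Placeholder: point this at one ORC data file of the old table's partition.
val file = new Path("/path/to/old_table/partition_date=2024-07-01/000000_0")

// Open the file with the session's Hadoop configuration and print the schema
// stored in the ORC footer, e.g. struct<_col0:bigint,_col1:string>.
val reader = OrcFile.createReader(file, OrcFile.readerOptions(spark.sparkContext.hadoopConfiguration))
println(reader.getSchema.toString)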

NEUpanning commented 1 month ago

@Yohahaha This table contains 10365356 rows, so it's tricky to log the RowVector.

NEUpanning commented 1 month ago

@Z1Wu It looks like the table schemas are the same (DESCRIBE FORMATTED). Old table:

col_name    data_type   comment
user_id string  
user_type   bigint  
partition_date  string  

# Detailed Table Information
Table   old_table
Type    MANAGED
Provider    hive
Serde Library   org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties  [serialization.format=1]
Partition Provider  Catalog

New table:

col_name    data_type   comment
user_type   bigint
user_id string

# Detailed Table Information
Table   panning_if_expression
Owner   hive
Type    MANAGED
Provider    hive
Serde Library   org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat    org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties  [serialization.format=1]
Partition Provider  Catalog

Yohahaha commented 1 month ago

@Yohahaha This table contains 10365356 rows, so it's tricky to log the RowVector.

You can set spark.gluten.sql.debug=true, find the JSON plan and JSON scan split of one task, and then use GenericBenchmark.cc to run the single Velox task.
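
As a minimal sketch, the debug flag named above can be set when the session is built (the rest of the Gluten setup, plugin and memory configs, is assumed to be in place already):

import org.apache.spark.sql.SparkSession

// Build the session with the debug flag so each task dumps its JSON plan and
// scan split, which can then be replayed with the benchmark built from
// GenericBenchmark.cc.
val spark = SparkSession.builder()
  .appName("gluten-plan-dump")
  .config("spark.gluten.sql.debug", "true")
  .getOrCreate()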

Z1Wu commented 1 month ago

@Z1Wu It looks like the table schemas are the same (DESCRIBE FORMATTED). Old table:

A Hive ORC table has a table schema, and its ORC data files should contain a schema as well, but ORC data files written by some old engines (like Hive 1.x) contain an incomplete schema (the column names are missing).

For a Hive ORC table created by:

CREATE TABLE `test_orc_table_hive_gluten`(
  `id` int,
  `name` string)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

You can get the ORC data file schema using this command:

# hive --orcfiledump <your orc data file>
hive --orcfiledump hdfs:///data/warehouse2/test_orc_table_hive_gluten/dt=20240728/000000_0

Malformed ORC schema output looks like the dump below. An ORC file with a schema like Type: struct<_col0:int,_col1:string> can't be read correctly by Gluten because the requested column names don't match the positional _colN names in the file, so the result will always be NULL. The expected ORC file schema is Type: struct<id:int,name:string>.

File Version: 0.12 with HIVE_8732
24/08/01 15:28:01 INFO orc.ReaderImpl: Reading ORC rows from hdfs://data/warehouse2/test_orc_table_hive_gluten/dt=20240728/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 2
Compression: SNAPPY
Compression size: 262144
Type: struct<_col0:int,_col1:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 2 hasNull: false
    Column 1: count: 2 hasNull: false min: 1 max: 2 sum: 3
    Column 2: count: 2 hasNull: false min: a max: b sum: 2

NEUpanning commented 1 month ago

@Z1Wu Thanks for your clarification. It looks like the old table's ORC files lack column names. The old table's ORC file schema is

Type: struct<_col0:bigint,_col1:string>

The new table's ORC file schema is

Type: struct<user_type:bigint,user_id:string>

Z1Wu commented 1 month ago

SELECT if(user_type <> -1 ,user_id ,null) as a
from table
where partition_date='2024-07-01' order by a desc limit 10; 

Gluten Result:

gluten
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL

Vanilla Result

vanilla
dp_765265243
dp_71942892
dp_71942892
dp_71942892
dp_71942892
dp_71942892
dp_3779112707
dp_3778736486
dp_3778655687
dp_3778588244

If the old table's ORC files lack column names, it may be the same problem as https://github.com/apache/incubator-gluten/issues/5638.

You can set spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true to fall back to the vanilla ORC scan operator as a workaround, but it may cause performance degradation.
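
A minimal sketch of applying this workaround at session startup and then checking the plan (old_table stands in for the anonymized table in this issue, and the rest of the Gluten setup is assumed to be configured already):

import org.apache.spark.sql.SparkSession

// Start the session with the fallback flag named in this comment.
val spark = SparkSession.builder()
  .appName("orc-scan-fallback")
  .config("spark.gluten.sql.columnar.backend.velox.orc.scan.enabled", "true")
  .getOrCreate()

// Re-run the failing query; explain() should now show a vanilla ORC scan in
// place of the NativeFileScan node from the plan above.
spark.sql(
  """SELECT if(user_type <> -1, user_id, null) AS a
    |FROM old_table
    |WHERE partition_date = '2024-07-01'
    |ORDER BY a DESC
    |LIMIT 10""".stripMargin).explain()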

NEUpanning commented 1 month ago

You can set spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true to fall back to the vanilla ORC scan operator as a workaround

@Z1Wu @Yohahaha Thanks for your help. I will try falling back the ORC scan to determine whether this issue is caused by the native ORC scan.