apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0
1.22k stars 437 forks source link

[CH] ORC read/write mismatch when map column contains null #8021

Open taiyang-li opened 15 hours ago

taiyang-li commented 15 hours ago

Description

First execute below sql with native write enabled.

drop table if exists tmp.tnm;
create table tmp.tnm using orc as 
with data_source as (
select
id as uid,
case when random() < 0.1 then null else floor(random() * 100) end as rec_room_id,
case when random() < 0.1 then null else floor(random() * 100) end as room_id,
case when random() < 0.1 then null else floor(random() * 100) end as dispatch_id,
case when random() < 0.1 then null else floor(random() * 100) end as gift_value_total,
case when random() < 0.1 then null else floor(random() * 100) end as follow_channel,
case when random() < 0.1 then null else floor(random() * 100) end as follow_user,
case when random() < 0.1 then null else floor(random() * 100) end as followed_channel,
case when random() < 0.1 then null else floor(random() * 100) end as need_filter,
case when random() < 0.1 then null else floor(random() * 100) end as mic_time
from range(100000)
)
select
uid,
rec_room_id,
room_id,
dispatch_id,
str_to_map(
concat(
'gift_value_total:', gift_value_total,
',follow_channel:', follow_channel,
',follow_user:', follow_user,
',followed_channel:', followed_channel,
',filter:', need_filter,
',mic_time:', mic_time
)
) as label_map,
mic_time
from
data_source;

Then download the orc file and view its content. We can see that the values of mic_time and label_map['mic_time'] don't match.

{"uid": 0, "rec_room_id": 28, "room_id": null, "dispatch_id": 65, "label_map": [{"key": "gift_value_total", "value": "75"}, {"key": "follow_channel", "value": "96"}, {"key": "follow_user", "value": "45"}, {"key": "followed_channel", "value": "64"}, {"key": "filter", "value": "62"}, {"key": "mic_time", "value": "86"}], "mic_time": 86}
{"uid": 1, "rec_room_id": 90, "room_id": 55, "dispatch_id": null, "label_map": null, "mic_time": 64}
{"uid": 2, "rec_room_id": 84, "room_id": 40, "dispatch_id": 38, "label_map": [{"key": "gift_value_total", "value": "2"}, {"key": "follow_channel", "value": "54"}, {"key": "follow_user", "value": "44"}, {"key": "followed_channel", "value": "62"}, {"key": "filter", "value": "68"}, {"key": "mic_time", "value": "64"}], "mic_time": 7}
{"uid": 3, "rec_room_id": 37, "room_id": null, "dispatch_id": 54, "label_map": [{"key": "gift_value_total", "value": "22"}, {"key": "follow_channel", "value": "2"}, {"key": "follow_user", "value": "66"}, {"key": "followed_channel", "value": "69"}, {"key": "filter", "value": "34"}, {"key": "mic_time", "value": "7"}], "mic_time": 86}
{"uid": 4, "rec_room_id": 2, "room_id": 64, "dispatch_id": 41, "label_map": [{"key": "gift_value_total", "value": "66"}, {"key": "follow_channel", "value": "24"}, {"key": "follow_user", "value": "3"}, {"key": "followed_channel", "value": "94"}, {"key": "filter", "value": "32"}, {"key": "mic_time", "value": "86"}], "mic_time": 63}
{"uid": 5, "rec_room_id": 10, "room_id": 28, "dispatch_id": 49, "label_map": null, "mic_time": 14}
{"uid": 6, "rec_room_id": null, "room_id": 74, "dispatch_id": 23, "label_map": null, "mic_time": 44}
{"uid": 7, "rec_room_id": 61, "room_id": 65, "dispatch_id": 46, "label_map": [{"key": "gift_value_total", "value": "22"}, {"key": "follow_channel", "value": "15"}, {"key": "follow_user", "value": "28"}, {"key": "followed_channel", "value": "70"}, {"key": "filter", "value": "30"}, {"key": "mic_time", "value": "63"}], "mic_time": 51}
{"uid": 8, "rec_room_id": null, "room_id": 20, "dispatch_id": 90, "label_map": [{"key": "gift_value_total", "value": "14"}, {"key": "follow_channel", "value": "97"}, {"key": "follow_user", "value": "99"}, {"key": "followed_channel", "value": "10"}, {"key": "filter", "value": "77"}, {"key": "mic_time", "value": "14"}], "mic_time": 41}
{"uid": 9, "rec_room_id": 16, "room_id": null, "dispatch_id": null, "label_map": null, "mic_time": 9} 
taiyang-li commented 15 hours ago

Reason: orc requires offsets[i+1] == offset[i] when the i-th row is null in MapVectorBatch, otherwise the written orc batch is not consistent with CH column. So is ListVectorBatch. But in CH, the nullable map column returned from function str_to_map maybe like:

| nullmap | offsets | keys |
|---------|---------|------|
| 1       | 2      | [k1, k2]  |
| 1       | 4      | [k3, k4]  |
| 0       | 6      | [k5, k6]  |
| 1       | 8      | [k7, k8]  |
taiyang-li commented 15 hours ago

Solution: Recursively truncate non-empty nested data when current row is null in CH Map column before writing to ORC/Parquet.