apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Results mismatch with vanilla Spark on release-1.1 when using the get_json_object operator #5253

Open NEUpanning opened 7 months ago

NEUpanning commented 7 months ago

Backend

VL (Velox)

Bug description

The following SQL can produce a wrong result, but it's not yet certain whether other factors are involved. The vanilla Spark result is NULL, but the Gluten result is N.

select get_json_object(extend_attr,'$.11000022') from mart_catering.dim_deal_all_info_ss where mt_deal_id=922798418 and partition_date='2024-03-27';

The extend_attr field value is {"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}

Unfortunately, I cannot reproduce this issue using a simple SQL statement like select get_json_object(col1,'$.11000022') from values('{"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}') — the result is NULL, the same as vanilla Spark's.
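One possible explanation for the gap between the table and the inline repro: the stored column value may keep a literal backslash before the second 'a' (the invalid JSON escape \a), whereas Spark's string-literal unescaping could strip that backslash from the inline values(...) version, producing valid JSON. A minimal Python sketch of how a strict parser (like Jackson, which vanilla Spark uses) would behave under that assumption:

```python
import json

# Hypothetical: the stored value contains the invalid escape \a inside the
# nested string, while the inline literal does not. A strict parser rejects
# the whole document; an on-demand parser that only validates the requested
# path could still return "N" for $.11000022.
invalid_doc = '{"142": "[{\\"name\\": \\"a\\a\\"}]", "11000022": "N"}'
valid_doc = '{"142": "[{\\"name\\": \\"aa\\"}]", "11000022": "N"}'

def strict_get(doc: str, key: str):
    """Jackson-style behavior: any invalid escape poisons the whole document."""
    try:
        return json.loads(doc)[key]
    except json.JSONDecodeError:
        return None

print(strict_get(invalid_doc, "11000022"))  # None (Spark would show NULL)
print(strict_get(valid_doc, "11000022"))    # N
```

This is only a sketch of the suspected mechanism, not Gluten's or Spark's actual code path.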

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

NEUpanning commented 7 months ago

cc @kecookier

PHILO-HE commented 7 months ago

@NEUpanning, I tried testing both the main and 1.1 branches, and the result is NULL, consistent with Spark. Here is my test. Could you help check again?

DROP table if EXISTS tbl;
CREATE TABLE tbl(a STRING) USING parquet;
INSERT INTO tbl values('{"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}');
SELECT get_json_object(a,'$.11000022') FROM tbl;
NEUpanning commented 7 months ago

@PHILO-HE Thanks for your reply. I can't reproduce it either; maybe there are other factors involved. I'll take some time to figure it out later.

wang-zhun commented 4 months ago

@NEUpanning Is the issue I'm experiencing the same as this one?

WITH tb AS (
  SELECT
    concat('{"a": "2", "b": "', string(unhex(hex(id))), '"}') AS js,
    string(unhex(hex(id))) AS st,
    id AS ascii_num
  FROM range(0, 255, 1, 10)
)
SELECT 'gluten', ascii_num, st, get_json_object(js, '$.a') FROM (SELECT /*+ repartition(2) */ * FROM tb)
UNION ALL
SELECT 'vanilla', ascii_num, st, get_json_object(js, '$.a') FROM tb
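If this is the same class of problem, the divergence can be sketched outside Spark: string(unhex(hex(id))) for id >= 128 yields a single byte that is not a valid UTF-8 sequence on its own, and simdjson validates UTF-8 over the whole document before extracting anything. A small Python check (names are mine, not Gluten's):

```python
# Single bytes in the range 0x80-0xFF are continuation or invalid bytes,
# never complete UTF-8 sequences, so a document embedding one raw fails
# up-front UTF-8 validation even though the queried field "$.a" is ASCII.
def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

invalid_ids = [i for i in range(256) if not is_valid_utf8(bytes([i]))]
print(min(invalid_ids), max(invalid_ids))  # 128 255
```

Under this reading, rows with ascii_num >= 128 would return NULL from a validating parser while vanilla Spark, which ignores UTF-8 validity, still extracts "2".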
NEUpanning commented 3 months ago

@wang-zhun Could you show the result of this SQL?

wang-zhun commented 3 months ago

(screenshot of the query results)

NEUpanning commented 3 months ago

@wang-zhun They may be related.

wang-zhun commented 3 months ago

@PHILO-HE Judging from the commit history, you have more expertise in this area. Could you help verify this issue?

// Spark
private[this] object SharedFactory {
  val jsonFactory = new JsonFactoryBuilder()
    // The two options below enabled for Hive compatibility
    .enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS)
    .enable(JsonReadFeature.ALLOW_SINGLE_QUOTES)
    .build()
}
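Jackson's ALLOW_UNESCAPED_CONTROL_CHARS feature, enabled above, makes Spark accept raw control characters inside JSON strings. Python's json module has an analogous switch, which illustrates the strict-versus-lenient difference (an analogy only, not Spark's actual code path):

```python
import json

raw = '{"k": "a\tb"}'  # a literal, unescaped TAB inside the string value

# Default strict mode rejects unescaped control characters, like a parser
# without ALLOW_UNESCAPED_CONTROL_CHARS.
try:
    json.loads(raw)
    strict_accepts = True
except json.JSONDecodeError:
    strict_accepts = False

# strict=False tolerates them, like Jackson with the feature enabled.
lenient = json.loads(raw, strict=False)

print(strict_accepts)  # False
print(lenient["k"] == "a\tb")  # True
```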
wang-zhun commented 3 months ago

A simple modification to simdjson can resolve the issue, but it is uncertain whether there would be other impacts.

PHILO-HE commented 3 months ago

@wang-zhun, thanks for your investigation! As extended ASCII is not compatible with UTF-8, simdjson's UTF-8 validation fails on those cases. I also ran a few tests with invalid UTF-8 input; it looks like vanilla Spark always disregards the input's UTF-8 validity. I will ask some users for feedback to see whether we should disable the validation.

NEUpanning commented 3 months ago

@PHILO-HE Thanks for your effort in #6661. I've cherry-picked this PR onto our branch to test whether this issue has been resolved, but the results still mismatch vanilla Spark. The JSON value in this issue doesn't contain extended ASCII and isn't valid JSON. Therefore, this issue will remain open.

PHILO-HE commented 3 months ago

@NEUpanning, have you tried with branch-1.2 or main?

NEUpanning commented 3 months ago

@PHILO-HE We are using v1.2.0-rc; a higher version is not supported in our environment now.

PHILO-HE commented 3 months ago

@NEUpanning, can the below case reproduce this issue on your side? If not, could you provide a small reproducible case?

DROP table if EXISTS tbl;
CREATE TABLE tbl(a STRING) USING parquet;
INSERT INTO tbl values('{"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}');
SELECT get_json_object(a,'$.11000022') FROM tbl;
NEUpanning commented 3 months ago

@PHILO-HE As in the earlier discussion, I still can't reproduce it without our table. When I find a small reproducible case, I'll get in touch with you asap. Thanks.

jiangjiangtian commented 2 months ago

@PHILO-HE @NEUpanning @kecookier I found a SQL query that has a result mismatch:

select get_json_object('{"businessCode":"xxx","msgId":"12","msgTime":23,"assistantId":34,"assistantAccount":"xxx","friend":{"wwUserId":"xxx","nickname":"\\๑","unionId":"xxxx","avatar":"xxx","userType":1,"wwCorpId":"","wwCorpName":"","wwAccount":""},"message":{"contentType":1,"content":"xxx"}}', '$.friend.unionId');

Gluten returns xxxx, but Spark returns NULL. The reason is that when Spark's JSON parser meets \, it further checks the character after the \ to see whether it forms a valid escape sequence. In this case, \๑ is not a valid escape sequence, so Spark returns NULL.

https://github.com/FasterXML/jackson-core/blob/8744bd42770c9e277d995ef00fb518940efef3ef/src/main/java/com/fasterxml/jackson/core/json/ReaderBasedJsonParser.java#L2648-L2699

But in simdjson, there is no such extra check. So to fix this issue, I think we need to add code in simdjson to do the check.
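The Jackson behavior described here can be reproduced with any strict JSON parser; a small Python sketch of the same case (only the shape of the document is taken from the query above):

```python
import json

# The actual JSON text contains a backslash followed by U+0E51 inside the
# "nickname" value: \๑ is not a valid JSON escape sequence.
doc = '{"friend": {"unionId": "xxxx", "nickname": "\\\u0e51"}}'

def strict_union_id(doc: str):
    """Jackson-style: an invalid escape anywhere fails the whole parse."""
    try:
        return json.loads(doc)["friend"]["unionId"]
    except json.JSONDecodeError:
        return None

print(strict_union_id(doc))  # None, matching vanilla Spark's NULL
```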

PHILO-HE commented 2 months ago

@jiangjiangtian, after some investigations, I found simdjson ondemand API only validates structural correctness of JSON doc and the result for given JSON path. It's for performance consideration. If our given JSON path is $.friend.nickname, null is returned since the result for this JSON path contains invalid escaped char. Do you strongly feel that we require the alignment with Spark on such cases?
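The on-demand behavior can be caricatured with a toy extractor that decodes only the requested top-level key and never touches sibling values (purely illustrative; this is not simdjson's API):

```python
import json
import re

def ondemand_get(doc: str, key: str):
    # Scan for the raw string value of `key` without decoding the rest of
    # the document, roughly like a lazy on-demand parser would.
    m = re.search(r'"%s"\s*:\s*"((?:[^"\\]|\\.)*)"' % re.escape(key), doc)
    if m is None:
        return None
    try:
        return json.loads('"%s"' % m.group(1))  # decode only this value
    except json.JSONDecodeError:
        return None  # invalid escape inside the *accessed* value

doc = '{"unionId": "xxxx", "nickname": "\\\u0e51"}'
print(ondemand_get(doc, "unionId"))   # xxxx: the bad escape is never seen
print(ondemand_get(doc, "nickname"))  # None: the accessed value is invalid
```

This mirrors the trade-off described above: validation cost is paid only for the value the JSON path actually touches.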

jiangjiangtian commented 2 months ago

@PHILO-HE Thanks for your investigations. We need to discuss internally to determine the requirement of the alignment.