Open NEUpanning opened 7 months ago
cc @kecookier
@NEUpanning, I tried testing both the main and 1.1 branches, but the result is NULL, consistent with Spark. Here is my test. Could you help check again?
DROP table if EXISTS tbl;
CREATE TABLE tbl(a STRING) USING parquet;
INSERT INTO tbl values('{"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}');
SELECT get_json_object(a,'$.11000022') FROM tbl;
@PHILO-HE Thanks for your reply. I can't reproduce it either; maybe there are other factors involved. I will take some time to figure it out later.
@NEUpanning Is the issue I'm experiencing the same as this one?
with tb as (
select
concat('{"a": "2", "b": "', string (unhex(hex(id))), '"}') as js,
string(unhex(hex(id))) as st,
id as ascii_num
FROM range(0, 255, 1, 10)
)
select 'gluten', ascii_num,st,get_json_object(js, '$.a') from (select /*+ repartition(2) */ * from tb)
union all
select 'vanilla', ascii_num,st,get_json_object(js, '$.a') from tb
@wang-zhun Could you show the result of this SQL?
@wang-zhun They may be related.
@PHILO-HE Judging from the commit history, you have more expertise in this area. Could you help verify this issue?
// Spark
private[this] object SharedFactory {
val jsonFactory = new JsonFactoryBuilder()
// The two options below enabled for Hive compatibility
.enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS)
.enable(JsonReadFeature.ALLOW_SINGLE_QUOTES)
.build()
}
A simple modification to simdjson can resolve the issue, but it is unclear whether it would have other impacts.
@wang-zhun, thanks for your investigation! As extended ASCII is not compatible with UTF-8, simdjson's UTF-8 validation fails on those cases. I also did a few tests with invalid UTF-8 input; it looks like vanilla Spark always disregards whether the input is valid UTF-8. I will ask some users for feedback to decide whether we should disable the validation.
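For reference, a minimal sketch of why such input trips the validation. This assumes simdjson's public simdjson::validate_utf8(const char*, size_t) entry point, not necessarily the exact call Velox makes:

// A standalone byte in 0x80-0xFF (extended ASCII) is not a valid UTF-8
// sequence, so the document is rejected before any JSON parsing happens.
#include <simdjson.h>
#include <cstdio>

int main() {
  // "\xe9" on its own is invalid UTF-8 (it would need continuation bytes).
  const char json[] = "{\"a\": \"2\", \"b\": \"\xe9\"}";
  bool ok = simdjson::validate_utf8(json, sizeof(json) - 1);
  std::printf("valid utf-8: %s\n", ok ? "true" : "false");  // prints: false
  return 0;
}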
@PHILO-HE Thanks for your effort in #6661. I've cherry-picked this PR onto our branch to test whether this issue has been resolved, but the results still don't match vanilla Spark. The JSON value in this issue doesn't contain extended ASCII and isn't valid JSON. Therefore, this issue will remain open.
@NEUpanning, have you tried with branch-1.2 or main?
@PHILO-HE We are using v1.2.0-rc; a higher version is not supported on our side for now.
@NEUpanning, can the below case reproduce this issue on your side? If not, could you provide a small reproducible case?
DROP table if EXISTS tbl;
CREATE TABLE tbl(a STRING) USING parquet;
INSERT INTO tbl values('{"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}');
SELECT get_json_object(a,'$.11000022') FROM tbl;
@PHILO-HE As discussed before, I still can't reproduce it without our table. When I find a small reproducible case, I'll get in touch with you ASAP. Thanks.
@PHILO-HE @NEUpanning @kecookier I found a SQL query that has a result mismatch:
select get_json_object('{"businessCode":"xxx","msgId":"12","msgTime":23,"assistantId":34,"assistantAccount":"xxx","friend":{"wwUserId":"xxx","nickname":"\\๑","unionId":"xxxx","avatar":"xxx","userType":1,"wwCorpId":"","wwCorpName":"","wwAccount":""},"message":{"contentType":1,"content":"xxx"}}', '$.friend.unionId');
Gluten returns xxxx, but Spark returns NULL.
The reason is that when the JSON parser in Spark meets \, it checks the character after the \ to see whether it forms a valid escape sequence. In this case, \๑ is not a valid escape sequence, so Spark returns NULL. But in simdjson there is no such extra check. So to fix this, I think we need to add code in simdjson to do the check.
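For illustration, a rough sketch of the kind of escape check being proposed here. This is not simdjson or Velox code; the helper name hasOnlyValidEscapes is made up:

// Reject escape sequences that Spark's Jackson-based parser does not accept,
// given the raw (still escaped) bytes of a JSON string value.
#include <cctype>
#include <cstddef>
#include <string_view>

bool hasOnlyValidEscapes(std::string_view raw) {
  for (std::size_t i = 0; i < raw.size(); ++i) {
    if (raw[i] != '\\') {
      continue;
    }
    if (++i == raw.size()) {
      return false;  // dangling backslash at end of string
    }
    switch (raw[i]) {
      case '"': case '\\': case '/': case 'b':
      case 'f': case 'n': case 'r': case 't':
        break;  // standard JSON escapes
      case 'u':
        // \uXXXX must be followed by exactly four hex digits.
        if (i + 4 >= raw.size()) {
          return false;
        }
        for (int k = 1; k <= 4; ++k) {
          if (!std::isxdigit(static_cast<unsigned char>(raw[i + k]))) {
            return false;
          }
        }
        i += 4;
        break;
      default:
        return false;  // e.g. \a or \๑: Spark treats the JSON as invalid and returns NULL
    }
  }
  return true;
}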
@jiangjiangtian, after some investigation, I found that the simdjson ondemand API only validates the structural correctness of the JSON doc and the result for the given JSON path. This is for performance reasons. If the given JSON path is $.friend.nickname, null is returned, since the result for that JSON path contains an invalid escape character. Do you strongly feel that we need alignment with Spark on such cases?
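A minimal sketch of that on-demand behavior, assuming simdjson's ondemand API in error-code style (the actual Gluten/Velox code path differs):

#include <simdjson.h>
#include <iostream>
#include <string_view>

using namespace simdjson;

int main() {
  ondemand::parser parser;
  // "nickname" holds the invalid escape \๑; "unionId" is well formed.
  padded_string json = R"({"friend":{"nickname":"\๑","unionId":"xxxx"}})"_padded;
  ondemand::document doc;
  if (parser.iterate(json).get(doc)) { return 1; }  // structural index + UTF-8 check only

  // Only the requested value gets unescaped. The "nickname" value is skipped raw
  // on the way to "unionId", so its invalid escape is never checked.
  std::string_view unionId;
  if (doc["friend"]["unionId"].get_string().get(unionId)) { return 1; }
  std::cout << unionId << std::endl;  // prints: xxxx

  // Requesting "nickname" instead fails during unescaping, which is why null is
  // returned for $.friend.nickname.
  return 0;
}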
@PHILO-HE Thanks for your investigation. We need to discuss internally to determine whether this alignment is required.
Backend
VL (Velox)
Bug description
The following SQL might lead to wrong results, but it's not yet certain whether other factors are involved. The vanilla Spark result is NULL, but the Gluten result is N.
select get_json_object(extend_attr,'$.11000022') from mart_catering.dim_deal_all_info_ss where mt_deal_id=922798418 and partition_date='2024-03-27';
The extend_attr field value is
{"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}
Unfortunately, I cannot reproduce this issue using a simple SQL like
select get_json_object(col1,'$.11000022') from values('{"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}')
and the result is NULL, the same as the vanilla Spark result.
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response