apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL][1.2] Result mismatch of get_json_object when json string has newline #7777

Open wForget opened 3 weeks ago

wForget commented 3 weeks ago

Backend

VL (Velox)

Bug description

SQL:

select get_json_object('{"c1":"test\ntest"}', '$.c1')

Result of Gluten 1.2.0 with Velox:

+--------------------------------------------+--+
| get_json_object({"c1":"test
test"}, $.c1)  |
+--------------------------------------------+--+
| NULL                                       |
+--------------------------------------------+--+

Result of vanilla Spark:

+--------------------------------------------+--+
| get_json_object({"c1":"test
test"}, $.c1)  |
+--------------------------------------------+--+
| test
test                                  |
+--------------------------------------------+--+

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

rui-mo commented 3 weeks ago

cc: @PHILO-HE

PHILO-HE commented 3 weeks ago

@wForget, that's strange. I applied the patch below to test your case on the Velox side (1.2.0 Velox branch), and the test passed.

diff --git a/velox/functions/sparksql/tests/JsonFunctionsTest.cpp b/velox/functions/sparksql/tests/JsonFunctionsTest.cpp
index c0c8ecc90..f9448733a 100644
--- a/velox/functions/sparksql/tests/JsonFunctionsTest.cpp
+++ b/velox/functions/sparksql/tests/JsonFunctionsTest.cpp
@@ -119,5 +119,9 @@ TEST_F(GetJsonObjectTest, nullResult) {
       std::nullopt);
 }

+TEST_F(GetJsonObjectTest, escaped) {
+  EXPECT_EQ(getJsonObject(R"({"c1":"test\ntest"})", "$.c1"), "test\ntest");
+}
+
 } // namespace
 } // namespace facebook::velox::functions::sparksql::test

wForget commented 3 weeks ago

R"({"c1":"test\ntest"})"

Does this mean that \n is not escaped?

PHILO-HE commented 3 weeks ago

@wForget, no, it's escaped. Just verified by printing getJsonObject(R"({"c1":"test\ntest"})", "$.c1")
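To make the raw-versus-regular distinction concrete, here is a minimal standalone sketch (not part of the Velox test suite). The helper names `rawLiteralJson`, `regularLiteralJson`, `containsLineFeed`, and `literalChecks` are hypothetical and exist only for illustration. In a C++ raw string literal, `\n` stays as two characters (a backslash followed by `n`), which a JSON parser reads as a valid escape sequence. In a regular literal, the compiler turns `\n` into a single line-feed byte before any parser sees it.

```cpp
#include <cassert>
#include <string>

// Raw literal: the JSON text contains backslash + 'n' (a JSON escape sequence).
std::string rawLiteralJson() {
  return R"({"c1":"test\ntest"})";
}

// Regular literal: the compiler converts \n into one line-feed byte (0x0A),
// so the JSON string value contains an unescaped control character.
std::string regularLiteralJson() {
  return "{\"c1\":\"test\ntest\"}";
}

// Hypothetical helper: true if the text contains a literal line-feed byte.
bool containsLineFeed(const std::string& text) {
  return text.find('\n') != std::string::npos;
}

// Illustrative checks of the difference between the two literals.
void literalChecks() {
  assert(!containsLineFeed(rawLiteralJson()));
  assert(containsLineFeed(regularLiteralJson()));
  // The raw form is one byte longer: two characters instead of one.
  assert(rawLiteralJson().size() == regularLiteralJson().size() + 1);
}
```

The SQL in the original report most likely corresponds to the second form, assuming Spark SQL unescapes `\n` inside the SQL string literal, which the line break in the result header suggests.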

wForget commented 3 weeks ago

> @wForget, no, it's escaped. Just verified by printing getJsonObject(R"({"c1":"test\ntest"})", "$.c1")

Could you try:

const std::string json = R"(
  {
    "c1":"test
test"
  }
  )";
getJsonObject(json, "$.c1");

wForget commented 3 weeks ago

I guess this may be because Spark uses some non-standard JSON parsing behavior.

https://github.com/apache/spark/blob/c53dac05058c48ae1edad7912e8cc82533839ca0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L112-L113

wForget commented 3 weeks ago

It seems that single quotes are also not allowed.

select get_json_object('{\'c1\':\'test test\'}', '$.c1');

Gluten disabled:

+--------------------------------------------+--+
| get_json_object({'c1':'test test'}, $.c1)  |
+--------------------------------------------+--+
| test test                                  |
+--------------------------------------------+--+

Gluten enabled:

+--------------------------------------------+--+
| get_json_object({'c1':'test test'}, $.c1)  |
+--------------------------------------------+--+
| NULL                                       |
+--------------------------------------------+--+
PHILO-HE commented 3 weeks ago

@wForget, this is a known incompatibility when using single quotes. See doc link.

As far as I know, the JSON standard does not allow single quotes to enclose JSON content. I'm not sure why Spark allows them in place of double quotes. We have no plan to support this.

PHILO-HE commented 2 weeks ago

Using a regular string literal instead of a raw string literal in the test reproduces this issue, and it also occurs on the main branch. I found that Presto, like Spark, also allows control characters. We may have to change simdjson's code to fix this, but I'm not sure whether that is acceptable. See the Velox PR: https://github.com/facebookincubator/velox/pull/11433
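For context on why a strict parser rejects this input: RFC 8259 requires control characters (U+0000 through U+001F) inside a JSON string to be escaped. The sketch below is a hypothetical standalone helper, `hasUnescapedControlChar`, that detects such input. It is not simdjson or Velox code, and it does not reflect how the linked PR addresses the issue. It only illustrates the rule that separates the strict and lenient behaviors, and it assumes the input is otherwise well-formed JSON.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns true if any JSON string in `json` contains a
// raw control character (byte value below 0x20), which RFC 8259 forbids.
// Control characters outside strings (e.g. newlines used as whitespace)
// are not reported.
bool hasUnescapedControlChar(const std::string& json) {
  bool inString = false;  // currently between an opening and closing quote
  bool escaped = false;   // previous character was a backslash inside a string
  for (unsigned char c : json) {
    if (!inString) {
      if (c == '"') {
        inString = true;
      }
      continue;
    }
    if (c < 0x20) {
      return true;  // raw control character inside a string value or key
    }
    if (escaped) {
      escaped = false;  // this character completes an escape such as \n or \"
    } else if (c == '\\') {
      escaped = true;
    } else if (c == '"') {
      inString = false;
    }
  }
  return false;
}

// Illustrative checks using the inputs discussed in this thread.
void controlCharChecks() {
  // Regular literal: a real line feed ends up inside the string value.
  assert(hasUnescapedControlChar("{\"c1\":\"test\ntest\"}"));
  // Raw literal: backslash + 'n' is a valid JSON escape sequence.
  assert(!hasUnescapedControlChar(R"({"c1":"test\ntest"})"));
}
```

A lenient parser, as Spark and Presto appear to be for this case, would accept the first input and return the value with the line feed preserved.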