NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] fix issues with repeated JSON columns #11361

Open revans2 opened 3 weeks ago

revans2 commented 3 weeks ago

Describe the bug
To be clear, I don't think this is a very important use case, especially because Spark is inconsistent here too.

A JSON object can technically have repeated keys in it.

{"data":100,"data":200}

The JSON spec goes out of its way not to specify how to interpret the data, only what is and is not technically valid JSON. So repeated keys are ambiguous.

So for json_tuple the last non-null value always wins.

scala> Seq("""{"data":100,"data":200}""","""{"data":null,"data":200}""","""{"data":100,"data":null}""").toDF("json").selectExpr("*", "json_tuple(json, 'data')").show(false)
+------------------------+---+
|json                    |c0 |
+------------------------+---+
|{"data":100,"data":200} |200|
|{"data":null,"data":200}|200|
|{"data":100,"data":null}|100|
+------------------------+---+

Data type does not really matter, because everything is returned as a String.

scala> Seq("""{"data":100,"data":"200"}""","""{"data":null,"data":false}""","""{"data":100.012,"data":null}""").toDF("json").selectExpr("*", "json_tuple(json, 'data')").show(false)
+----------------------------+-------+
|json                        |c0     |
+----------------------------+-------+
|{"data":100,"data":"200"}   |200    |
|{"data":null,"data":false}  |false  |
|{"data":100.012,"data":null}|100.012|
+----------------------------+-------+
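The json_tuple rule above can be sketched in plain Python (a simulation of the observed behavior, not Spark's code): parse with `json.loads` and an `object_pairs_hook` so repeated keys are visible, then let the last non-null value win.

```python
import json

def last_non_null(pairs):
    # Simulate json_tuple's rule: for repeated keys, the last
    # non-null value wins; a null only "wins" if nothing else exists.
    out = {}
    for k, v in pairs:
        if v is not None or k not in out:
            out[k] = v
    return out

for doc in ['{"data":100,"data":200}',
            '{"data":null,"data":200}',
            '{"data":100,"data":null}']:
    print(json.loads(doc, object_pairs_hook=last_non_null)["data"])
# prints 200, 200, 100 -- matching the table above
```

`object_pairs_hook` is handy here because the default `json.loads` silently keeps only the last value, hiding the duplicates entirely.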

For from_json it is much more complicated because types are involved.

Here it appears that the last value wins, so long as the input data can be coerced to the desired type. And null can apparently always be coerced???

scala> Seq("""{"data":100,"data":"200"}""","""{"data":"100","data":200}""","""{"data":null,"data":false}""","""{"data":true,"data":false}""","""{"data":100.012,"data":null}""").toDF("json").selectExpr("*", "from_json(json, 'data string')").show(false)
+----------------------------+---------------+
|json                        |from_json(json)|
+----------------------------+---------------+
|{"data":100,"data":"200"}   |{200}          |
|{"data":"100","data":200}   |{200}          |
|{"data":null,"data":false}  |{false}        |
|{"data":true,"data":false}  |{false}        |
|{"data":100.012,"data":null}|{null}         |
+----------------------------+---------------+

In this case the last row is actually a null, not the string "null". If I ask for it as a long, then the "200" in the first row is ignored because it cannot be made into a long.

scala> Seq("""{"data":100,"data":"200"}""","""{"data":"100","data":200}""","""{"data":null,"data":false}""","""{"data":true,"data":false}""","""{"data":100.012,"data":null}""").toDF("json").selectExpr("*", "from_json(json, 'data long')").show(false)
+----------------------------+---------------+
|json                        |from_json(json)|
+----------------------------+---------------+
|{"data":100,"data":"200"}   |{100}          |
|{"data":"100","data":200}   |{200}          |
|{"data":null,"data":false}  |{null}         |
|{"data":true,"data":false}  |{null}         |
|{"data":100.012,"data":null}|{null}         |
+----------------------------+---------------+

Double is similar, but even though "100.012" can be made into a double, the last null value still wins, which is different from json_tuple.

scala> Seq("""{"data":100,"data":"200"}""","""{"data":"100","data":200}""","""{"data":null,"data":false}""","""{"data":true,"data":false}""","""{"data":100.012,"data":null}""").toDF("json").selectExpr("*", "from_json(json, 'data double')").show(false)
+----------------------------+---------------+
|json                        |from_json(json)|
+----------------------------+---------------+
|{"data":100,"data":"200"}   |{100.0}        |
|{"data":"100","data":200}   |{200.0}        |
|{"data":null,"data":false}  |{null}         |
|{"data":true,"data":false}  |{null}         |
|{"data":100.012,"data":null}|{null}         |
+----------------------------+---------------+
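My reading of the from_json behavior above can be sketched like this (an assumption inferred from the tables, not Spark's actual code): scan the repeated values in order, keep the last one that coerces to the target type, and let an explicit JSON null overwrite even a good value.

```python
import json

def to_long(v):
    # Reject strings, bools, and floats, mirroring the observed
    # behavior where "200" cannot be made into a long.
    if isinstance(v, int) and not isinstance(v, bool):
        return v
    raise ValueError(v)

def from_json_field(doc, key, coerce):
    # Hypothetical model of from_json's duplicate-key handling.
    result = None
    for k, v in json.loads(doc, object_pairs_hook=lambda pairs: pairs):
        if k != key:
            continue
        if v is None:
            result = None        # null always "coerces" and overwrites
        else:
            try:
                result = coerce(v)
            except ValueError:
                pass             # uncoercible values are skipped
    return result

print(from_json_field('{"data":100,"data":"200"}', "data", to_long))    # 100
print(from_json_field('{"data":"100","data":200}', "data", to_long))    # 200
print(from_json_field('{"data":100.012,"data":null}', "data", to_long)) # None
```

This reproduces all five long rows above, including the null-overwrites case that differs from json_tuple.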

Nesting appears to be very similar: last wins, so long as the data can be coerced into the desired data type.

scala> Seq("""{"data":100,"data":{"a":50,"a":100}}""","""{"data":{"b":100},"data":{"a":200}}""","""{"data":{"a":100},"data":{"b":200}}""","""{"data":{"a":101,"a":102},"data":null}""","""{"data":{"a":100},"data":"test"}""").toDF("json").selectExpr("*", "from_json(json, 'data struct<a:string>')").show(false)
+--------------------------------------+---------------+
|json                                  |from_json(json)|
+--------------------------------------+---------------+
|{"data":100,"data":{"a":50,"a":100}}  |{{100}}        |
|{"data":{"b":100},"data":{"a":200}}   |{{200}}        |
|{"data":{"a":100},"data":{"b":200}}   |{{null}}       |
|{"data":{"a":101,"a":102},"data":null}|{null}         |
|{"data":{"a":100},"data":"test"}      |{{100}}        |
+--------------------------------------+---------------+

get_json_object appears to always be first non-null match wins.

scala> Seq("""{"data":100,"data":{"a":50,"a":100}}""","""{"data":{"b":100},"data":{"a":200}}""","""{"data":{"a":100},"data":{"b":200}}""","""{"data":{"a":101,"a":102},"data":null}""","""{"data":{"a":100},"data":"test"}""").toDF("json").selectExpr("*", "get_json_object(json, '$.data.a')").show(false)
+--------------------------------------+-------------------------------+
|json                                  |get_json_object(json, $.data.a)|
+--------------------------------------+-------------------------------+
|{"data":100,"data":{"a":50,"a":100}}  |50                             |
|{"data":{"b":100},"data":{"a":200}}   |200                            |
|{"data":{"a":100},"data":{"b":200}}   |100                            |
|{"data":{"a":101,"a":102},"data":null}|101                            |
|{"data":{"a":100},"data":"test"}      |100                            |
+--------------------------------------+-------------------------------+
scala> Seq("""{"data":100,"data":"200"}""","""{"data":"100","data":200}""","""{"data":null,"data":false}""","""{"data":true,"data":false}""","""{"data":100.012,"data":null}""").toDF("json").selectExpr("*", "get_json_object(json, '$.data')").show(false)
+----------------------------+-----------------------------+
|json                        |get_json_object(json, $.data)|
+----------------------------+-----------------------------+
|{"data":100,"data":"200"}   |100                          |
|{"data":"100","data":200}   |100                          |
|{"data":null,"data":false}  |false                        |
|{"data":true,"data":false}  |true                         |
|{"data":100.012,"data":null}|100.012                      |
+----------------------------+-----------------------------+

Happily, our implementation of get_json_object is already fairly close on this, but not 100% of the way there: it implements the first-match part but is missing the non-null part.

mattahrens commented 3 weeks ago

Potential scope: throw an exception when a repeated key is encountered during JSON parsing in the plugin.
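A minimal sketch of that fail-fast idea, again in Python rather than the plugin's actual GPU parsing path: detect the repeated key as the pairs are consumed and raise instead of picking a winner.

```python
import json

def reject_repeated_keys(pairs):
    # Hypothetical policy: error out on any repeated key rather than
    # choosing between first-wins and last-wins semantics.
    seen = set()
    for k, _ in pairs:
        if k in seen:
            raise ValueError(f"repeated JSON key: {k!r}")
        seen.add(k)
    return dict(pairs)

# The hook runs at every nesting level, so nested repeats are caught too.
json.loads('{"data":{"a":1},"other":2}', object_pairs_hook=reject_repeated_keys)  # ok
try:
    json.loads('{"data":100,"data":200}', object_pairs_hook=reject_repeated_keys)
except ValueError as e:
    print(e)  # repeated JSON key: 'data'
```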