NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Match `from_json` behaviour on Databricks 14.3. #11711

Open mythrocks opened 2 weeks ago

mythrocks commented 2 weeks ago

The behaviour of `from_json` seems to have changed on Databricks 14.3.

This was revealed by a test failure (json_matrix_test.py::test_from_json_long_structs) on Databricks. Here is the effective repro, using the input JSON file from that test:

import pyspark.sql.functions as f
from pyspark.sql.types import *

# Run from a pyspark shell (or a Databricks notebook), where `spark` is the
# pre-built SparkSession.
input_file = "/home/ubuntu/spark-rapids/integration_tests/src/test/resources/int_struct_formatted.json"
schema = StructType([StructField("data", StructType([StructField("A", LongType()),
                                                     StructField("B", LongType())]))])

# Read each line as raw text, then parse it against the schema with from_json.
df = spark.read.text(input_file).withColumnRenamed("value", "json")
op_df = df.select(f.col('json'), f.from_json(f.col('json'), schema))

op_df.show(100, False)

The output on Apache Spark 3.5 (and all other Apache Spark versions) is:

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{{NULL, NULL}} |
+----------------------------------------------------------------+---------------+

On Databricks 14.3, the last record parses to {NULL} (the data struct itself is null), not {{NULL, NULL}}:

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{NULL}         |
+----------------------------------------------------------------+---------------+

I fear this will involve a policy change in the cuDF implementation of from_json, and in how it is invoked from the 350db shim. (I'm not an expert on the JSON-parsing end of this.)

revans2 commented 1 week ago

I think we can probably do some of this in post-processing. We have similar issues with overflow on arrays and structs: really old versions of Spark invalidate the entire struct if there was a single overflow in it.

I am not sure what the priority for this really is, though.
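As a rough illustration of the post-processing idea, here is a minimal pure-Python sketch of the two nullification policies, assuming the Databricks 14.3 behaviour is "any field that is present but fails to cast to long nullifies the whole struct". The function names and the policy flag are hypothetical, not the plugin's or cuDF's API:

import json

def parse_long(value):
    """Cast a parsed JSON value to a 64-bit long; None on failure/overflow."""
    if not isinstance(value, int) or isinstance(value, bool):
        return None
    return value if -2**63 <= value < 2**63 else None

def from_json_struct(line, fields, nullify_struct_on_bad_field=False):
    """Parse one JSON line against a flat struct-of-longs schema.

    nullify_struct_on_bad_field=True mimics the observed Databricks 14.3
    behaviour; False mimics Apache Spark 3.5.x.
    """
    try:
        obj = json.loads(line)["data"]
    except (ValueError, KeyError, TypeError):
        return None  # malformed JSON nullifies the struct under both policies
    result = {}
    for name in fields:
        raw = obj.get(name)
        cast = parse_long(raw)
        if nullify_struct_on_bad_field and raw is not None and cast is None:
            return None  # one bad field invalidates the whole struct
        result[name] = cast
    return result

line = '{"data": {"A": 0, "B": "0"}}'
print(from_json_struct(line, ["A", "B"]))                                    # Spark 3.5-style: {'A': 0, 'B': None}
print(from_json_struct(line, ["A", "B"], nullify_struct_on_bad_field=True))  # 14.3-style: None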

ttnghia commented 1 week ago

Please confirm whether this change applies only to overflow, or to other invalid input as well. In particular, please add these input lines to the test file and post the output:

{"data": {"A": 0, "B": xyz}}
{"data": {"A": 0, "B": "0"}}
{"data": {"A": 0, "B": }}
{"data": {"A": 0, "B": "}}

mythrocks commented 1 week ago

Here's the input with @ttnghia's corner cases added:

{"data": {"A": 0, "B": 1}}
{"data": {"A": 1}}
{"data": {"B": 50}}
{"data": {"B": -128, "A": 127}}
{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}
{"data": {"A": 0, "B": xyz}}
{"data": {"A": 0, "B": "0"}}
{"data": {"A": 0, "B": }}
{"data": {"A": 0, "B": "}}

Here's the output from Apache Spark 3.5.x, which matches spark-rapids and nearly all Databricks versions:

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{{NULL, NULL}} |
|{"data": {"A": 0, "B": xyz}}                                    |{NULL}         |
|{"data": {"A": 0, "B": "0"}}                                    |{{0, NULL}}    |
|{"data": {"A": 0, "B": }}                                       |{NULL}         |
|{"data": {"A": 0, "B": "}}                                      |{NULL}         |
+----------------------------------------------------------------+---------------+

Here's what Databricks 14.3 returns:

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{NULL}         |
|{"data": {"A": 0, "B": xyz}}                                    |{NULL}         |
|{"data": {"A": 0, "B": "0"}}                                    |{NULL}         |
|{"data": {"A": 0, "B": }}                                       |{NULL}         |
|{"data": {"A": 0, "B": "}}                                      |{NULL}         |
+----------------------------------------------------------------+---------------+

The 5th and 7th rows differ: wherever a field value is present but fails to cast to long (overflow in row 5, a quoted number in row 7), Databricks 14.3 nullifies the whole data struct instead of just that field.

mythrocks commented 1 week ago

As discussed, I'm not inclined to "solve" the problem at this time. I'll refactor the tests so that the problematic rows are skipped in an xfailed test. We can revisit this for a proper fix.
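A minimal sketch of that refactor, under the assumption that the differing rows get split into their own test case. The `is_databricks143` helper below is a hypothetical stand-in, though `DATABRICKS_RUNTIME_VERSION` is a real environment variable set on Databricks clusters:

import pytest
import os

def is_databricks143():
    # Hypothetical stand-in for the suite's runtime-detection helpers.
    return os.environ.get('DATABRICKS_RUNTIME_VERSION', '').startswith('14.3')

@pytest.mark.xfail(condition=is_databricks143(),
                   reason='https://github.com/NVIDIA/spark-rapids/issues/11711')
def test_from_json_long_structs_invalid_rows():
    # Would exercise only the rows whose from_json result differs on
    # Databricks 14.3 (rows 5 and 7 above).
    ...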

ttnghia commented 1 week ago

It seems that null rows in the child columns, caused by cast failures, always nullify the top-level column. We need to verify that while working on this issue. If that is the case, fixing this will be less complex.

mythrocks commented 1 week ago

> failure in casting will always nullify the top level columns.

One wonders how far up the chain the nullification is transmitted. That's worth digging into at a different time.
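One quick way to probe that, when the time comes: rerun the repro with an extra level of nesting and observe whether a bad leaf nullifies only its immediate parent or every ancestor. A sketch along those lines, with a hypothetical input; the propagation behaviour is exactly what would need to be observed on Databricks 14.3, not something asserted here:

import pyspark.sql.functions as f
from pyspark.sql.types import *

# Two levels of struct nesting around the long field.
schema = StructType([StructField("outer", StructType([
    StructField("inner", StructType([StructField("A", LongType())]))]))])

# Row 1 casts cleanly; row 2 has a leaf that fails to cast to long.
df = spark.createDataFrame(
    [('{"outer": {"inner": {"A": 1}}}',),
     ('{"outer": {"inner": {"A": "bad"}}}',)],
    'json string')

# Does row 2 come back as {{NULL}} (only `inner` nullified), {NULL}
# (`outer` nullified too), or NULL for the whole result?
df.select(f.from_json('json', schema)).show(truncate=False)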