mythrocks opened this issue 2 weeks ago
I think we can probably do some of this in post-processing. We have similar issues for overflow on arrays and structs; really old versions of Spark invalidate the entire struct if there was a single overflow in it.
I am not sure what the priority for this really is, though.
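For illustration, here is a toy sketch of that post-processing rule in plain Python (all names here are hypothetical, not the plugin's actual code). It assumes we can distinguish a field that was merely absent from one whose cast failed (e.g. overflow): one bad child nullifies the whole struct, while an absent child stays NULL on its own.

```python
def nullify_structs_with_invalid_children(struct_rows):
    """Each row maps field name -> (state, value), where state is one of
    "ok", "missing", or "invalid" (cast failure / overflow)."""
    out = []
    for row in struct_rows:
        if row is None:
            out.append(None)
        elif any(state == "invalid" for state, _ in row.values()):
            # One bad child invalidates the entire struct, matching the
            # old-Spark / Databricks 14.3 behaviour described here.
            out.append(None)
        else:
            out.append({name: (value if state == "ok" else None)
                        for name, (state, value) in row.items()})
    return out

rows = [
    {"A": ("ok", 0), "B": ("ok", 1)},
    {"A": ("ok", 1), "B": ("missing", None)},
    {"A": ("invalid", None), "B": ("invalid", None)},  # the overflow row
]
print(nullify_structs_with_invalid_children(rows))
# [{'A': 0, 'B': 1}, {'A': 1, 'B': None}, None]
```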
Please confirm whether this change applies only to overflow, or to other invalid input as well. In particular, please add these input lines to the test file and post the output:
{"data": {"A": 0, "B": xyz}}
{"data": {"A": 0, "B": "0"}}
{"data": {"A": 0, "B": }}
{"data": {"A": 0, "B": "}}
Here's the input with @ttnghia's corner cases added:
{"data": {"A": 0, "B": 1}}
{"data": {"A": 1}}
{"data": {"B": 50}}
{"data": {"B": -128, "A": 127}}
{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}
{"data": {"A": 0, "B": xyz}}
{"data": {"A": 0, "B": "0"}}
{"data": {"A": 0, "B": }}
{"data": {"A": 0, "B": "}}
Here's the output from Apache Spark 3.5.x, which matches spark-rapids, and nearly all Databricks versions:
+----------------------------------------------------------------+---------------+
|json |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}} |{{0, 1}} |
|{"data": {"A": 1}} |{{1, NULL}} |
|{"data": {"B": 50}} |{{NULL, 50}} |
|{"data": {"B": -128, "A": 127}} |{{127, -128}} |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{{NULL, NULL}} |
|{"data": {"A": 0, "B": xyz}} |{NULL} |
|{"data": {"A": 0, "B": "0"}} |{{0, NULL}} |
|{"data": {"A": 0, "B": }} |{NULL} |
|{"data": {"A": 0, "B": "}} |{NULL} |
+----------------------------------------------------------------+---------------+
Here's what Databricks 14.3 returns:
+----------------------------------------------------------------+---------------+
|json |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}} |{{0, 1}} |
|{"data": {"A": 1}} |{{1, NULL}} |
|{"data": {"B": 50}} |{{NULL, 50}} |
|{"data": {"B": -128, "A": 127}} |{{127, -128}} |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{NULL} |
|{"data": {"A": 0, "B": xyz}} |{NULL} |
|{"data": {"A": 0, "B": "0"}} |{NULL} |
|{"data": {"A": 0, "B": }} |{NULL} |
|{"data": {"A": 0, "B": "}} |{NULL} |
+----------------------------------------------------------------+---------------+
The 5th and 7th rows differ: Apache Spark 3.5.x nullifies only the offending fields (yielding `{{NULL, NULL}}` and `{{0, NULL}}`), while Databricks 14.3 nullifies the entire struct (`{NULL}`).
As discussed, I'm not inclined to "solve" the problem at this time. I'll refactor the tests so that the problematic rows are skipped in an xfailed test. We can revisit this for a proper fix.
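For reference, a sketch of what that split might look like, assuming standard pytest idioms; `is_databricks_143` is a hypothetical helper standing in for whatever runtime check the test suite already uses:

```python
import pytest

def is_databricks_143():
    # Hypothetical helper: the real check would inspect the Spark/Databricks
    # runtime version, e.g. via the suite's existing conftest utilities.
    return False

# Rows 5 and 7 above, where Databricks 14.3 nullifies the whole struct.
DIVERGENT_ROWS = [
    '{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}',
    '{"data": {"A": 0, "B": "0"}}',
]

@pytest.mark.xfail(condition=is_databricks_143(),
                   reason="Databricks 14.3 nullifies the whole struct "
                          "when a child fails to cast")
def test_from_json_long_structs_divergent_rows():
    # Would run the same CPU-vs-GPU comparison as test_from_json_long_structs,
    # restricted to DIVERGENT_ROWS.
    ...
```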
It seems that null rows in the child columns due to failure in casting will always nullify the top-level columns. We need to verify that when working on this issue; if so, the fix will be less complex.
> failure in casting will always nullify the top-level columns
One wonders how far up the chain the nullification is transmitted. That's worth digging into at a different time.
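For what it's worth, a quick PySpark probe could answer that on Databricks 14.3. This is a sketch, assuming a deeper (three-level) schema than the test's, with the overflow pushed to the innermost field:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.getOrCreate()

deep_schema = "a STRUCT<b: STRUCT<c: BIGINT>>"
df = spark.createDataFrame(
    [('{"a": {"b": {"c": 99999999999999999999}}}',)], "json STRING")

# If the NULL propagates all the way up, `parsed` itself will be NULL;
# otherwise only `parsed.a.b` (or some intermediate level) will be.
df.select(from_json(col("json"), deep_schema).alias("parsed")) \
  .selectExpr("parsed", "parsed.a", "parsed.a.b").show(truncate=False)
```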
The behaviour of `from_json` seems to have changed on Databricks 14.3. This was revealed as part of a test failure (`json_matrix_test.py::test_from_json_long_structs`) on Databricks. The effective repro uses the test input `json` file from the test (a sketch follows below). The output on Apache Spark 3.5 (and all other Apache Spark versions) is as in the first table above. On Databricks 14.3, the last record (the overflow row) is `NULL`, and not `{{NULL, NULL}}`.

I fear this will involve a policy change in the CUDF implementation of `from_json`, and using it from a `350db` shim. (I'm not an expert on the JSON parsing end of this.)
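For concreteness, here is a minimal sketch of such a repro, assuming a `data STRUCT<A: BIGINT, B: BIGINT>` schema (the real schema and input file live in `json_matrix_test.py`), with the input rows inlined rather than read from the file:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.getOrCreate()

rows = [
    '{"data": {"A": 0, "B": 1}}',
    '{"data": {"A": 1}}',
    '{"data": {"B": 50}}',
    '{"data": {"B": -128, "A": 127}}',
    '{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}',
]
schema = "data STRUCT<A: BIGINT, B: BIGINT>"

df = spark.createDataFrame([(r,) for r in rows], "json STRING")
df.select(col("json"),
          from_json(col("json"), schema).alias("from_json(json)")) \
  .show(truncate=False)
```

Running this on Apache Spark 3.5.x versus Databricks 14.3 should reproduce the `{{NULL, NULL}}` versus `{NULL}` difference on the overflow row.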