jlowe closed this issue 2 months ago.
The output shows that an input of `{"a": {"b":"md"} }` produces the same results between CPU and GPU:
```
[2024-01-31T20:47:58.663Z] Row(a='{"a": {"b":"md"} }', from_json(a)=Row(a='{"b":"md"}'))
```
But an almost identical input of `{"a": {"b":"mh"} }` produces different results between CPU and GPU:
```
[2024-01-31T20:47:58.663Z] -Row(a='{"a": {"b":"mh"} }', from_json(a)=Row(a='{"b":"mh"}'))
[2024-01-31T20:47:58.663Z] +Row(a='{"a": {"b":"mh"} }', from_json(a)=Row(a='{mh}'))
```
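For reference, the expected behavior is that a nested JSON object read into a `StringType` field comes back as its raw JSON text, inner quotes included; the buggy GPU result drops those quotes. A minimal pure-Python sketch of that expectation (`object_field_as_string` is a hypothetical helper, not spark-rapids code):

```python
import json

def object_field_as_string(json_line, field):
    # Hypothetical helper: parse one JSON line and return the requested field.
    # A nested object or array is re-serialized to compact JSON text, mimicking
    # what from_json should produce when the schema maps the field to a string.
    value = json.loads(json_line)[field]
    if isinstance(value, (dict, list)):
        return json.dumps(value, separators=(",", ":"))
    return value

# The correct CPU result keeps the inner quotes; the GPU bug produced '{mh}'.
print(object_field_as_string('{"a": {"b":"mh"} }', "a"))  # → {"b":"mh"}
```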
I have been unable to reproduce this so far with Spark 3.3.0, even using the same datagen seed.
However, a manual test does show differences between CPU and GPU, but does not match the results from the failed CI run exactly.
```scala
scala> import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.types._
scala> val df = Seq("""{"a": {"b":"md"} }""", """{"a": {"b":"mh"} }""").toDF("json").repartition(2)
scala> spark.conf.set("spark.rapids.sql.expression.JsonToStructs", true)
scala> spark.conf.set("spark.rapids.sql.json.read.mixedTypesAsString.enabled", true)
scala> val schema = StructType(Seq(StructField("a", DataTypes.StringType, true)))
scala> df.select(col("json"), from_json(col("json"), schema)).show
```
GPU run:

```
+------------------+---------------+
|              json|from_json(json)|
+------------------+---------------+
|{"a": {"b":"md"} }|         {{md}}|
|{"a": {"b":"mh"} }|         {{mh}}|
+------------------+---------------+
```

CPU run:

```
+------------------+---------------+
|              json|from_json(json)|
+------------------+---------------+
|{"a": {"b":"md"} }|   {{"b":"md"}}|
|{"a": {"b":"mh"} }|   {{"b":"mh"}}|
+------------------+---------------+
```
Note that this failure was from a distributed cluster setup, so the nature of the failure may have something to do with how the input data is partitioned across tasks. That particular distribution is probably not replicated in the default local run environment.
Also, my manual test is using `show`; if I run `collect` instead, then I do see the same results between CPU and GPU. I think the `show` issue is already known under https://github.com/NVIDIA/spark-rapids/issues/8558.
So the failure may depend on some partitions containing mixed types while others don't. I will try to reproduce that in an integration test.
I will create a PR to xfail this test while I investigate.
The test code from the comment in https://github.com/NVIDIA/spark-rapids/issues/10351#issuecomment-1920128622 now works, but the test itself still fails because it needs support for LISTs, not just STRUCTs. https://github.com/rapidsai/cudf/issues/15278 is the issue we need fixed for this test to start passing.
Details

```
[2024-01-31T20:47:58.663Z] =================================== FAILURES ===================================
[2024-01-31T20:47:58.663Z] ___________ test_from_json_mixed_types_list_struct[struct
```