NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[AUDIT] [SPARK-49743][SQL] OptimizeCsvJsonExpr should not change schema fields when pruning GetArrayStructFields #11691

Closed: amahussein closed this issue 2 weeks ago

amahussein commented 2 weeks ago

https://github.com/apache/spark/commit/a4fb6cbfda2

This PR affects the from_json operator, so at a minimum we need to test the behavior on the plugin. For example:

SELECT
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
  range(3) as t

Previously, the result would have been:

Array([ArraySeq(0),ArraySeq(null)], [ArraySeq(1),ArraySeq(null)], [ArraySeq(2),ArraySeq(null)])

The new result (verified through spark-shell) is:

Array([ArraySeq(0),ArraySeq(0)], [ArraySeq(1),ArraySeq(1)], [ArraySeq(2),ArraySeq(2)])

revans2 commented 2 weeks ago

I just looked at this a bit more deeply, and this is a bug in a logical plan optimization in Spark. What is more, we don't support top-level arrays in from_json yet, so this does not impact us at all.
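
For anyone trying to reason about the two result sets above, here is a rough pure-Python model of the difference. It is not Spark's code and does not touch the plugin. It only assumes, based on the SPARK-49743 title, that the old optimization rewrote the pruned schema field to the name used in the field access (A) rather than keeping the declared name (a), and that JSON keys are then matched exactly against that pruned name. The helper names (resolve_field, extract, run_query) are hypothetical.

```python
import json


def resolve_field(schema_fields, requested):
    """Case-insensitive lookup of the requested name in the declared schema."""
    for name in schema_fields:
        if name.lower() == requested.lower():
            return name
    raise KeyError(requested)


def extract(json_text, schema_fields, requested, keep_schema_name):
    """Toy model of from_json(...).<requested> on an array of structs.

    keep_schema_name=False models the old behavior (assumed): the pruned
    schema takes the name from the field access, e.g. "A".
    keep_schema_name=True models the fixed behavior: the pruned schema keeps
    the declared field name, e.g. "a".
    """
    declared = resolve_field(schema_fields, requested)
    pruned_name = declared if keep_schema_name else requested
    rows = json.loads(json_text)
    # Assumption: JSON keys are matched exactly against the pruned field name,
    # so a renamed field finds no key and yields None (null).
    return [row.get(pruned_name) for row in rows]


def run_query(keep_schema_name):
    """Mirror the SELECT over range(3): one (.a, .A) pair per id."""
    schema = ["a", "b"]
    results = []
    for i in range(3):
        text = '[{"a": %d, "b": %d}]' % (i, 2 * i)
        results.append((
            extract(text, schema, "a", keep_schema_name),
            extract(text, schema, "A", keep_schema_name),
        ))
    return results


if __name__ == "__main__":
    print("old:", run_query(keep_schema_name=False))
    print("new:", run_query(keep_schema_name=True))
```

Under these assumptions the model produces the same shape as the two outputs quoted in the issue: nulls for .A with the renamed field, and matching values once the declared name is kept.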