NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Update to_json to be more generic and fix some bugs #11642

Closed revans2 closed 1 month ago

revans2 commented 1 month ago

I think this fixes a lot of issues related to to_json.

This fixes https://github.com/NVIDIA/spark-rapids/issues/10924
This fixes https://github.com/NVIDIA/spark-rapids/issues/10923
This fixes https://github.com/NVIDIA/spark-rapids/issues/10921
This fixes https://github.com/NVIDIA/spark-rapids/issues/10920
This fixes https://github.com/NVIDIA/spark-rapids/issues/10919
This fixes https://github.com/NVIDIA/spark-rapids/issues/10916
This fixes https://github.com/NVIDIA/spark-rapids/issues/10915
This fixes https://github.com/NVIDIA/spark-rapids/issues/10896
This fixes https://github.com/NVIDIA/spark-rapids/issues/10895
This fixes https://github.com/NVIDIA/spark-rapids/issues/10894

This also stops one more case from crashing, but that case now falls back to the CPU.

There really are only a few changes here.

  1. Call castToString instead of forcing the top-level type to be a struct. (This lets us convert arrays and maps as top-level items too.)
  2. Moved quoting/escaping of strings into cast to string for strings, dates, and timestamps. This fixes an issue where arrays would not always quote those values appropriately.
  3. Added many more checks to fall back to the CPU (specifically, to fall back for maps with non-string keys, as Spark is not great with this).
  4. Updated the tests to remove the xfail tests for cases where we know we fall back to the CPU, and verified that fallback tests cover those cases.
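
To illustrate the idea behind changes 1 to 3, here is a rough pure-Python sketch. It is not the plugin code: `cast_to_json_string` and `CpuFallback` are hypothetical names made up for this example, a Python `dict` stands in for a map, and the date/timestamp formatting is simplified to ISO 8601 rather than matching Spark's output. The point is only that once quoting happens where each leaf value is cast, arrays, maps, and nested values all get the same quoting, and any top-level type works.

```python
import json
from datetime import date, datetime


class CpuFallback(Exception):
    """Hypothetical signal that this input should be handled on the CPU."""


def cast_to_json_string(value):
    """Recursively render a value as JSON text, quoting at the leaf level."""
    if value is None:
        return "null"
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return json.dumps(value)
    if isinstance(value, str):
        # json.dumps adds the quotes and escapes embedded quotes/backslashes.
        return json.dumps(value)
    if isinstance(value, (date, datetime)):
        # Assumption: ISO format is used here only for illustration.
        return json.dumps(value.isoformat())
    if isinstance(value, (list, tuple)):
        # Elements are cast with the same function, so strings and dates
        # inside arrays are quoted exactly like top-level ones.
        return "[" + ",".join(cast_to_json_string(v) for v in value) + "]"
    if isinstance(value, dict):
        # Mirror the fallback check: maps with non-string keys are not handled.
        if any(not isinstance(k, str) for k in value):
            raise CpuFallback("map with non-string keys")
        items = (json.dumps(k) + ":" + cast_to_json_string(v)
                 for k, v in value.items())
        return "{" + ",".join(items) + "}"
    raise CpuFallback("unsupported type: " + type(value).__name__)
```

With this shape, a top-level array such as `["a", date(2024, 1, 2)]` and a top-level map such as `{"k": [1, None]}` both go through the same path, while `{1: "x"}` raises the hypothetical `CpuFallback` instead of crashing.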
revans2 commented 1 month ago

build

revans2 commented 1 month ago

build