NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0
749 stars, 221 forks

[FEA] Move all JSON parsing to the same backend as get_json_object #10804

Open revans2 opened 1 month ago

revans2 commented 1 month ago

Is your feature request related to a problem? Please describe.

This is an epic intended to get us to the point where all JSON parsing functionality can be enabled by default. It is not intended to be the final long-term solution. We really want a common JSON parser/tokenizer that is owned and maintained by cuDF. But to get correctness and at least good-enough performance in the short term, we are going to go with this approach.

The first thing we need is a performance baseline, so we can be sure that get_json_object does not regress as we change the tokenizer to make it more configurable.
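To make the baseline idea concrete, here is a minimal host-side sketch of a regression check. It is not taken from the spark-rapids or spark-rapids-jni benchmarks. The helper names `best_of_ms` and `is_regression` and the tolerance parameter are all hypothetical, and a real benchmark would need the timed callable to synchronize with the GPU before returning.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: runs `fn` `repeats` times and returns the fastest
// wall-clock time in milliseconds. `fn` must block until its work is done
// (for GPU work that means synchronizing the stream before returning).
template <typename Fn>
double best_of_ms(Fn&& fn, int repeats) {
  assert(repeats > 0);
  double best = std::numeric_limits<double>::infinity();
  for (int i = 0; i < repeats; ++i) {
    auto const start = std::chrono::steady_clock::now();
    fn();
    auto const stop = std::chrono::steady_clock::now();
    double const ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    if (ms < best) best = ms;
  }
  return best;
}

// Hypothetical check: flags a regression when the current time exceeds the
// recorded baseline by more than `tolerance` (0.10 means 10 percent).
constexpr bool is_regression(double baseline_ms, double current_ms,
                             double tolerance) {
  return current_ms > baseline_ms * (1.0 + tolerance);
}
```

The idea would be to record `best_of_ms` for get_json_object before any tokenizer refactoring, then run `is_regression` against that number after each change. Taking the fastest run instead of the mean is just one way to cut down on noise. It is an assumption here, not something we have decided on.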

As a part of this we also need to finish writing all of the JSON tests we can come up with.

After this we need to do some refactoring of the JSON tokenizer in https://github.com/NVIDIA/spark-rapids-jni/blob/branch-24.06/src/main/cpp/src/json_parser.cuh. from_json and the JSON input format are configurable in a number of ways that we need to support. get_json_object and json_tuple are not configurable, and the current tokenizer has been hard-coded to handle their settings.
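As a rough illustration of what "configurable" could mean here, the sketch below moves two lexical decisions out of hard-coded logic and into an options struct. This is not the interface of json_parser.cuh. The struct `tokenizer_options` and both functions are hypothetical, the code is plain host C++ and not CUDA, and it ignores escapes. The two field names only echo the Spark JSON options `allowSingleQuotes` and `allowNumericLeadingZeros`.

```cpp
#include <string_view>

// Hypothetical option set; NOT the actual interface of json_parser.cuh.
// Defaults here are illustrative only.
struct tokenizer_options {
  bool allow_single_quotes = true;
  bool allow_numeric_leading_zeros = false;
};

// True if `tok` starts and ends with the same quote character and that
// quote character is permitted by `opts`. Escapes are ignored for brevity.
constexpr bool is_string_token(std::string_view tok,
                               tokenizer_options const& opts) {
  if (tok.size() < 2) return false;
  char const q = tok.front();
  bool const quote_ok = q == '"' || (q == '\'' && opts.allow_single_quotes);
  return quote_ok && tok.back() == q;
}

// True if `tok` is an optionally negative run of digits. A leading zero on
// a multi-digit number is accepted only when the option allows it.
constexpr bool is_integer_token(std::string_view tok,
                                tokenizer_options const& opts) {
  if (!tok.empty() && tok.front() == '-') tok.remove_prefix(1);
  if (tok.empty()) return false;
  for (char c : tok) {
    if (c < '0' || c > '9') return false;
  }
  bool const has_leading_zero = tok.size() > 1 && tok.front() == '0';
  return !has_leading_zero || opts.allow_numeric_leading_zeros;
}
```

With something like this, get_json_object and json_tuple would pass one fixed set of options, while from_json and the JSON input format would build theirs from the user-supplied settings. Whether a runtime struct or compile-time template parameters is the better fit for the GPU tokenizer is an open question that this sketch does not try to answer.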

Finally we will need to write some custom implementations of different operators so we can hopefully improve the total performance.