[FEA] Improve `GpuJsonToStructs` performance

The performance of the current `GpuJsonToStructs` is poor. A profile of it looks like this:

In the test case profiled above, the only useful work is what runs up to the end of the `read_json` range (just above 300ms), which is less than 50% of the entire `GpuJsonToStructs` projection (>800ms). The rest is overhead, consisting mostly of hundreds of small kernel calls and stream syncs caused purely by copying data from the intermediate result to the final output.

We can do much better by removing the unnecessary overhead, or by reworking those steps so that they run in much less time. If we divide the runtime of `GpuJsonToStructs` into sections:

The improvements can be made through the following tasks:

Section 1: Improve libcudf's `cudf::read_json`. This needs help from the cudf team. Beyond that, we can improve the performance of this section with some auxiliary work:
- https://github.com/NVIDIA/spark-rapids-jni/pull/2457
- https://github.com/NVIDIA/spark-rapids/pull/11549
Section 2: Improve the process that assembles the output table of `cudf::read_json` into the output structs column with the desired read schema. Currently, this process may need to copy many columns (hundreds) from the output table of `cudf::read_json`, which is a significant overhead, as the profile of this section shows. We can move them instead. This can be achieved by:
- https://github.com/rapidsai/cudf/issues/17002
- Dependencies:
  - https://github.com/rapidsai/cudf/issues/17090
  - https://github.com/rapidsai/cudf/issues/17091
Section 3: Improve the conversion step from strings columns into the desired types. Some columns need to be converted, while others are output directly without any conversion. However, instead of being moved into the output, the latter are copied again, which causes a lot of overhead when the number of strings columns is significant.
- spark-rapids-jni issue: https://github.com/NVIDIA/spark-rapids-jni/issues/2468, PR https://github.com/NVIDIA/spark-rapids-jni/pull/2510.
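
The copy-versus-move difference behind Sections 2 and 3 can be illustrated with a minimal, self-contained sketch. It does not use libcudf: `Column`, `assemble_by_copy`, and `assemble_by_move` are hypothetical stand-ins, assuming only that a column owns its buffer, so a copy duplicates the data while a move merely transfers ownership.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column: it exclusively owns a buffer of bytes.
struct Column {
  std::string name;
  std::unique_ptr<std::vector<char>> data;
};

// Copying: every output column allocates a new buffer and duplicates the bytes.
// With hundreds of columns this means hundreds of allocations and copies.
std::vector<Column> assemble_by_copy(std::vector<Column> const& input) {
  std::vector<Column> out;
  out.reserve(input.size());
  for (auto const& col : input) {
    out.push_back(Column{col.name, std::make_unique<std::vector<char>>(*col.data)});
  }
  return out;
}

// Moving: ownership of each buffer is transferred to the output.
// No buffer is allocated and no bytes are copied.
std::vector<Column> assemble_by_move(std::vector<Column>&& input) {
  std::vector<Column> out;
  out.reserve(input.size());
  for (auto& col : input) {
    out.push_back(std::move(col));
  }
  input.clear();  // the moved-from columns are no longer usable
  return out;
}
```

In this sketch the cost of `assemble_by_move` is independent of the buffer sizes, which is the property the tasks above aim for when the intermediate table is no longer needed after assembly.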

After discussion with @ttnghia, here are the improvements planned for the different sections:

Section 1: @karthikeyann and @shrshi are working on validation and memory usage reduction here: https://github.com/rapidsai/cudf/pull/16996 https://github.com/rapidsai/cudf/pull/16978 TBD: to eliminate or minimize concat_json, we are considering a new strings_column input as the data source or a new JSON reader option; this needs more planning (@shrshi).
Section 2: To eliminate Section 2 completely, @karthikeyann will work on adding a new schema interface that supports column ordering and all-null columns for non-existent columns. The only requirement on the input schema is that it should not need sanitization inside the libcudf reader (this covers UTF-8 matching of column names, duplicate paths, invalid schemas, etc.). https://github.com/rapidsai/cudf/issues/17090 https://github.com/rapidsai/cudf/issues/17091
Section 3: @ttnghia will work on avoiding column copies after parsing: invoke the libcudf reader from C++, parse the string columns into the target data types, and move/replace the columns. TBD: INT, FLOAT, STRING parsing rules; check whether libcudf is compliant with Spark requirements. Some types (DECIMAL) may have special cases that can only be handled in Spark.
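
The behavior requested for Section 2 (output columns in the order given by the schema, with an all-null column for any name the parser did not produce) can be sketched on the host as follows. This is only an illustration under stated assumptions, not the planned libcudf interface: `StringColumn`, `NamedColumn`, and `assemble_in_schema_order` are hypothetical names, and the schema is assumed to be already sanitized (no duplicate names).

```cpp
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for a column of nullable strings.
using StringColumn = std::vector<std::optional<std::string>>;

struct NamedColumn {
  std::string name;
  std::unique_ptr<StringColumn> data;
};

// Returns one column per schema entry, in schema order.
// Parsed columns are moved (never copied); names missing from the parsed
// input become all-null columns of `num_rows` rows. Parsed columns that the
// schema does not mention are dropped.
std::vector<NamedColumn> assemble_in_schema_order(
    std::vector<NamedColumn>&& parsed,
    std::vector<std::string> const& schema,
    std::size_t num_rows) {
  std::unordered_map<std::string, std::unique_ptr<StringColumn>> by_name;
  for (auto& col : parsed) {
    by_name.emplace(std::move(col.name), std::move(col.data));
  }
  parsed.clear();

  std::vector<NamedColumn> out;
  out.reserve(schema.size());
  for (auto const& name : schema) {
    auto it = by_name.find(name);
    if (it != by_name.end() && it->second) {
      // Column exists: transfer ownership into the output.
      out.push_back(NamedColumn{name, std::move(it->second)});
    } else {
      // Column does not exist in the parsed data: emit an all-null column.
      out.push_back(NamedColumn{
          name, std::make_unique<StringColumn>(num_rows, std::nullopt)});
    }
  }
  return out;
}
```

For example, with parsed columns `b` and `a` and the schema `{"a", "missing", "b"}`, the result holds `a`, an all-null `missing`, and `b`, in that order, with the buffers of `a` and `b` reused rather than copied.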

[FEA] Improve `GpuJsonToStructs` performance #11560