Open ttnghia opened 2 weeks ago
After discussion with @ttnghia, here are the improvements planned for different sections:
column ordering and all-null columns for non-existent columns.
The only requirement on the input schema is that it should not need sanitization inside the libcudf reader (this includes UTF-8 matching of column names, duplicate paths, invalid schemas, etc.).
https://github.com/rapidsai/cudf/issues/17090
https://github.com/rapidsai/cudf/issues/17091
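To make the "column ordering and all-null columns for non-existent columns" behavior concrete, here is a minimal host-side sketch. It is not libcudf or plugin code: `ParsedColumn` and `conform_to_schema` are hypothetical names invented for illustration, and the column is modeled as a plain vector of nullable integers. It assumes, as required above, that the schema has already been sanitized (no duplicate names).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed column: a name plus nullable values.
struct ParsedColumn {
  std::string name;
  std::vector<std::optional<int>> values;
};

// Return columns in the order given by `schema`.
// - A requested name that the parser did not produce becomes an all-null
//   column with `num_rows` rows.
// - Parsed columns that the schema does not mention are dropped.
// Assumption: `schema` is already sanitized and contains no duplicate names.
std::vector<ParsedColumn> conform_to_schema(std::vector<ParsedColumn> parsed,
                                            const std::vector<std::string>& schema,
                                            std::size_t num_rows) {
  // Map each parsed column name to its position in `parsed`.
  std::unordered_map<std::string, std::size_t> index_of;
  for (std::size_t i = 0; i < parsed.size(); ++i) {
    index_of.emplace(parsed[i].name, i);
  }

  std::vector<ParsedColumn> out;
  out.reserve(schema.size());
  for (const auto& name : schema) {
    auto it = index_of.find(name);
    if (it != index_of.end()) {
      // Column exists: move it into its requested position.
      out.push_back(std::move(parsed[it->second]));
    } else {
      // Column does not exist in the input: emit an all-null column.
      out.push_back(ParsedColumn{
          name, std::vector<std::optional<int>>(num_rows, std::nullopt)});
    }
  }
  return out;
}
```

For example, if the parser produced columns `b` and `a` and the requested schema is `a, c, b`, the result holds `a`, an all-null `c`, and `b`, in that order.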
The performance of our current GpuJsonToStructs is not good. When running the profiling, it looks like this:
In the particular test case for the profiling above, the only useful work is what runs up to the end of the read_json range (just above 300ms), which is less than 50% of the entire GpuJsonToStructs projection (>800ms). The rest is just overhead, consisting mostly of hundreds of small kernel calls and stream syncs caused by purely copying data from the intermediate result to the final output.
We can do a lot better by removing the unnecessary overhead, or by reworking it so that it runs in much less time. If we divide the runtime of GpuJsonToStructs into sections:
The improvement can be done by the following tasks:
cudf::read_json. This needs help from the cudf team. Beyond that, we can improve the performance of this section with some auxiliary work:
cudf::read_json into the output structs column with the desired read schema. Currently, this process may need to copy a lot of columns (hundreds of them) from the output table of cudf::read_json, which is a significant overhead, as the profiling of this section shows. We can just move them instead. This can be achieved by:
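As a rough illustration of why moving is cheaper than copying here, the following self-contained sketch contrasts the two strategies. It does not use libcudf: `OwnedColumn`, `Table`, `StructColumn`, and both `make_struct_by_*` functions are hypothetical stand-ins, and a `std::vector<int>` plays the role of a device buffer. The `release()` member only loosely mirrors the idea of taking ownership of a table's columns and is an assumption of this sketch, not a description of the actual change.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column that owns a buffer. It counts deep
// copies so the difference between the two strategies can be observed.
struct OwnedColumn {
  std::vector<int> data;
  static inline std::size_t deep_copies = 0;

  explicit OwnedColumn(std::vector<int> d) : data(std::move(d)) {}
  OwnedColumn(const OwnedColumn& other) : data(other.data) { ++deep_copies; }
};

using ColumnPtr = std::unique_ptr<OwnedColumn>;

// Hypothetical stand-in for the reader's output table.
struct Table {
  std::vector<ColumnPtr> columns;

  // Give up ownership of all columns; the table is left empty.
  std::vector<ColumnPtr> release() { return std::exchange(columns, {}); }
};

// Hypothetical stand-in for the output structs column.
struct StructColumn {
  std::vector<ColumnPtr> children;
};

// Copying: every child buffer is duplicated, one deep copy per column.
StructColumn make_struct_by_copy(const Table& table) {
  StructColumn out;
  out.children.reserve(table.columns.size());
  for (const auto& col : table.columns) {
    out.children.push_back(std::make_unique<OwnedColumn>(*col));
  }
  return out;
}

// Moving: only the owning pointers change hands; no buffer is duplicated.
StructColumn make_struct_by_move(Table&& table) {
  return StructColumn{table.release()};
}
```

With hundreds of columns, the copying variant performs one buffer copy per column, while the moving variant does a constant amount of pointer bookkeeping per column and leaves the source table empty.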