Closed Dooyoung-Hwang closed 3 years ago
If the custom DataSource can provide RDD[ColumnarBatch] to spark-rapids directly, it would be more efficient because the conversion overhead is removed.
Does this RDD[ColumnarBatch] contain GPU data or CPU data? If the latter, there would still be a conversion from host columnar data to device columnar data. That type of conversion is already supported by the plugin, but it's important to note that a (cheaper) conversion would still occur. The plan would have a HostColumnarToGpu node instead of a GpuRowToColumnar node.
After discussion: data source v1 doesn't support columnar, so we switched to data source v2. With data source v2, custom data sources just work, and we insert a HostColumnarToGpu transition to get the data onto the GPU.
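For reference, a custom data source v2 opts into columnar output through Spark's PartitionReaderFactory interface. A minimal sketch (the class names and the batch reader are hypothetical; only the interface methods are real Spark API):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical factory for a columnar-capable custom source.
class MyColumnarReaderFactory extends PartitionReaderFactory {

  // Row-based reader is required by the interface, even if unused here.
  override def createReader(p: InputPartition): PartitionReader[InternalRow] =
    throw new UnsupportedOperationException("this source only reads columnar")

  // Returning true lets Spark request ColumnarBatch directly, so the plugin
  // only needs a HostColumnarToGpu transition instead of GpuRowToColumnar.
  override def supportColumnarReads(p: InputPartition): Boolean = true

  override def createColumnarReader(p: InputPartition): PartitionReader[ColumnarBatch] =
    new MyBatchReader(p) // hypothetical reader yielding host ColumnarBatches
}
```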
In this case I believe the data will already be in Arrow format (ArrowColumnVector); we can investigate making HostColumnarToGpu smarter about getting that data onto the GPU.
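To illustrate, Arrow data typically reaches Spark wrapped as ArrowColumnVector inside a ColumnarBatch. A sketch using the Arrow Java API (the column name and values are made up); this Arrow-backed shape is what HostColumnarToGpu could special-case to copy the underlying buffers directly:

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

val allocator = new RootAllocator(Long.MaxValue)

// Build a small Arrow vector on the host.
val vec = new IntVector("id", allocator)
vec.allocateNew(3)
(0 until 3).foreach(i => vec.setSafe(i, i))
vec.setValueCount(3)

// Wrap it as a Spark column vector; batches of these back the data source's
// ColumnarBatch output, keeping the Arrow buffers accessible underneath.
val batch = new ColumnarBatch(Array[ColumnVector](new ArrowColumnVector(vec)), 3)
```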
Note that, looking at a couple of the sample queries, they use round of a decimal, support for which is in progress, and also average of a decimal, which we don't support yet.
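Concretely, the problematic expressions have this shape (the schema and column names here are illustrative, not the exact taxi-trip schema):

```scala
// Assuming trip_distance and total_amount are DECIMAL columns:
spark.sql(
  """SELECT passenger_count,
    |       round(trip_distance),  -- round of a decimal (support in progress)
    |       avg(total_amount)      -- average of a decimal (not yet supported)
    |FROM trips
    |GROUP BY passenger_count, round(trip_distance)""".stripMargin)
```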
Note: for sample queries and data we can look at the NYC taxi ride dataset and queries: https://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/ (explanation: https://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html). The 4 queries can also be found here: https://tech.marksblogg.com/omnisci-macos-macbookpro-mbp.html
These are the results from other solutions: https://tech.marksblogg.com/benchmarks.html
Rounding support is being worked on in https://github.com/NVIDIA/spark-rapids/pull/1244 .
Average should work once we support casting, which is being tracked in this issue: https://github.com/NVIDIA/spark-rapids/issues/1330 .
Note we may also need percentile_approx here.
cudf jira for percentile_approx -> https://github.com/rapidsai/cudf/issues/7170
The main functionality to support a faster copy when a data source v2 supplies Arrow data is committed under https://github.com/NVIDIA/spark-rapids/pull/1622. It supports primitive types and strings; it does not support decimal or nested types yet.
Note: filed a separate issue for the write side: https://github.com/NVIDIA/spark-rapids/issues/1648.
I'm going to close this since the initial version is committed.
Is your feature request related to a problem? Please describe. When I executed an aggregation query with our custom data source, I found the physical plan of the query was like this.
This shows that the InternalRows are built first, and then transformed into ColumnarBatches by the GpuRowToColumnar plan node. If the custom DataSource can provide RDD[ColumnarBatch] to spark-rapids directly, it would be more efficient because the conversion overhead is removed.
Describe the solution you'd like
The changed physical plan can be illustrated like this.