Open a-agrz opened 6 years ago
Hello,
We ran tests with DataFrames, but they were unsuccessful. Which branch of GPUEnabler do you use?
I'm making a slight correction to my earlier statement.
The master branch supports Dataset, not DataFrame/Dataset[Row]. We need to know the names of the fields/columns to pick up and feed into the GPU for processing. This is possible only with Dataset, not with DataFrames.
Check out the example: https://github.com/IBMSparkGPU/GPUEnabler/blob/master/examples/src/main/scala/com/ibm/gpuenabler/SparkDSLR.scala
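To make the Dataset vs. DataFrame distinction concrete, here is a minimal sketch in plain Spark (no GPUEnabler calls). The `Point` case class and the `DatasetVsDataFrame` object are hypothetical names used only for illustration, and the sketch assumes Spark SQL 2.x or later is on the classpath. It shows that a typed Dataset carries its field names and types at compile time, while a DataFrame is just `Dataset[Row]`, and that `as[T]` can turn a DataFrame back into a typed Dataset.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, for illustration only (not part of GPUEnabler).
case class Point(x: Double, y: Double)

object DatasetVsDataFrame {
  // Typed Dataset: the element type, and so the field names and types,
  // is known at compile time.
  def typedPoints(spark: SparkSession): Dataset[Point] = {
    import spark.implicits._
    Seq(Point(1.0, 2.0), Point(3.0, 4.0)).toDS()
  }

  // A DataFrame is an alias for Dataset[Row]; as[Point] recovers the typed
  // view when the column names and types match the case class.
  def asTyped(spark: SparkSession, df: DataFrame): Dataset[Point] = {
    import spark.implicits._
    df.as[Point]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-vs-dataframe")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val points = typedPoints(spark)
    // Field access is checked by the compiler.
    val xs: Dataset[Double] = points.map(p => p.x)

    // With Row, the column name is a runtime string lookup.
    val df: DataFrame = points.toDF()
    val xsFromRows: Dataset[Double] = df.map(row => row.getAs[Double]("x"))

    xs.show()
    xsFromRows.show()
    asTyped(spark, df).show()

    spark.stop()
  }
}
```

If your data starts out as a DataFrame, converting it with `as[T]` as above is one way to get a typed Dataset. Whether that is enough for your GPUEnabler use case is an assumption you would need to verify against the linked example.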
Actually, I watched the Spark Summit '18 presentation by Dr. Kazuaki Ishizaki @kiszk and Madhusudanan Kandasamy @kmadhugit. In the second part, he talks about interesting results obtained with DataFrame. Should I understand that he is in fact talking about DataFrame/Dataset[Row]?
@francois-wellenreiter, I understand your question. The talk mentioned two approaches to offloading to the GPU: (1) users provide a compiled CUDA module, and the plugin takes care of data movement and produces the results; (2) transparent GPU offloading, where the CUDA modules are autogenerated at runtime for a set of supported operations such as select, selectExpr, etc.
My responses above regarding Dataset relate to Approach (1). The GPUEnabler plugin can help you with that approach and works alongside Apache Spark. For this, Datasets are a better fit.
Approach (2) requires modifications inside the Spark code itself. Since it's not integrated into the Spark codebase yet, you can try out the custom Spark found at https://github.com/IBMSparkGPU/SparkGPU.
Yes. It provides APIs for both DataFrames and RDDs. We recommend using only the DataFrame GpuEnabler API. Check out the samples for usage.