IBMSparkGPU / GPUEnabler

Provides GPU awareness to Spark, Contact: @kmadhugit and @kiszk
Apache License 2.0

Does GPUEnabler support DataFrames? #82

Open a-agrz opened 6 years ago

josiahsams commented 6 years ago

Yes. It provides APIs for both DataFrames and RDDs. We recommend using only the DataFrame GPUEnabler API. Check out the samples for usage.

francois-wellenreiter commented 6 years ago

Hello,

We have run tests with DataFrames, without success. Which branch of GPUEnabler do you use?

josiahsams commented 6 years ago

I'm making a slight correction to my earlier statement. The master branch supports Dataset, not DataFrame/Dataset[Row]. We need to know the names of the fields/columns to be picked up and fed into the GPU for processing. This is possible only with Dataset and not with DataFrames.

Check out the example: https://github.com/IBMSparkGPU/GPUEnabler/blob/master/examples/src/main/scala/com/ibm/gpuenabler/SparkDSLR.scala
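
To illustrate the Dataset versus DataFrame distinction, here is a minimal sketch in plain Spark. It makes no GPUEnabler calls, and the names DataPoint and DatasetVsDataFrame are hypothetical, chosen only for this illustration. With a typed Dataset the field names and types can be derived from the case class itself, whereas a DataFrame (Dataset[Row]) carries them only in its runtime schema.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}

// Hypothetical record type, for illustration only (not part of GPUEnabler).
case class DataPoint(x: Array[Double], y: Double)

object DatasetVsDataFrame {
  // Field names derived from the case class alone, before any data exists.
  def fieldNames: Seq[String] =
    Encoders.product[DataPoint].schema.fieldNames.toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetVsDataFrame")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Typed Dataset: every element is a DataPoint with known fields.
    val ds: Dataset[DataPoint] = Seq(
      DataPoint(Array(1.0, 2.0), 0.0),
      DataPoint(Array(3.0, 4.0), 1.0)
    ).toDS()
    println(fieldNames.mkString(", "))

    // DataFrame is an alias for Dataset[Row]: rows are untyped at compile time.
    val df: DataFrame = ds.toDF()

    // Converting back to a typed Dataset restores compile-time field access.
    val typed: Dataset[DataPoint] = df.as[DataPoint]
    println(typed.map(_.y).reduce(_ + _))

    spark.stop()
  }
}
```

If you already have a DataFrame whose columns match a case class, converting it with `.as[T]` as above gives you a typed Dataset to work with.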

francois-wellenreiter commented 6 years ago

Actually, I watched the Spark Summit '18 presentation by Dr. Kazuaki Ishizaki (@kiszk) and Madhusudanan Kandasamy (@kmadhugit). In the second part, he talks about interesting results obtained with DataFrame. Should I understand that he is in fact talking about DataFrame/Dataset[Row]?

josiahsams commented 5 years ago

@francois-wellenreiter, I understand your question. The talk described two approaches to offloading work to the GPU: (1) the user provides a compiled CUDA module, and the plugin takes care of data movement and produces the results; (2) transparent GPU offloading, where the CUDA modules are generated automatically at runtime for a set of supported operations such as select and selectExpr.

My responses above regarding Dataset relate to approach (1). The GPUEnabler plugin can help you with that approach and works alongside Apache Spark. For this, Datasets are a better fit.

Approach (2) requires modifications inside the Spark code. Since it is not integrated into the Spark codebase yet, you can try out the custom Spark build found at https://github.com/IBMSparkGPU/SparkGPU.
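
As a rough sketch of what user code looks like under approach (2), the snippet below uses only standard Spark DataFrame operations (select and selectExpr). Nothing in it refers to a GPU. Whether a given build actually offloads these operations is an assumption that this example does not verify, and the object name SelectExample and the column names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SelectExample {
  // Ordinary DataFrame operations; any GPU offloading would happen
  // inside Spark itself, not in this user code.
  def transform(df: DataFrame): DataFrame =
    df.select("id", "value")
      .selectExpr("id", "value * 2 AS doubled")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SelectExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small in-memory DataFrame with hypothetical column names.
    val df = Seq((1, 1.5), (2, 2.5)).toDF("id", "value")
    transform(df).show()

    spark.stop()
  }
}
```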