apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

Spike: evaluate if cuDF can be used with datafusion-python #936

Open timsaucer opened 1 week ago

timsaucer commented 1 week ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

As other DataFrame libraries begin to leverage GPU resources, it would be useful to see whether we could build on the work already done in pandas and polars for interoperating with cuDF to give a similar experience in DataFusion.

Describe the solution you'd like

Assess the level of effort and the technical limitations of using cuDF to evaluate DataFrames. Also worth evaluating is cuDF's C++ interface, which we could potentially bring into DataFusion upstream if we are willing to write the appropriate wrappers.

Describe alternatives you've considered

Leave as is.

Additional context

This task is focused only on researching what would be required and whether there is an opportunity here.

andygrove commented 1 week ago

I have some experience in this area. While at NVIDIA, I created a POC with Rust bindings around cuDF and then provided interoperability with arrow-rs. Unfortunately, that code was internal and not open-source. I used cxx to create the bindings.

This repo (datafusion-python) once contained a prototype of translating a DataFusion logical plan to cuDF operations (all in Python). It is still there in the history somewhere.
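For readers unfamiliar with the idea, here is a minimal sketch of what translating a logical plan into DataFrame operations could look like. It is not the prototype from the repo history. The `Scan`, `Filter`, and `Projection` nodes are hypothetical toy classes, not DataFusion's actual logical plan types, and `RowListBackend` is a pure-Python stand-in so the example runs without a GPU. The assumption is that a cuDF-backed class could expose the same three methods.

```python
from dataclasses import dataclass
import operator
from typing import Any, Union


# Toy plan nodes (hypothetical; NOT DataFusion's actual LogicalPlan classes).
@dataclass
class Scan:
    table: str


@dataclass
class Filter:
    input: "Plan"
    column: str
    op: str  # one of the keys in OPS
    value: Any


@dataclass
class Projection:
    input: "Plan"
    columns: list


Plan = Union[Scan, Filter, Projection]

OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


class RowListBackend:
    """Stand-in backend that stores each table as a list of dicts.

    A GPU backend would expose the same three methods but delegate to a
    DataFrame library instead (an assumption, not shown here).
    """

    def __init__(self, tables):
        self.tables = tables

    def scan(self, name):
        return list(self.tables[name])

    def filter(self, rows, column, op, value):
        return [r for r in rows if OPS[op](r[column], value)]

    def project(self, rows, columns):
        return [{c: r[c] for c in columns} for r in rows]


def execute(plan, backend):
    """Walk the plan bottom-up, mapping each node to one backend call."""
    if isinstance(plan, Scan):
        return backend.scan(plan.table)
    if isinstance(plan, Filter):
        child = execute(plan.input, backend)
        return backend.filter(child, plan.column, plan.op, plan.value)
    if isinstance(plan, Projection):
        child = execute(plan.input, backend)
        return backend.project(child, plan.columns)
    raise TypeError(f"unsupported plan node: {type(plan).__name__}")


# Roughly: SELECT a FROM t WHERE b > 1
plan = Projection(Filter(Scan("t"), "b", ">", 1), ["a"])
backend = RowListBackend(
    {"t": [{"a": 1, "b": 1}, {"a": 2, "b": 2}, {"a": 3, "b": 3}]}
)
result = execute(plan, backend)
```

Keeping the translator separate from the backend means the plan walk stays the same when the execution engine changes. Only the three backend methods would need a GPU implementation.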

andygrove commented 1 week ago

I see that one RAPIDS library, cuVS, now provides Rust bindings: https://docs.rapids.ai/api/cuvs/nightly/rust_api/ so it may be interesting to see what approach they took to wrap the C++ code in this case.

Edit: cuVS is using bindgen.

drauschenbach commented 22 hours ago

> This repo (datafusion-python) once contained a prototype of translating a DataFusion logical plan to cuDF operations (all in Python). It is still there in the history somewhere.

Possibly #602.