dask / community

For general discussion and community planning. Discussion issues welcome.

Is dask.dataframe using GPU in this case? Will GPU speed up this process? #391

Closed Huilin-Li closed 1 month ago

Huilin-Li commented 2 months ago

dask.dataframe works very well for apply functions and groupby operations on 11M rows. Saving that many rows to parquet might be a problem, but that is going to be solved. My goal is to filter down to a small result after lots of apply functions and groupby operations.

It looks like dask.dataframe can handle it, but is it doing so without the GPU?

import dask.dataframe as dd

# 1. read data
ddf = dd.read_parquet("./fa.parquet") # ~11M rows
# 2. lots of apply functions, groupby operations
# number of rows will increase to ~5billions
# 3. save to parquet and read again for apply functions, groupby again.
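
A rough sketch of what steps 2 and 3 might look like (the apply function, the "key"/"value" column names, and the intermediate file path below are placeholders, not my real data):

import dask.dataframe as dd

ddf = dd.read_parquet("./fa.parquet")  # ~11M rows

# 2. example apply + groupby; "key" and "value" are placeholder columns
ddf["expanded"] = ddf["value"].apply(lambda v: v * 2, meta=("expanded", "f8"))
agg = ddf.groupby("key")["expanded"].sum().reset_index()

# 3. persist the intermediate result, then read it back for further work
agg.to_parquet("./intermediate.parquet")
agg2 = dd.read_parquet("./intermediate.parquet")
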
phofl commented 1 month ago

Dask uses pandas as the default backend, and pandas doesn't use the GPU.

You can use the dask-cudf integration if you want to leverage GPUs:

https://docs.rapids.ai/api/dask-cudf/stable/
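
For example, a minimal sketch using dask_cudf directly (this assumes a CUDA-capable GPU and the RAPIDS packages are installed; the groupby column is just a placeholder):

import dask_cudf

# Read the parquet file into GPU-backed partitions (cudf DataFrames)
ddf = dask_cudf.read_parquet("./fa.parquet")

# groupby/aggregations then run on the GPU; "key" is a placeholder column
result = ddf.groupby("key").size().compute()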

jacobtomlinson commented 1 month ago

Once you've installed cudf, you can set it as the default backend for your dask.dataframe operations.

import dask
import dask.dataframe as dd

# Set cudf as dask.dataframe backend
dask.config.set({"dataframe.backend": "cudf"})

# Carry on as normal
# 1. read data
ddf = dd.read_parquet("./fa.parquet") # ~11M rows
# 2. lots of apply functions, groupby operations
# number of rows will increase to ~5billions
# 3. save to parquet and read again for apply functions, groupby again.
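
If you already have a pandas-backed Dask DataFrame, newer Dask versions also let you move it between backends with to_backend (worth double-checking that your installed Dask release supports it):

import dask.dataframe as dd

# Read with whatever backend is currently configured (pandas by default)
ddf = dd.read_parquet("./fa.parquet")

# Explicitly move the partitions to cudf (GPU) for the heavy groupby work
ddf = ddf.to_backend("cudf")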