By default, Dask uses pandas as its backend, which doesn't use the GPU. You can use the dask-cudf integration if you want to leverage GPUs. Once you've installed cudf, you can set it as the default backend for your `dask.dataframe` operations:
```python
import dask
import dask.dataframe as dd

# Set cudf as the dask.dataframe backend
dask.config.set({"dataframe.backend": "cudf"})

# Carry on as normal
# 1. read data
ddf = dd.read_parquet("./fa.parquet")  # ~11M rows
# 2. lots of apply functions, groupby operations;
#    number of rows will increase to ~5 billion
# 3. save to parquet and read again for apply functions, groupby again
```
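For concreteness, here is a minimal sketch of what steps 2-3 might look like with the cudf backend enabled. The column names (`key`, `value`), the doubling transform, the filter threshold, and the intermediate path `./step2.parquet` are all placeholders, not details from the original post:

```python
import dask
import dask.dataframe as dd

dask.config.set({"dataframe.backend": "cudf"})

# 1. read data
ddf = dd.read_parquet("./fa.parquet")  # ~11M rows

# 2. placeholder transform + groupby; "key" and "value" are
#    hypothetical column names standing in for the real workload
ddf["value"] = ddf["value"].map_partitions(lambda s: s * 2)
agg = ddf.groupby("key").sum().reset_index()

# 3. save the intermediate result, read it back, and filter
#    down to a small result that fits in memory
agg.to_parquet("./step2.parquet")
small = dd.read_parquet("./step2.parquet")
result = small[small["value"] > 0].compute()
```

Where the per-row logic allows it, `map_partitions` is usually preferable to a row-wise `apply`, since it operates on whole partitions at once; this matters even more on the cudf backend, where row-wise `apply` is limited to functions cudf can compile.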
`dask.dataframe` works very well for `apply` functions and `groupby` operations on 11M rows. Saving that many rows to `parquet` might be the problem, but it is going to be solved: I aim to filter down to a small result after lots of `apply` functions and `groupby` operations. It looks like `dask.dataframe` can handle it, although without a GPU?
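If the parquet write does become the bottleneck at ~5 billion rows, one common mitigation (a general suggestion, not something from this thread) is to repartition before writing so each output file stays a manageable size. The size target and paths below are assumptions:

```python
import dask
import dask.dataframe as dd

dask.config.set({"dataframe.backend": "cudf"})

# Hypothetical large intermediate result (~5 billion rows)
big = dd.read_parquet("./step2.parquet")

# Rebalance into roughly 256 MB partitions before writing;
# the size target and output path are placeholders
big = big.repartition(partition_size="256MB")
big.to_parquet("./rebalanced.parquet")
```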