dask / community

For general discussion and community planning. Discussion issues welcome.

when setting dask.config.set({"dataframe.backend": "cudf"}), ddf.explode("col1") cannot work correctly anymore? #392

Closed Huilin-Li closed 3 months ago

Huilin-Li commented 3 months ago

Without setting dask.config.set({"dataframe.backend": "cudf"}), the computation works but is very slow, so I set dask.config.set({"dataframe.backend": "cudf"}). With that setting, I get this error:

Traceback (most recent call last):
  File "/storage/.../MYANAWORK/myflsh.py", line 52, in <module>
    exp_mykmers = ddf.explode('mykmers')
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.../miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask_expr/_collection.py", line 3261, in explode
    return new_collection(expr.ExplodeFrame(self, column=column))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.../miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask_expr/_collection.py", line 4779, in new_collection
    meta = expr._meta
           ^^^^^^^^^^
  File "/home/.../miniconda3/envs/cudf_dev/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/.../miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask_expr/_expr.py", line 496, in _meta
    return self.operation(*args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.../miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask/utils.py", line 1241, in __call__
    return getattr(__obj, self.method)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.../miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/.../miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/core/dataframe.py", line 7531, in explode
    return super()._explode(column, ignore_index)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/.../miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/.../miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/core/indexed_frame.py", line 5188, in _explode
    if not isinstance(self._data[explode_column].dtype, ListDtype):
                      ~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/home/../miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/core/column_accessor.py", line 148, in __getitem__
    return self._data[key]
           ~~~~~~~~~~^^^^^
TypeError: unhashable type: 'list'
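
The last frame of the traceback indexes a mapping with the explode column, and the message says the key was a list. One possible reading, which this thread does not confirm, is that the column name reached cudf as ["mykmers"] rather than "mykmers". A minimal plain-Python sketch of that failure mode follows; the `columns` dict, its sample values, and the `lookup` helper are hypothetical stand-ins, not cudf internals:

```python
# Plain-Python sketch: a dict lookup with a list as the key fails the same way.
# `columns` is a hypothetical stand-in for a column mapping, not cudf code.
columns = {"mykmers": [["AAC", "ACG"], ["CGT"]]}


def lookup(key):
    """Return the stored column, or the error text if the key is unhashable."""
    try:
        return columns[key]
    except TypeError as err:
        return f"TypeError: {err}"


print(lookup("mykmers"))    # a string key is hashable, so the lookup succeeds
print(lookup(["mykmers"]))  # a list key is not hashable, so a TypeError is reported
```

If this is what happens, checking the type of the column argument that reaches the backend's explode would be a reasonable first debugging step.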

More details are here link

fjetter commented 3 months ago

cc @dask/gpu

jacobtomlinson commented 3 months ago

Unfortunately, this isn't the right place to report this. Could you open an issue on the cudf repo with a minimal bug report, something that the cudf team can copy/paste to reproduce the problem?

rjzamora commented 3 months ago

Hi @Huilin-Li!

As @jacobtomlinson suggested, please do raise an issue in dask/dask or cudf.

I think we will need to know more about the data you are calling explode on. I don't think explode is supported by the "cudf" backend when query-planning is enabled. However, there also seem to be problems with explode when the "pandas" backend is used. E.g.:

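import dask
import dask.dataframe as dd

# Use the pandas backend to check whether the failure is specific to cudf
dask.config.set({"dataframe.backend": "pandas"})

# One partition; column "A" holds list values, including an empty list
df = dd.from_dict({"A": [[0, 1, 2], [], [3, 4]]}, npartitions=1)
df.explode("A")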

(I get an error for both backends here, so please include a specific reproducer like this in your dask issue.)
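
For anyone comparing results against the reproducer above, here is a rough plain-Python sketch of the row expansion that pandas documents for explode: each list element gets its own row, the original row label is repeated, and an empty list produces a single missing value. The `explode_rows` helper is hypothetical and uses None where pandas would use NaN; it is not dask or cudf code.

```python
def explode_rows(rows, column):
    """Expand list-like values in `column` so each element gets its own row.

    Returns (original_row_index, row_dict) pairs.
    """
    out = []
    for index, row in enumerate(rows):
        values = row[column]
        if not isinstance(values, (list, tuple)):
            values = [values]   # scalars pass through unchanged
        elif len(values) == 0:
            values = [None]     # an empty list yields one row with a missing value
        for value in values:
            out.append((index, {**row, column: value}))
    return out


# Same data as the reproducer: three rows, one of them an empty list
rows = [{"A": [0, 1, 2]}, {"A": []}, {"A": [3, 4]}]
for index, row in explode_rows(rows, "A"):
    print(index, row)
```

Stating the expected rows like this in a bug report makes it easier for maintainers to see how the actual behavior differs.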

jameslamb commented 3 months ago

Linking the issue that was opened, for those finding this from search: https://github.com/rapidsai/cudf/issues/16458