dask / dask-expr

Dask Expr Dataframes are no longer instances of dd.core.DataFrame #1135

Closed by andoni-garcia-fgp 3 weeks ago

andoni-garcia-fgp commented 1 month ago

Describe the issue:

When dask-expr is included in my dependencies, my Dask DataFrames are no longer instances of dd.core.DataFrame. This leads to subtle bugs in isinstance checks: whether isinstance(ddf, dd.DataFrame) or isinstance(ddf, dd.core.DataFrame) returns True depends on the exact library call used to construct the DataFrame.

Minimal Complete Verifiable Example:

import dask.dataframe as dd
import pandas as pd

test_ddf = dd.from_dict({}, npartitions=1)
type(test_ddf)
> <class 'dask_expr._collection.DataFrame'>
isinstance(test_ddf, dd.core.DataFrame)
> False
isinstance(test_ddf, dd.DataFrame)
> True

test_pdf = pd.DataFrame()
test_ddf = dd.from_pandas(test_pdf, npartitions=1)
isinstance(test_ddf, dd.DataFrame)
> True
isinstance(test_ddf, dd.core.DataFrame)
> False

test_ddf = dd.io.io.from_pandas(test_pdf, npartitions=1)
isinstance(test_ddf, dd.DataFrame)
> False
isinstance(test_ddf, dd.core.DataFrame)
> True

Anything else we need to know?:

Environment:

Installed via poetry:

dask = {extras = ["complete", "distributed"], version = "^2024.5.2"}
dask-expr = "^1.1.2"

[[package]]
name = "dask"
version = "2024.6.2"

[package.extras]
array = ["numpy (>=1.21)"]
complete = ["dask[array,dataframe,diagnostics,distributed]", "lz4 (>=4.3.2)", "pyarrow (>=7.0)", "pyarrow-hotfix"]
dataframe = ["dask-expr (>=1.1,<1.2)", "dask[array]", "pandas (>=1.3)"]
diagnostics = ["bokeh (>=2.4.2)", "jinja2 (>=2.10.3)"]
distributed = ["distributed (==2024.6.2)"]
test = ["pandas[test]", "pre-commit", "pytest", "pytest-cov", "pytest-rerunfailures", "pytest-timeout", "pytest-xdist"]

[[package]]
name = "dask-expr"
version = "1.1.6"

[package.extras]
analyze = ["crick", "distributed"]

phofl commented 1 month ago

This is intended. You should use dd.DataFrame; dd.core is considered private.

phofl commented 3 weeks ago

Closing this for now