Open phofl opened 5 months ago
Thanks @phofl, a couple quick questions:
First, is there a list of known incompatibilities (I guess https://github.com/dask-contrib/dask-expr?tab=readme-ov-file#api-coverage, if it's up to date?). It'd be good to include that in the migration guide.
Second, on the deprecation warning
Add a deprecation warning the init of dask/dataframe
Do you have specific wording in mind? Is the intent to deprecate import dask.dataframe
permanently? Or is the intent to swap out the implementation of dask.dataframe
with dask.expr
, but keep the name dask.dataframe
for user-facing code? I think this might interact with the plan for continued development of dask-expr (whether it'll be in dask/dask, and releases will be synchronized with dask, or whether it'll continue to be developed independently).
And what's the plan for packaging this up? Will you make dask-expr a required dependency of dask[dataframe] before the switch? That would make it easier for users to adapt, since the dependency would already exist. (I guess this would interact with the plan for future development of dask-expr).
It's great to see all the progress here!
First, is there a list of known incompatibilities (I guess https://github.com/dask-contrib/dask-expr?tab=readme-ov-file#api-coverage, if it's up to date?). It'd be good to include that in the migration guide.
Yes that's up to date
Do you have specific wording in mind? Is the intent to deprecate import dask.dataframe permanently? Or is the intent to swap out the implementation of dask.dataframe with dask.expr, but keep the name dask.dataframe for user-facing code? I think this might interact with the plan for continued development of dask-expr (whether it'll be in dask/dask, and releases will be synchronized with dask, or whether it'll continue to be developed independently).
dask.dataframe will stay, we just want to swap out the implementation. dask-expr will probably eventually end up in dask/dask, but for now fast ci times and not that much baggage are more helpful than merging the repositories
And what's the plan for packaging this up? Will you make dask-expr a required dependency of dask[dataframe] before the switch? That would make it easier for users to adapt, since the dependency would already exist. (I guess this would interact with the plan for future development of dask-expr).
That's something that we haven't discussed in detail, but this is not trivial to do since dask-expr requires pandas >= 2, so the dependencies would conflict, We currently raise an error that users have to install dask-expr if they enable query planning and it's not there
Makes sense, thanks.
I think the main thing to keep an eye on is how long we leave dask.dataframe
giving that warning about using dask.expr
. I'd say a short time (maybe a few months tops) would be appropriate. I wouldn't want to go too long with users learning dask from older tutorials telling them to import dask.dataframe
and have it give a warning.
Maybe even before it's merged into dask/dask, we could have dask[dataframe]
depend on dask-expr
and just have dask.dataframe.__init__
do a from dask_expr import *
.
We've been working on adding a Query optimization layer to Dask DataFrames for a while now. The project live at https://github.com/dask-contrib/dask-expr
The status quo can be summarised as follows:
The next step is to think about how we can flip users from the legacy implementation to the expression based implementation.
My suggestion is something along the lines:
This issue is mostly meant to gather feedback about how we should approach this