dask / community

For general discussion and community planning. Discussion issues welcome.
19 stars 3 forks source link

Migrating from current dask/dataframe implementation to dask-expr #361

Open phofl opened 5 months ago

phofl commented 5 months ago

We've been working on adding a Query optimization layer to Dask DataFrames for a while now. The project live at https://github.com/dask-contrib/dask-expr

The status quo can be summarised as follows:

The next step is to think about how we can flip users from the legacy implementation to the expression based implementation.

My suggestion is something along the lines:

This issue is mostly meant to gather feedback about how we should approach this

TomAugspurger commented 5 months ago

Thanks @phofl, a couple quick questions:

First, is there a list of known incompatibilities (I guess https://github.com/dask-contrib/dask-expr?tab=readme-ov-file#api-coverage, if it's up to date?). It'd be good to include that in the migration guide.

Second, on the deprecation warning

Add a deprecation warning the init of dask/dataframe

Do you have specific wording in mind? Is the intent to deprecate import dask.dataframe permanently? Or is the intent to swap out the implementation of dask.dataframe with dask.expr, but keep the name dask.dataframe for user-facing code? I think this might interact with the plan for continued development of dask-expr (whether it'll be in dask/dask, and releases will be synchronized with dask, or whether it'll continue to be developed independently).

And what's the plan for packaging this up? Will you make dask-expr a required dependency of dask[dataframe] before the switch? That would make it easier for users to adapt, since the dependency would already exist. (I guess this would interact with the plan for future development of dask-expr).

It's great to see all the progress here!

phofl commented 5 months ago

First, is there a list of known incompatibilities (I guess https://github.com/dask-contrib/dask-expr?tab=readme-ov-file#api-coverage, if it's up to date?). It'd be good to include that in the migration guide.

Yes that's up to date

Do you have specific wording in mind? Is the intent to deprecate import dask.dataframe permanently? Or is the intent to swap out the implementation of dask.dataframe with dask.expr, but keep the name dask.dataframe for user-facing code? I think this might interact with the plan for continued development of dask-expr (whether it'll be in dask/dask, and releases will be synchronized with dask, or whether it'll continue to be developed independently).

dask.dataframe will stay, we just want to swap out the implementation. dask-expr will probably eventually end up in dask/dask, but for now fast ci times and not that much baggage are more helpful than merging the repositories

And what's the plan for packaging this up? Will you make dask-expr a required dependency of dask[dataframe] before the switch? That would make it easier for users to adapt, since the dependency would already exist. (I guess this would interact with the plan for future development of dask-expr).

That's something that we haven't discussed in detail, but this is not trivial to do since dask-expr requires pandas >= 2, so the dependencies would conflict, We currently raise an error that users have to install dask-expr if they enable query planning and it's not there

TomAugspurger commented 5 months ago

Makes sense, thanks.

I think the main thing to keep an eye on is how long we leave dask.dataframe giving that warning about using dask.expr. I'd say a short time (maybe a few months tops) would be appropriate. I wouldn't want to go too long with users learning dask from older tutorials telling them to import dask.dataframe and have it give a warning.

Maybe even before it's merged into dask/dask, we could have dask[dataframe] depend on dask-expr and just have dask.dataframe.__init__ do a from dask_expr import *.