coiled / dask-community

Issue tracker for the Dask community team
MIT License

[Stack Overflow] Best way to perform arbitrary operations on groups with Dask DataFrames #94

Closed github-actions[bot] closed 2 years ago

github-actions[bot] commented 2 years ago

I want to use Dask for operations of the form

df.groupby(some_columns).apply(some_function)

where some_function() may compute some summary statistics, perform timeseries forecasting, or even just save the group to a single file in AWS S3.

Da…


Would you like to know more?

Read the full article on the following website:

https://stackoverflow.com/questions/70265639/best-way-to-perform-arbitrary-operations-on-groups-with-dask-dataframes

ian-r-rose commented 2 years ago

This is a nice SO question. It has a partial answer from Sultan, which we might be able to complete.

scharlottej13 commented 2 years ago

Sorry I've been a bit slow on this one! I don't think I quite understood the question until Sultan's more recent post.

I'd tentatively summarize Sultan's answers as saying that in a Dask groupby-apply, each group reaches the function whole, even if its rows start out split across partitions, because Dask shuffles the data first. The 'proof' here is that shuffling shows up in the task graph, but surely there's a better way to show this? Maybe something along the lines of an updated version of this post? I think this page in the docs also touches on the SO question, but I would want a bit more reassurance that my model is actually using the data I want it to.
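
To make the shuffling idea concrete, here is a rough pure-Python sketch of a hash-based shuffle followed by a per-group apply. It is only a toy model of the concept, not Dask's implementation: the helper names (shuffle, groupby_apply, summarize) and the list-of-dicts "partitions" are all made up for illustration.

```python
from collections import defaultdict


def shuffle(partitions, key, npartitions):
    """Route every row to an output partition chosen by hashing its key.

    Hypothetical helper: rows with the same key always hash to the same
    output partition, so each group ends up in exactly one place.
    """
    out = [[] for _ in range(npartitions)]
    for part in partitions:
        for row in part:
            out[hash(row[key]) % npartitions].append(row)
    return out


def groupby_apply(partitions, key, func):
    """Shuffle on the key, then call func once per complete group."""
    results = {}
    for part in shuffle(partitions, key, len(partitions)):
        groups = defaultdict(list)
        for row in part:
            groups[row[key]].append(row)
        for group_key, rows in groups.items():
            results[group_key] = func(rows)
    return results


def summarize(rows):
    """Stand-in for some_function: count the rows and sum column x."""
    return {"n_rows": len(rows), "total": sum(r["x"] for r in rows)}


# Group "a" starts out split across both input partitions.
partitions = [
    [{"key": "a", "x": 1}, {"key": "b", "x": 10}],
    [{"key": "a", "x": 2}, {"key": "c", "x": 100}],
]

result = groupby_apply(partitions, "key", summarize)
print(result["a"])  # {'n_rows': 2, 'total': 3}
```

Even though the two "a" rows begin in different partitions, summarize sees both of them in a single call. That is the property I'd want reassurance about before trusting a model fitted inside a groupby-apply.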

There's also the issue Sultan brought up about this docstring being outdated, and it seems like it hasn't been edited in the past ~5 years.

ian-r-rose commented 2 years ago

Yeah, Sultan's answer looks like a good one. I think a follow-up would be to open an issue/PR against the Dask docs correcting the outdated docstring.

ian-r-rose commented 2 years ago

@scharlottej13 might be nice to link to your PR fixing the docs here: https://github.com/dask/dask/pull/8507

scharlottej13 commented 2 years ago

Good idea, done!

ian-r-rose commented 2 years ago

Great, I think this can be closed