Closed github-actions[bot] closed 2 years ago
This is a nice SO question which has a partial answer from Sultan, but which we might be able to complete.
Sorry I've been a bit slow on this one! I don't think I quite understood the question until Sultan's more recent post.
I'd summarize Sultan's answers as tentatively saying that in a Dask groupby, all the rows of a group are brought together before the function runs, even if they start out split across partitions, because of shuffling. The 'proof' here is that shuffling shows up in the task graph, but surely there's a better way to show this? Maybe something along the lines of an updated version of this post? I think this page in the docs also touches on the SO question, but I would probably want a bit more reassurance that my model is actually using the data I want it to.
There's also the issue Sultan brought up about this docstring being outdated, and it seems like it hasn't been edited in the past ~5 years.
Yeah, Sultan's answer looks like a good one. I think a follow-up would be to open an issue/PR against the dask docs correcting it.
@scharlottej13 might be nice to link to your PR fixing the docs here: https://github.com/dask/dask/pull/8507
good idea, done!
Great, I think this can be closed
I want to use Dask for operations of the form
where some_function() may compute some summary statistics, perform timeseries forecasting, or even just save the group to a single file in AWS S3. Da…
Read the full article on the following website:
https://stackoverflow.com/questions/70265639/best-way-to-perform-arbitrary-operations-on-groups-with-dask-dataframes