dask / community

For general discussion and community planning. Discussion issues welcome.

SciPy 2022 Submissions #214

Closed jacobtomlinson closed 2 years ago

jacobtomlinson commented 2 years ago

I wanted to open an issue to help us coordinate Dask submissions for SciPy (and potentially other conferences) this year. Deadline is Feb 11th so we have a little while.

Note that the SciPy website suggests the conference will be in person in Austin and doesn't mention virtual participation, so that may affect folks' appetite to submit.

I expect folks like Coiled, Saturn, Anaconda, NVIDIA, etc who employ core maintainers will be submitting varying proposals. It would be nice to run a general Dask tutorial and perhaps have some kind of talk or project update that is presented by a cross-org group.

I personally am thinking about submitting a talk on the work we are doing in Dask Kubernetes at the moment.

It would be great to hear what topics folks are thinking about submitting in case there is any duplication or potential collaboration.

ncclementi commented 2 years ago

Hi there, thanks for starting this thread @jacobtomlinson.

The idea of running the Dask tutorial is great. Some of us at Coiled were discussing that we may need to rethink the layout/order of the topics in the current tutorial (not sure if anyone else has similar thoughts on this).

We noticed that the over-emphasis on delayed can confuse new users, and the average new user probably wants to do something like work with dataframes and read a CSV file. It was also mentioned that historically there was success building understanding from the bottom up, though starting with a more motivating example before going to the lower level might make more sense.

cc: @ian-r-rose @jcrist @bryanwweber

jacobtomlinson commented 2 years ago

Yeah, I agree with that. In the shorter tutorial that I've been running (based on the one by @adbreind), things start with dataframes, arrays, etc. and work their way down to the low level. I prefer that approach and would be keen to switch the "official tutorial" to that structure too.

@mrocklin made an interesting point a while ago (I can't remember where though) about folks totally missing the fact that Dask supports delayed and futures and painting the project as just "distributed pandas". So we should be mindful of ensuring folks know up front how deep Dask goes, even if it isn't explored in depth until the end.

bryanwweber commented 2 years ago

Hi all, I was talking with @ncclementi, @ian-r-rose, and @pavithraes this morning about this topic. We were talking about the possibility of having a short meeting with a few people who are interested in working on this content, both for presentation at SciPy and updating the online tutorial. That way we can sync ideas and divide up the work of creating any new content needed. Recognizing that there are a lot of stakeholders here, and a lot of people who might have thoughts about direction, I think it makes sense to start with anyone interested in making content first, and bring proposals to the rest of the community. I'm happy to discuss this idea on next week's community call as well! Thanks 😄

jsignell commented 2 years ago

I am super interested in this, and in particular in doing a good job of explaining the difference between delayed and futures now that distributed is becoming more and more part of the recommended Dask setup.
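As a rough illustration of that distinction (a minimal sketch, not taken from the tutorial): `dask.delayed` only records tasks and runs nothing until `.compute()` is called, while a futures-style `submit()` starts work immediately. To keep the example light it uses the standard library's `concurrent.futures` as a stand-in for the eager side, on the assumption that the Dask futures interface, which is modeled on it, behaves similarly for this purpose. The helper functions `inc` and `add` are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

import dask


def inc(x):
    # Hypothetical toy task
    return x + 1


def add(x, y):
    # Hypothetical toy task
    return x + y


# Lazy: dask.delayed builds a task graph; nothing runs yet.
a = dask.delayed(inc)(1)
b = dask.delayed(inc)(2)
lazy_total = dask.delayed(add)(a, b)
# Work happens only here, when the graph is executed.
delayed_result = lazy_total.compute()

# Eager: submit() schedules each task right away and returns a future.
with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(inc, 1)
    fb = pool.submit(inc, 2)
    # Block on the inputs, then submit the dependent task.
    future_result = pool.submit(add, fa.result(), fb.result()).result()
```

With a `dask.distributed` client, futures can be passed directly as arguments to a later `submit` call rather than resolved with `.result()` first; the stand-in above does not show that.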

martindurant commented 2 years ago

Historically, we did vacillate between putting delayed (but not futures) up front versus starting with the high-level APIs. Both versions can probably still be found in the dask-tutorial git history or forks. That might have been before the shift to always-distributed and our very nice dashboard, which makes it much simpler to demonstrate how a single dataframe API call is turned into tasks per chunk. I could see doing it that way around now, with delayed and futures at the end as a "what if the high-level API isn't enough". We would probably miss out on workflows that are one step too complex for standard multiprocessing, but not dataframe/array, like this - more along the lines of a low-latency prefect graph.

ncclementi commented 2 years ago

Should we close this now that SciPy 2022 has already happened? I don't have closing powers for this repo.

cc: @jrbourbeau or @jacobtomlinson

jcrist commented 2 years ago

Yep, closed!