dask / scipy-tutorials-2018


Introduction: Use Dask in SciPy Tutorials? #3

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

tl;dr: do you want to collaborate on a small scalability section using Dask within your SciPy tutorial?

Hello everyone,

I'm excited about the SciPy tutorial lineup this year. Dask devs plan to set up some infrastructure to give students in our tutorial access to a modest cluster in the cloud so that they can do some scalable analysis. These distributed systems have been popular in our tutorials in the past. It will likely be similar to the public Pangeo deployment currently used within earth sciences (JupyterHub + Dask on Kubernetes on Google). We were also planning to extend this infrastructure to a couple of other groups (scikit-learn, pandas, ...) for small Dask sections at the end of their tutorials but, after seeing the lineup this year, we thought it might be best to reach out to others to see if a broader collaboration might be more interesting.

So, concretely, do you want to collaborate on a small section in your tutorial that shows how to scale your domain and libraries using Dask? This would require the following from you:

  1. The development of educational notebooks and examples that use Dask within your domain
  2. A software environment (requirements.txt, environment.yaml, etc.) that matches those notebooks
  3. Some learning about Dask to ensure that you're comfortable delivering this content

Dask and JupyterHub developers have some availability to help tutorial leaders develop these materials and manage infrastructure so that students have access to distributed resources during tutorials.

What Next?

If you're interested in exploring this topic then please raise a new issue within this repository with the title of your tutorial and some thoughts and questions (there are some leading questions within the issue template to help start conversation). I imagine that most people haven't started writing or updating materials yet, so I would expect early conversation to be pretty exploratory. Perhaps we can explore applications together that might be both interesting and accessible to beginning students.

General questions are also welcome here, though please note that many people are cc'ed on this issue, and so raising new issues within this repository might be best to avoid all-to-all e-mail chatter.

Who

To avoid exclusion I've included the top author listed on all tutorials. However, I expect that this will make more sense in some cases (Introduction to NumPy) than in others (Introduction to Julia), but I would love to be surprised :) To those for whom this is not a good fit, I sincerely apologize for the unnecessary e-mail. You may wish to unsubscribe from this issue.

  1. Introduction to Python and Programming @jiffyclub
  2. PyViz: Easy Visualization and Exploration for all your Data @jbednar
  3. Around the World in 80 Ways: An Introduction to Working with Geodata and Cartopy @pelson
  4. Network Analysis Made Simple: Network Fundamentals @ericmjl
  5. Getting Started with TensorFlow @random_forests
  6. Introduction to Numerical Computing with NumPy @achabotl
  7. 3D Visualization with Mayavi @prabhuramachandran
  8. An Introduction to Julia @xorJane
  9. Image Analysis in Python with SciPy and scikit-image @stefanv
  10. Software Engineering Techniques @jiffyclub
  11. Information Extraction Using Topic Models @parulsethi
  12. Anatomy of Matplotlib @weathergod
  13. Machine Learning with scikit-learn @amueller
  14. Getting Started with JupyterLab @carreau
  15. Introduction to Geospatial Data Analysis with Python @sjsrey @jorisvandenbossche
  16. Scientific MicroPython on Microcontrollers @rcolistete
  17. Setting Up Your Own Open Source Project @dopplershift
  18. pandas .head() to .tail() @deniederhut @TomAugspurger
  19. Hands-on Satellite Imagery Analysis @sarasafavi
  20. The Jupyter Interactive Widget Ecosystem @mwcraig
  21. The Sheer Joy of Packaging @msarahan
  22. Bayesian Data Science Two Ways: Simulation and Probabilistic Programming @ericmjl

Also cc @yuvipanda, @choldgraf, and @willingc from JupyterHub

Thank you all for your time, -matt

msarahan commented 6 years ago

I have no idea how we'd work dask meaningfully into a packaging tutorial, but if you have ideas on that, I'd be open to discussion.

mrocklin commented 6 years ago

I agree that this may not make sense for many of the projects here, including packaging.

If you'd like to open a discussion then I recommend raising a separate issue on the issue tracker of this repository (just to avoid spamming the others here).

If you're fairly confident that there isn't much opportunity here then please feel free to ignore this entirely.

Thanks!

Carreau commented 6 years ago

cc @jasongrout.

In the JupyterLab tutorial we can show what the dask lab extension does (and that it is an extension).

amueller commented 6 years ago

I'm not sure how much time I'll have to work that into our tutorial. I think the most interesting application might be parallelizing on a single machine, but I'm not sure how well that works with your setup. What's the status of broadcasting data for training a random forest in parallel?

mrocklin commented 6 years ago

Folks, for the sake of everyone cc'ed on this issue please do not respond to this issue.

Please respond in new issues.

@amueller I've responded to your comment in https://github.com/dask/scipy-tutorials-2018/issues/7

Thank you @jasongrout for responding to @Carreau in #4

jasongrout commented 6 years ago

@mrocklin - you might even lock this thread, and put a note to that effect in the top-level description.