Closed mgrover1 closed 1 year ago
Check out this pull request on
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
This pull request is being automatically built with GitHub Actions and Netlify. To see the status of your deployment, click below.
🔍 Git commit SHA: 1123c4afc1f2d0fffb9249108d644b42ac5ad0c3 ✅ Deployment Preview URL: https://637d28bad6baf6241278f38a--pythia-foundations.netlify.app
Hey @mgrover1 Thanks for putting this together. Can you add this file to the toc so it appears in the navbar?
Hey @mgrover1 Thanks for putting this together. Can you add this file to the toc so it appears in the navbar?
Added that in! Thanks for the reminder!
I really like that this description of Dask is tightly scoped to just introducing Dask Arrays. A lot of other dask tutorials lead someone down the path of "futures" and other much more advanced topics. A couple of brief thoughts:
.visualize()
-- great to see that what is going on as displayed a series of steps as a task graph.LocalCluster
/Client
? I think the built-in threaded model is more than sufficient for working with dask arrays (even within xarray) in many situation. Setting up dask distributed is another level of complexity not needed within Foundations..compute()
to demonstrate how you turn a lazy array into a real array. You mention .persist()
only in passing -- while clearly an important feature, I think it is out of scope in an intro notebook. And I don't think you are actually using .persist(), right?from dask.diagnostics import ProgressBar
with ProgressBar():
computed_ds = ds.compute()
environment.yml
needs python-graphviz
Thanks @jmunroe for the feedback!! I appreciate it - I will be sure to incorporate those suggestions.
@mgrover1 Could you ping me to take a look after you've addressed @jmunroe 's comments?
@mgrover1 Thanks for putting this together. I think the content fits well in the context of the foundations book as it covers a lot of use case without overloading the reader with a bunch of details. Here are some things I notices as I read through the notebook.
This notebook
alsodemonstrates one of xarray’s most powerful features: the ability to wrap dask arrays and allow users to seamlessly execute analysis code in parallel.
For example, consider taking the sum of a billion numbers. We might instead break up the array into 1,000 chunks, each of size 1,000,000, take the sum of each chunk, and then take the sum of the intermediate sums.
We also saw that we can use
Xarray
to accessdask.arrays
, which required passing achunks
argumenttoupon opening the dataset. Once the data were loaded into Xarray, wecancould interact with thexarray.Datasets
andxarray.DataArrays
as we would if we [were] working withdask.arrays
. This can be a powerful tool when working with these datasets!
I hope these suggestions help and are not to nit picky.
@jukent I responded to @jmunroe and @jnmorley 's comments here, it is ready for another review
@brian-rose - would you be willing to take a look at this content? We need one more review :)
@brian-rose - this is fixed in the latest build https://636ab705260c5c09049be5d4--pythia-foundations.netlify.app/core/xarray/dask-arrays-xarray.html
Thanks for the feedback @brian-rose !! I appreciate it! I think we can cover open_mfdataset
in a followup PR
@mgrover1 just checking in this. We identified this PR as "low hanging fruit" for the content push prior to the AGU meeting (https://discourse.pangeo.io/t/project-pythia-content-priorities-before-agu-ams/2914).
Do you expect to be able to incorporate the last round of edits sometime in the next two weeks?
@brian-rose sorry for the delay - it is ready for another review. I made those changes you requested.
@mgrover1 almost there! I spotted just a couple of formatting fixes still unresolved. Probably easier if I just fix them now and push onto this branch. Then I think we're done with this.
@mgrover1 almost there! I spotted just a couple of formatting fixes still unresolved. Probably easier if I just fix them now and push onto this branch. Then I think we're done with this.
Sounds like a plan! Thank you for your feedback here!
Link checker failure is unrelated, seems to be a service outage of some kind. I think this is good to go.
@brian-rose feel free to merge when ready :)
Closes #108
This adds content related to xarray and Dask. Instead of digging deep into a dedicated dask section, we add a bit on parallel and distributed computing with Dask + Xarray here.