COSIMA / cosima-cookbook

Framework for indexing and querying ocean-sea ice model output.
https://cosima-recipes.readthedocs.io/en/latest/
Apache License 2.0

Optimisation/best-practice xarray and dask programming patterns #210

Open aidanheerdegen opened 3 years ago

aidanheerdegen commented 3 years ago

Many people report problems with running calculations on large datasets, and would like some general advice on the best approaches for tackling large problems.

There are lots of parameters that determine the success/efficiency of a calculation:

  1. Order of operations
  2. Calculating intermediate results
  3. Dask chunking
  4. netCDF chunking on disk
  5. Number of dask workers (or not using a scheduler/dask at all)
  6. Number of threads and amount of memory per worker

It becomes very complex very quickly.
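As a rough illustration of where several of these knobs live in code, here is a minimal sketch on synthetic data. The array sizes, chunk sizes, and dimension names (`time`, `yt_ocean`, `xt_ocean`) are assumptions chosen for the example, not recommendations, and the local threaded scheduler stands in for whatever scheduler you actually use.

```python
import dask
import dask.array as da
import xarray as xr

# Synthetic stand-in for model output (sizes and names are illustrative only).
data = da.random.random((24, 60, 90), chunks=(6, 60, 90))
temp = xr.DataArray(data, dims=("time", "yt_ocean", "xt_ocean"), name="temp")

# Dask chunking (item 3): change the in-memory chunk layout lazily.
rechunked = temp.chunk({"time": 12, "yt_ocean": 30})

# Order of operations (item 1): select the region first, then reduce,
# so chunks outside the region are never computed.
region = rechunked.isel(yt_ocean=slice(0, 30))
lazy_mean = region.mean("time")

# Workers and threads (items 5 and 6): here the single-machine threaded
# scheduler with two threads. With dask.distributed the analogous settings
# are Client(n_workers=..., threads_per_worker=..., memory_limit=...).
(time_mean,) = dask.compute(lazy_mean, scheduler="threads", num_workers=2)

print(rechunked.chunks)
print(time_mean.shape)
```

Nothing is evaluated until `dask.compute` is called, so the chunking and the order of the selection and reduction can be changed cheaply and compared.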

One approach is to have some representative test calculations that can then be used as targets for optimisation. These test calculations can be rerun whenever the infrastructure or an algorithm changes, to check that performance has not degraded and to see whether it can be improved further.
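A minimal sketch of what such a check could look like, using only the standard library. The function names, the baseline value, and the tolerance factor are all hypothetical, and `toy_calculation` is a placeholder for a real test calculation.

```python
import time
from typing import Callable, Dict, List


def best_wall_time(calc: Callable[[], object], repeats: int = 3) -> float:
    """Run calc several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        calc()
        timings.append(time.perf_counter() - start)
    return min(timings)


def check_regressions(
    timings: Dict[str, float], baselines: Dict[str, float], tolerance: float = 1.5
) -> List[str]:
    """Return the names of calculations slower than tolerance times their baseline."""
    return [
        name
        for name, seconds in timings.items()
        if name in baselines and seconds > tolerance * baselines[name]
    ]


def toy_calculation() -> int:
    # Placeholder workload; a real test would run an xarray/dask calculation.
    return sum(i * i for i in range(100_000))


timings = {"toy_calculation": best_wall_time(toy_calculation)}
baselines = {"toy_calculation": 60.0}  # hypothetical stored baseline, in seconds
slow = check_regressions(timings, baselines)
print(timings, slow)
```

Taking the fastest of a few repeats reduces noise from a shared filesystem or a busy node, although for large calculations a single run may be all that is affordable.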

If that sounds like a useful idea then we need people to propose calculations that they know to be strenuous as possibilities for optimisation/best-practicification*. Ideally these would be fairly compact, reproducible chunks of code.

ping @AndyHoggANU @aekiss @adele-morrison @navidcy @angus-g

navidcy commented 2 years ago

OK, here's one!

https://gist.github.com/navidcy/b12e5469d1a809cc4c9b447456da1fe5

(better viewed in nbviewer)

cc: @ongqingyee and @angus-g. @angus-g this is the one I was chatting with you about yesterday

I'm guessing that I should save the interpolated fields and reload them... But this might be just my random (or semi-educated) guess...
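For reference, the save-and-reload pattern could look roughly like the sketch below. It does not reproduce the gist: the field, coordinate names, grid sizes, and file name are all made up for illustration, and it assumes scipy (for `interp`), a netCDF backend, and dask are installed.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Hypothetical coarse field; names and sizes are illustrative only.
coarse = xr.DataArray(
    np.random.rand(12, 10, 20),
    dims=("time", "y", "x"),
    coords={
        "time": np.arange(12.0),
        "y": np.linspace(0.0, 1.0, 10),
        "x": np.linspace(0.0, 1.0, 20),
    },
    name="field",
)

# The expensive step: interpolate onto a finer x grid, once.
fine_x = np.linspace(0.0, 1.0, 40)
interpolated = coarse.interp(x=fine_x)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "interpolated.nc")
    # Save the intermediate result so later steps do not repeat the interpolation.
    interpolated.to_netcdf(path)
    # Reload lazily with dask chunks; the task graph now starts from the file.
    with xr.open_dataarray(path, chunks={"time": 6}) as reloaded:
        result = reloaded.mean("time").compute()

print(result.shape)
```

Whether this helps depends on how expensive the interpolation is relative to the write and read, so it is worth timing both variants.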

navidcy commented 2 years ago

Actually, I've just noticed that this MnWE might not be as relevant here since it does not use the cookbook... Oh well....

angus-g commented 2 years ago

The cookbook only really wraps the act of getting the data in the first place, so it's the actual (attempted) computation that's more important IMO. Thanks for the example! I'll take a look.

access-hive-bot commented 1 year ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/cosima-cookbook-updating-needs/130/2