hdrake opened 4 years ago
Awesome!
- Basic data types that make handling large NetCDF-like datasets efficient (low overhead), effortless (intuitive and compact syntax), scalable (distributable, out-of-memory), and extendable (flexible and simple data structure types).
I'm wondering if we could use NCDatasets.jl directly for that point. It is certainly efficient, effortless, and extendable. The scalability part is less clear, though: there is support for larger-than-RAM datasets, and some work on using Dagger has been done, but I must admit I still haven't had time to test those features.
That being said, ESDL is perhaps a more generic candidate (with support for other formats). From my tests, it seems to check all the boxes. However, I'm not certain how to configure everything for a distributed approach: exposing (and using) the cluster in ESDL.
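For the scalability point, the core feature being asked for is streaming reductions over chunks rather than loading a whole dataset at once. As a language-agnostic sketch of that idea (plain Python with a made-up binary file; this is not NCDatasets.jl, Dagger, or any real package API, just the concept):

```python
import os
import struct
import tempfile

def chunked_mean(path, chunk_elems=4096):
    """Stream 8-byte floats from a binary file and compute their mean
    without ever holding the full array in memory."""
    total, count = 0.0, 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_elems * 8)
            if not buf:
                break
            vals = struct.unpack(f"<{len(buf) // 8}d", buf)
            total += sum(vals)
            count += len(vals)
    return total / count

# Tiny demonstration: write 1000 doubles (0..999), then stream them back.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    for i in range(1000):
        tmp.write(struct.pack("<d", float(i)))
print(chunked_mean(tmp.name))  # mean of 0..999 is 499.5
os.remove(tmp.name)
```

Whether a package exposes this lazily (like dask) or via an explicit task graph (like Dagger) is exactly the part that still needs testing.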
2. Organizational packages that simplify the workflow (e.g. organizing model ensembles, models vs. observations, downloading scripts)
3. Generic utility packages that extend the functionality of 1-type packages (e.g. for problem-specific, dimension-specific, or operation-specific uses).
This was my initial aim with ClimateTools. I'm not there yet; my work focus at the time was implementing a quantile-quantile bias correction technique. A lot of time was nevertheless spent implementing extraction and utility functions. Ideally, those shouldn't be necessary now that we have newer packages for that.
I don't really know how much sense it makes to put effort into developing 2, 3, and 4 if we haven't yet settled on a stable 1.
Totally agree!
See https://github.com/esa-esdl/ESDL.jl/issues/170
It covers a lot of material for point #1.
Great! Also please see #2 where I added a somewhat related post
See https://github.com/JuliaClimate/meta/issues/3#issuecomment-594679533 about a notebook stack that we started putting together
It would be nice to have some examples of the existing JuliaClimate stack in action in notebooks (if there are redundancies, then maybe even feature/performance comparisons?).
My personal goal for the next few weeks is to basically implement a Julia translation of my "big data" Python tutorial [binder link], which uses the following stack:
- `xarray` and `dask` as the building blocks for reading and analyzing out-of-memory labelled NetCDF arrays
- `intake` (generic organizational tool leveraging `xarray`) to read in netcdf files stored in a cloud-optimized zarr format in Google Cloud storage as `xarray.Dataset` instances
- `xgcm` for doing grid-aware operations (e.g. differentiation) on datasets
- `xmitgcm` (model-specific package) to process the dataset's non-rectangular native grid into something more rectangular

These are basically the four categories of packages that I see as necessary to replicate the kinds of workflows that I am interested in (and which are extremely straightforward using the existing Pangeo Python stack, as in my example above):
I don't really know how much sense it makes to put effort into developing 2, 3, and 4 if we haven't yet settled on a stable 1.
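To make the grid-aware differentiation point concrete: "grid-aware" means respecting where variables live on a staggered model grid (e.g. tracers at cell centers, velocities at cell faces), so a derivative of a center-located field naturally lands on the faces. A minimal plain-Python sketch of a 1-D version of that idea (the function name and the uniform spacing `dx` are illustrative, not xgcm's actual API):

```python
def diff_center_to_face(values, dx):
    """Finite difference of cell-center values, returned on the interior
    cell faces of a 1-D uniform grid with spacing dx.

    For n center values this yields n - 1 face values, mirroring how
    grid-aware tools change the staggering of a field when differentiating.
    """
    return [(b - a) / dx for a, b in zip(values, values[1:])]

# Example: a tracer sampled at cell centers of a grid with dx = 0.5.
tracer = [0.0, 1.0, 4.0, 9.0]
print(diff_center_to_face(tracer, dx=0.5))  # [2.0, 6.0, 10.0]
```

The bookkeeping for which position a field occupies (and for non-rectangular native grids, as in the xmitgcm step) is exactly what these packages automate, and what a 1-type Julia package would need to support.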