hdrake opened 4 years ago
Awesome!
- Basic data types that make handling large NetCDF-like datasets efficient (low overhead), effortless (intuitive and compact syntax), scalable (distributable, out-of-memory), and extendable (flexible and simple data structure types).
I'm wondering if we could use NCDatasets.jl directly for that point. It is certainly efficient, effortless, and extendable. The scalability part is less clear, though: there is support for larger-than-RAM datasets, and some work on using Dagger has been done, but I must admit I still haven't had time to test those features.
That being said, ESDL is perhaps a more generic candidate (with support for other formats). From my tests, it seems to check all the boxes. However, I'm not certain how to configure everything for a distributed approach: exposing (and using) the cluster in ESDL.
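For the scalability point, the core feature being asked for is streaming reductions over chunks rather than loading a whole dataset at once. As a language-agnostic sketch of that idea (plain Python with a made-up binary file; this is not NCDatasets.jl, Dagger, or any real package API, just the concept):

```python
import os
import struct
import tempfile

def chunked_mean(path, chunk_elems=4096):
    """Stream 8-byte floats from a binary file and compute their mean
    without ever holding the full array in memory."""
    total, count = 0.0, 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_elems * 8)
            if not buf:
                break
            vals = struct.unpack(f"<{len(buf) // 8}d", buf)
            total += sum(vals)
            count += len(vals)
    return total / count

# Tiny demonstration: write 1000 doubles (0..999), then stream them back.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    for i in range(1000):
        tmp.write(struct.pack("<d", float(i)))
print(chunked_mean(tmp.name))  # mean of 0..999 is 499.5
os.remove(tmp.name)
```

Whether a package exposes this lazily (like dask) or via an explicit task graph (like Dagger) is exactly the part that still needs testing.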
2. Organizational packages that simplify the workflow (e.g. organizing model ensembles, models vs. observations, downloading scripts)
3. Generic utility packages that extend the functionality of 1-type packages (e.g. for problem-specific, dimension-specific, or operation-specific uses).
This was my initial aim with ClimateTools. I'm not there yet; my work focus at the time was implementing a quantile-quantile bias correction technique. A lot of time was nevertheless spent implementing extraction and utility functions. Ideally, those shouldn't be necessary now that we have newer packages for that.
I don't really know how much sense it makes to put effort into developing 2, 3, and 4 if we haven't yet settled on a stable 1.
Totally agree!
See https://github.com/esa-esdl/ESDL.jl/issues/170
It covers a lot of material for point #1.
Great! Also please see #2 where I added a somewhat related post
See https://github.com/JuliaClimate/meta/issues/3#issuecomment-594679533 about a notebook stack that we started putting together
It would be nice to have some examples of the existing JuliaClimate stack in action in notebooks (if there are redundancies, then maybe even feature/performance comparisons?).
My personal goal for the next few weeks is to basically implement a Julia translation of my "big data" Python tutorial [binder link], which uses the following stack:
- `xarray` and `dask` as the building blocks for reading and analyzing out-of-memory labelled NetCDF arrays
- `intake` (generic organizational tool leveraging `xarray`) to read in netcdf files stored in a cloud-optimized zarr format in Google Cloud storage as `xarray.Dataset` instances
- `xgcm` for doing grid-aware operations (e.g. differentiation) on datasets
- `xmitgcm` (model-specific package) to process the dataset's non-rectangular native grid into something more rectangular

These are basically the four categories of packages that I see as necessary to replicate the kinds of workflows that I am interested in (and which are extremely straightforward using the existing Pangeo Python stack, as in my example above):
I don't really know how much sense it makes to put effort into developing 2, 3, and 4 if we haven't yet settled on a stable 1.
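To make the grid-aware differentiation point concrete: "grid-aware" means respecting where variables live on a staggered model grid (e.g. tracers at cell centers, velocities at cell faces), so a derivative of a center-located field naturally lands on the faces. A minimal plain-Python sketch of a 1-D version of that idea (the function name and the uniform spacing `dx` are illustrative, not xgcm's actual API):

```python
def diff_center_to_face(values, dx):
    """Finite difference of cell-center values, returned on the interior
    cell faces of a 1-D uniform grid with spacing dx.

    For n center values this yields n - 1 face values, mirroring how
    grid-aware tools change the staggering of a field when differentiating.
    """
    return [(b - a) / dx for a, b in zip(values, values[1:])]

# Example: a tracer sampled at cell centers of a grid with dx = 0.5.
tracer = [0.0, 1.0, 4.0, 9.0]
print(diff_center_to_face(tracer, dx=0.5))  # [2.0, 6.0, 10.0]
```

The bookkeeping for which position a field occupies (and for non-rectangular native grids, as in the xmitgcm step) is exactly what these packages automate, and what a 1-type Julia package would need to support.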