mathause opened 3 years ago
@znicholls same comment as for #105 but probably even more important to get your input on this at some point!
I think it depends on the intended access pattern. In my head, the access pattern for MESMER is always (and if this assumption is wrong, ignore everything I've written):
Given this, calibrating multiple models is an embarrassingly parallel problem (you can calibrate to multiple CMIP models at once; you don't need information from one calibration to inform another one). Hence I would build the internal calibration data structure for the one model/calibration case (so things like needing to store the model as a dimension disappear, and we can just use scenario/experiment as the dimension). That should help simplify that scope.

However, we will then need a second (and maybe third) 'outer-layer' structure which can handle the case where we want to calibrate multiple things at once. That outer layer will need to handle things like parallelisation, knowing which functions to call, how to pass parameters etc. Having multiple layers means that we separate the concerns of each layer, which should make implementing them easier.

The other outer layer we'd probably want is an output storage layer (i.e. some way to store the outputs from multiple calibrations yet still be able to search them). I suspect that would be fairly easy to build though, because the storage layer would only have one concern (so doing some slightly sneaky things is probably fine).
If all the above is correct, then for our internal layer I would use either a DataArray (and I would just let the user/outer layer decide whether historical and scenario should be joined or treated separately) or a DataList (having each experiment to use in calibration as one element in the list).
For the outer layer (i.e. the one that handles multiple model calibrations), a DataList or even just boring tuples or dictionaries should be fine, as this outer layer won't be handling any big data (all it's doing is kicking off calibrations and making sure the results get saved in the right place).
For the storage layer I would go with a DataList, and then we can write some thin wrappers around it to make it easy to search things as needed. Maybe a better option would be a class like the one below, which allows you to search easily, making the most of pandas' 2D data handling whilst also making it hard to break the connection between the data and the metadata.
class MesmerOutputStore:
    def __init__(self):
        # load a 2D table which stores simple metadata info about the data
        self._meta = ...
        # load from a netCDF file which actually has all the calibrations
        self._calibrations = ...

    def get_calibration(self, **kwargs):
        """Get calibrations"""
        selected = self._meta.copy()
        # setting it up this way makes it super easy to then go and filter/search
        for k, v in kwargs.items():
            selected = selected[selected[k] == v]
        # storing metadata in the table means that xarray doesn't have to store it,
        # so we don't run into the multi-dimensional sparsity issue
        return self._calibrations.sel(id=selected["id"])
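The same meta-table idea can be shown end-to-end. Below is a runnable toy version of the concept with made-up models, scenarios and parameter values (none of these names come from mesmer):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy metadata table: one row per stored calibration (all names/values made up)
meta = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "model": ["CESM", "CESM", "MPI-ESM"],
        "scen": ["ssp126", "ssp585", "ssp585"],
    }
)

# Toy calibration store: parameters indexed by the same "id"
calibrations = xr.DataArray(
    np.arange(6.0).reshape(3, 2),
    dims=("id", "param"),
    coords={"id": [0, 1, 2]},
)

def get_calibration(**kwargs):
    """Filter the 2D meta table first, then select the matching data by id."""
    selected = meta.copy()
    for k, v in kwargs.items():
        selected = selected[selected[k] == v]
    return calibrations.sel(id=selected["id"].values)

result = get_calibration(model="CESM", scen="ssp585")
# only the calibration with id == 1 is returned
```

Because the filtering happens entirely on the small 2D table, the netCDF store never needs to hold the metadata as extra dimensions.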
We've actually got this kind of wrapper in https://github.com/openscm/scmdata. It basically has a meta table (https://github.com/openscm/scmdata/blob/029707c57427608026b77f1e263947d8b2a06fac/src/scmdata/run.py#L951) and then a pandas dataframe (cause we only deal with timeseries data). Filtering can then be done really fast (https://github.com/openscm/scmdata/blob/029707c57427608026b77f1e263947d8b2a06fac/src/scmdata/run.py#L1074) because you only search meta before deciding what data to return. Storing the meta as a 2D table means that things don't explode even if you have super sparse data. We wouldn't want all of ScmRun here, but the filtering stuff could be really handy, and we just put a different data model with it.
In my head, the access pattern for MESMER is always [...]
I agree with this access pattern, it is clear to me.
Given this, calibrating multiple models is an embarrassingly parallel problem (you can calibrate to multiple CMIP models at once, you don't need information from one calibration to inform another one). Hence I would build the internal calibration data structure for the one model/calibration case (so things like needing to store the model as the dimension disappear, and we can just use scenario/experiment as the dimension)
If by model you mean ESM, not statistical model, I would say that the data structure does not matter much. Whether we have one huge DataArray, a DataList or a DataTree, it would not matter that much for the calibration on one ESM or several ESMs. In my view, the default of MESMER should be 1 calibration / 1 ESM, which means that it would need to reshape the data anyway to calibrate on several ESMs at once - hence the different layers that you mention.
Though, for the outer layer, I would stick to a DataList; tuples or dictionaries would be just as handy as a DataList.
For the calibration, I slightly prefer the huge DataArray to the DataList; I consider the former easier to navigate and use than a list. In terms of RAM it can be handled: by my estimate it is about 10 GB if we keep only the gridpoint axis, not the full maps. Besides, by not .load()-ing the netCDF directly into memory, I think it should be ok, at the cost of more I/O. However, for emulations, we cannot have everything in RAM at once.
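As a sanity check of that estimate, a rough back-of-the-envelope calculation (all sizes below are illustrative assumptions, not the actual CMIP6 numbers):

```python
# all sizes are illustrative assumptions
n_models, n_scens, n_members = 30, 5, 10
n_time, n_gridpoints = 250, 2500  # yearly steps, land gridpoints only
bytes_per_value = 8               # float64

total_bytes = n_models * n_scens * n_members * n_time * n_gridpoints * bytes_per_value
print(f"{total_bytes / 1e9:.1f} GB")  # 7.5 GB for a fully dense array
```

With sizes of this order, a dense array lands in the single-digit-GB range, consistent with the estimate above; missing members would make the real array sparser.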
On the emulation part, the current situation - as far as I know - is that the historical part is emulated separately for each scenario, even though it is the same. Over CMIP6, with 5 scenarios, avoiding this would cut the amount to calculate by about half. This point should be considered for the data structure and the emulation.
If we go for a huge DataArray, I think that we can simply allocate an empty array using np.empty(tuple_dim_sizes) and fill it in a loop, or use a masked_array to deal with the fill value.
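A minimal numpy sketch of the fill-and-mask approach (shapes and values are made up):

```python
import numpy as np

# assumed shapes: 3 scenarios with unequal numbers of time steps
n_scen, n_time_max = 3, 4
data = np.full((n_scen, n_time_max), np.nan)

# fill what exists; the last scenario is shorter than the others
lengths = [4, 4, 2]
for i, n in enumerate(lengths):
    data[i, :n] = np.arange(n)

# a masked array hides the fill values instead of carrying NaNs around
masked = np.ma.masked_invalid(data)
print(masked.count())  # 10 valid values
```

Downstream code can then operate on `masked` without special-casing the missing entries.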
And yes, it would cause issues if ESMs have different grids... But issues would arise at other points anyway, for instance with the geographical distance matrix during the training of the local variability.

If we stick to DataLists for each layer, I would definitely like such a wrapper to gather the correct items more easily. And actually, such a structure may be closer to what scmdata does, so that is another plus.
I refer to https://github.com/MESMER-group/mesmer/pull/109#issuecomment-953332847 from @znicholls (sorry my comments are all over the place). I thought a bit more about how we can avoid concatenating hist + proj and having very sparse arrays, and also about how a DataList structure could work in mesmer. Assuming we have a DataList which looks like:
dl = DataList(
    (da, {"scen": "historical", "model": "CESM", ...}),
    (da, {"scen": "ssp126", "model": "CESM", ...}),
    (da, {"scen": "ssp585", "model": "CESM", ...}),
)
i.e. one entry per scenario and model, where da has dims time x gridpoint x member. Then it should be 'relatively elegant' to calculate anomalies and flatten the DataArray objects.
def calculate_anomalies(dl: DataList, how="individual", period=slice("1850", "1900"), dim="time"):
    out = DataList()
    # loop over all simulations
    for ds, meta in dl:
        # find the corresponding historical simulation
        hist = dl.select(model=meta["model"], scen="historical")
        if how == "individual":
            # given this makes an inner join it will work
            ds = ds - hist.sel({dim: period}).mean(dim)
        elif how == "all":
            ds = ds - hist.sel({dim: period}).mean()
        else:
            raise ValueError(f"unknown 'how': {how}")
        out.append(ds, meta)
    return out
We loop over all elements in dl, select the corresponding historical simulation and subtract it. For how="individual" we rely on xarray to align ds and hist: the subtraction does an inner join, so we get rid of all projections that do not have a corresponding historical simulation.
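The inner-join behaviour can be seen with a tiny example (the member labels are made up):

```python
import xarray as xr

# hist has members r1 and r2; the projection has r1 and r3
hist = xr.DataArray([1.0, 2.0], dims="member", coords={"member": ["r1", "r2"]})
proj = xr.DataArray([5.0, 7.0], dims="member", coords={"member": ["r1", "r3"]})

# binary operations align coordinates with an inner join by default:
# only the common member "r1" survives the subtraction
anom = proj - hist
print(anom.member.values, anom.values)  # ['r1'] [4.]
```

Members without a historical counterpart are silently dropped, which is exactly the behaviour relied on above.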
def expand_dims(ds, meta):
    return ds.expand_dims(scen=meta["scen"])

def stack_except(ds, dim):
    dim = [dim] if isinstance(dim, str) else dim
    dims = set(ds.dims) - set(dim)
    return ds.stack(stacked=dims)

dl = dl.map(expand_dims, pass_meta=True)
dl = dl.map(stack_except, dim="cell")
dl = dl.concat(along="stacked")
We add scen as a dimension to each da, then we stack all dimensions except "cell", and finally we concatenate along the new stacked dimension.
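The stacking step in isolation, with assumed dimension sizes:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((2, 3, 4)), dims=("time", "member", "cell"))

# stack every dimension except "cell" into one new "stacked" dimension
dims = tuple(set(da.dims) - {"cell"})
stacked = da.stack(stacked=dims)
print(dict(stacked.sizes))  # cell stays at 4, stacked collapses 2 * 3 = 6
```

After stacking, each element can be concatenated along the common `stacked` dimension regardless of its original time/member layout.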
Then it should be 'relatively elegant' to calculate anomalies and flatten the DataArray objects
I would say this looks absolutely beautiful! Very nice indeed and a great solution to a nasty problem (yes, maybe we find some reason to have to adjust things in future but I think the benefits of trying such a great solution now outweigh the risks of having to adjust in future).
i.e. one entry per scenario and model, where da has dims time x gridpoint x member.
Yes, aligning hist with projections is a clean solution to provide us with the common members for ESM x scenarios. However, MESMER may use two variables, e.g. tas and hfds. Then later in the code, when using the da, we must still check for the members common to tas and hfds, since they may have different sets. (@mathause, that is why I had introduced this function to select the members common to tas, hfds and TXx.)
I would say this looks absolutely beautiful!
Thanks :blush: I do still have a lot of open questions....
- How far should we go? (e.g. ScmRun implements a lot of nice functionality, but this kind of replicates a Dataset.) We don't need to answer this right now - that can be developed*.
- Is there a package we could rely on for this? (I don't know of any except maybe ScmRun.)
- Is DataList a good name for the class? What else could it be?
- Do we have a good name for the package?**
- How do we store meta internally - dict or pd.DataFrame?
?However, MESMER may use two variables, eg tas and hfds. Then later in the code, when using the da, we must still check for common members to tas and hfds, for they may have different sets.
Yes this is indeed an issue. I see three options.

- Add a deep option to datalist.align which also aligns the DataArray/Datasets.
- Put all variables into one DataList - then dl.align would take care of it (without any deep param).
- Keep one DataList per variable and align them explicitly.

IIRC @leabeusch would prefer to have one DataList for all variables. My preference would be to have one per variable. What I imagine is something of the sort:
tas = mesmer.io.cmip6_ng.load(variable="tas")
hfds = mesmer.io.cmip6_ng.load(variable="hfds")
tas, hfds = dl.align(tas, hfds)
*I think we could allow passing strings to DataList.map, which would call the underlying methods - that could be a good compromise. E.g.
dl.map("sel", time=slice("1850", "1900"))
# would be turned into
for data, meta in dl:
    getattr(data, "sel")(time=slice("1850", "1900"))
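A runnable sketch of how such string dispatch could work; the `map_method` helper here is hypothetical, not an existing DataList method:

```python
import xarray as xr

def map_method(dl, method, **kwargs):
    """Hypothetical helper: call an xarray method by name on each element."""
    return [(getattr(data, method)(**kwargs), meta) for data, meta in dl]

# a plain list stands in for a DataList of (data, meta) tuples
dl = [
    (
        xr.DataArray([1.0, 2.0, 3.0], dims="time", coords={"time": [1850, 1900, 1950]}),
        {"scen": "historical"},
    ),
]

out = map_method(dl, "sel", time=slice(1850, 1900))
print(out[0][0].sizes["time"])  # 2 (label-based slicing is inclusive)
```

Dispatching by name keeps the DataList API small while still exposing the full xarray method surface.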
**The best I can usually come up with is to name the package the same as the class which is no good, because then the following happens:
import datalist as dl
dl = dl.DataList(...)
# aahrg
B.t.w. thanks for all your input! That helps a lot.
My two cents on the questions
- How far should we go?
Maximise code reuse as much as possible. I'm very happy to do some refactoring in scmdata so that we can use the filtering capability here without picking up the rest of scmdata's assumptions.
- Is there a package we could rely on for this? (I don't know of any except maybe ScmRun.)
The only one I know is scmdata or pyam (they both have a similar idea for filtering). I would go for scmdata as I have more control. Maybe someone else knows of other options for this sort of filtering wrapping though.
- Is DataList a good name for the class? What else could it be?
I like it, and if there are no other similar packages being worked on I think it's a good start. You could make clear that it's xarray-specific by calling it something like DataArrayList or DatasetList?
- Do we have a good name for the package?**
I think datalist is good. You can then do the following to avoid your namespace explosion:
from datalist import DataList
dl = DataList(...)
A different option would be xarray_lists?
- How do we store meta internally - dict or pd.DataFrame?
pd.MultiIndex - it means we can use categories for the various bits of metadata, which helps memory usage and gives a massive speed boost for filtering (and scmdata has done all the work of making MultiIndexes behave).
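A small illustration of the categorical MultiIndex idea (all values made up):

```python
import pandas as pd

# categorical levels keep memory low and speed up level-based selection
index = pd.MultiIndex.from_arrays(
    [
        pd.Categorical(["CESM", "CESM", "MPI-ESM"]),
        pd.Categorical(["ssp126", "ssp585", "ssp585"]),
    ],
    names=["model", "scen"],
)
meta = pd.DataFrame({"id": [0, 1, 2]}, index=index)

# select all rows for one model via the named index level
selected = meta.xs("CESM", level="model")
print(selected["id"].tolist())  # [0, 1]
```

Because each level stores only the category codes, the index stays compact even with many repeated model/scenario labels.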
- I would kind of like to develop datalist outside of mesmer but I am not sure how good this idea is as long as it is not stable.
Start within mesmer, then split out once we've got the interface stable enough.
one DataList for all variables
I would lean towards this (all data in one place). I would write align so that you tell it which dimension to ignore when doing the alignment (e.g. align("variable")), and then align can raise an error if the two datalists being aligned have more than one variable. Something like
# assume dl contains both tas and hfds
# this would work fine
dl.filter(variable="tas").align(dl.filter(variable="hfds"), "variable")
# this would raise an error because there is more than one variable in dl
dl.filter(variable="tas").align(dl, "variable")
ValueError("self must have only one variable in the dimension being ignored during the alignment")
How do we store meta internally - dict or pd.DataFrame?
Actually, why not use the attributes of the DataArray? That way we don't separate the data from its information, and whenever we need to align/concatenate/etc., we can simply adapt the attributes of the new variable accordingly.
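A quick sketch of the attrs idea; it also shows one caveat worth keeping in mind, namely that xarray arithmetic drops attrs by default (this can be changed with xr.set_options(keep_attrs=True)):

```python
import xarray as xr

da = xr.DataArray([1.0, 2.0], dims="time", attrs={"model": "CESM", "scen": "ssp585"})

# the metadata travels with the DataArray itself
print(da.attrs["model"])  # CESM

# caveat: arithmetic drops attrs unless keep_attrs is enabled
anom = da - da.mean()
print(anom.attrs)  # {}
```

So the attrs approach works, but every operation that creates a new object needs to re-attach or keep the attributes explicitly.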
IIRC @leabeusch would prefer to have one DataList for all variables
I also prefer having all the data in a single file. Intuitively, I was simply thinking of something like this:
data = mesmer.io.cmip6_ng.load(variable=["tas", "hfds"])  # later, precipitation, climate extremes and so on

# if the format is directly a dl/DataList, something like:
data = data.align(["tas", "hfds"])  # or any subset of the variables in data
And inside dl.align, the dl.filter proposed by @znicholls could be integrated.
I haven't yet understood everything going on here from a code perspective, so in case some of my statements don't make any sense, it could really just be a silly misunderstanding -> sorry in advance for that! :sweat_smile: But I'm confident that with some more time / explanations I'll catch up eventually. :smile: I'll nevertheless try to add a few things at this stage already:
IIRC @leabeusch would prefer to have one DataList for all variables. My preference would be to have one per variable.
I just really want to keep the option for MESMER to go multivariate. Meaning: I'd like to be able to pass e.g. temperature and precipitation fields from a single ESM (but from various emission scenarios) at once to whatever forced-response module (e.g. regression or something fancier) and internal-variability module (e.g. multivariate Gaussian after some transformations) I have available. Thus it seems weird to me to have those two variables in different datalists. It would e.g. make more sense to me to have separate datalists per ESM rather than per variable. But maybe I have misunderstood what would need to be in the same datalist to achieve the functionality I describe?
I had a chat with Sarah about her mesmer temperature/precipitation analysis. She currently organizes her data in one large xr.Dataset (see below). It can be practical to have everything in one place. However, she currently only looks at one simulation (?), and what I am not entirely sure about is how this will scale to several models, scenarios etc...
<xarray.Dataset>
Dimensions: (time: 1932, lat: 72, lon: 144, month: 12)
Coordinates:
* time (time) object 1920-01-16 12:00:00 ... 208...
* lon (lon) float32 -178.8 -176.2 ... 176.2 178.8
* lat (lat) float32 -88.75 -86.25 ... 86.25 88.75
* month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Data variables:
pr (time, lat, lon) float32 ...
tas (time, lat, lon) float32 ...
lsm (lat, lon) float32 nan nan nan ... nan nan
log_pr (time, lat, lon) float32 ...
monthly_global_tas (time) float32 266.0 266.2 ... 275.8 271.3
tas_hp (time, lat, lon) float32 ...
pr_hp (time, lat, lon) float32 ...
log_pr_hp (time, lat, lon) float32 ...
tas_res_variability (time, lat, lon) float32 ...
log_pr_res_variability (time, lat, lon) float32 ...
tas_autoreg_res_variability (time, lat, lon) float32 ...
log_pr_autoreg_res_variability (time, lat, lon) float32 ...
tas_generated_variability (time, lat, lon) float32 ...
log_pr_generated_variability (time, lat, lon) float32 ...
tas_autoregressed (time, lat, lon) float32 ...
log_pr_autoregressed (time, lat, lon) float32 ...
tas_autoreg_slope (month, lat, lon) float32 ...
tas_autoreg_offset (month, lat, lon) float32 ...
log_pr_autoreg_slope (month, lat, lon) float32 ...
log_pr_autoreg_offset (month, lat, lon) float32 ...
Attributes:
CDI: Climate Data Interface version 1.9.10 (https://mpimet.mpg.d...
Conventions: CF-1.6
history: Tue Oct 19 15:25:28 2021: cdo remapcon2,output_grid_descrip...
CDO: Climate Data Operators version 1.9.10 (https://mpimet.mpg.d...
I am not entirely sure is how this will scale to several models, scenarios etc
Yes, particularly how it would handle the very annoying history/scenario split
But it's of course super convenient because you only need to schlep one data structure around.
An alternative to a DataList would be to (ab)use a pandas DataFrame - similar to geopandas. Either by subclassing or via an accessor (see extending pandas). The DataFrame would then need a dedicated column for the DataArray or Dataset objects and could have an accessor to do fun stuff with it. But again, I am not sure how this would work out exactly.
A pointer to cf_xarray and its accessor, which seems to wrap various methods and classes of xarray - a potential inspiration for how to do this for a DataList (or xrDataList or whatever I called it):
https://github.com/xarray-contrib/cf-xarray/blob/main/cf_xarray/accessor.py
edit: forgot the link
Some updates:
/net/cfc/landclim1/mathause/projects/mesmer/datalist
I had a play with data tree and really liked it. For the kind of data we have here, it seemed the right choice (at least for container, then you just build on top of it as needed). Looking at your links, putting some of the data list ideas onto data tree (or a class that wraps a data tree) seems the best choice to me...
filefinder looks sick by the way, can't wait to use that.
@mathause I am hoping to build some sort of CMIP data processing thing this year (we need it for MAGICC calibration). Would you be interested in collaborating or is finding time for this an ongoing dream that is never realised?
I had a play with data tree and really liked it. For the kind of data we have here, it seemed the right choice (at least for container, then you just build on top of it as needed). Looking at your links, putting some of the data list ideas onto data tree (or a class that wraps a data tree) seems the best choice to me...
@veni-vidi-vici-dormivi (Victoria) started playing with datatree and it indeed looks promising
filefinder looks sick by the way, can't wait to use that.
:blush:
@mathause I am hoping to build some sort of CMIP data processing thing this year (we need it for MAGICC calibration). Would you be interested in collaborating or is finding time for this an ongoing dream that is never realised?
Yes, please keep me in the loop! I assume you have heard of jbusecke/xMIP?
Yes, please keep me in the loop! I assume you have heard of jbusecke/xMIP?
Yep. We want to be able to download our own data though so our use of it depends a bit on how tightly coupled it is to the pangeo data archives.
NOTE: this issue is very much a draft at the moment.
We should give some thought on the internal data structure in mesmer. IMHO this is one of the more important but also difficult things to decide on. Generally the idea is to move to a xarray-based data structure. However, it is unclear how we want to carry metadata around (e.g. model name, experiment, etc.).
Current status
Currently a triply nested dictionary of numpy arrays. Something that roughly looks like:
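For illustration, a hedged sketch of what such a triply nested dictionary could look like (model names and array shapes are made up):

```python
import numpy as np

# model -> scenario -> variable -> numpy array
data = {
    "CESM": {
        "historical": {"tas": np.zeros((165, 2500))},
        "ssp585": {"tas": np.zeros((250, 2500))},
    },
    "MPI-ESM": {
        "ssp585": {"tas": np.zeros((250, 2500))},
    },
}

print(data["CESM"]["ssp585"]["tas"].shape)  # (250, 2500)
```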
Ideas
1. Keep as is

Pros/ Cons

- familiar access pattern (e.g. data["CESM"]["ssp585"]["tas"])

2. One huge DataArray
We could potentially create one enormous DataArray that encapsulates all necessary information as coordinates (and non-dimension coordinates). It could look something like this:
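A possible sketch, assuming a flat run dimension with model/scen attached as non-dimension coordinates (sizes and labels are illustrative only):

```python
import numpy as np
import xarray as xr

# one flat "run" dimension; model/scen live as non-dimension coordinates
da = xr.DataArray(
    np.zeros((3, 4, 5)),
    dims=("run", "time", "gridpoint"),
    coords={
        "model": ("run", ["CESM", "CESM", "MPI-ESM"]),
        "scen": ("run", ["historical", "ssp585", "ssp585"]),
    },
)

# non-dimension coordinates allow boolean selection along "run"
subset = da.isel(run=(da.model == "CESM").values)
print(subset.sizes["run"])  # 2
```

Keeping model/scen as coordinates on a single flat dimension avoids the multi-dimensional sparsity that a dense model x scen x member array would have.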
Pros/ Cons
3. DataList
DataList is a data structure inspired by ESMValTool that I used extensively for my analysis of the CMIP6 data for IPCC. It is used as a flat data container to store model data (e.g. xr.DataArray objects) and its metadata. It is a list of (data, metadata) tuples, where ds is a data store (e.g. an xr.DataArray) and meta is a dict containing metadata, e.g. meta = {"model": "CESM", "exp": "ssp585", ...}. This allows us to (1) work with a convenient flat data structure (no nested loops), (2) store arbitrary data (e.g. xr.DataArray objects with different grids), and (3) carry metadata around without having to alter the data store.

The DataList structure could be turned into a class, which would allow for a convenient interface. E.g. it could allow for things like
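For illustration, a minimal sketch of what such a class could look like; the select/map interface shown here is hypothetical:

```python
import xarray as xr

class DataList(list):
    """Minimal sketch of a (data, meta) container; the interface is hypothetical."""

    def select(self, **attrs):
        # keep elements whose meta matches all given key/value pairs
        return DataList(
            (data, meta)
            for data, meta in self
            if all(meta.get(k) == v for k, v in attrs.items())
        )

    def map(self, func, **kwargs):
        # apply func to each data element, keeping the meta untouched
        return DataList((func(data, **kwargs), meta) for data, meta in self)

dl = DataList(
    [
        (xr.DataArray([1.0]), {"model": "CESM", "scen": "ssp585"}),
        (xr.DataArray([2.0]), {"model": "MPI-ESM", "scen": "ssp585"}),
    ]
)
print(len(dl.select(model="CESM")))  # 1
```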
Pros/ Cons
4. DataTree
WIP implementation of a tree-like hierarchical data structure for xarray, see https://github.com/TomNicholas/datatree
Pros/ Cons
5. Any other ideas?