Open znicholls opened 2 years ago
fyi @lewisjared
First question:
Ignore my first question, I've just read the docs properly
Re the second question, I just tried the recipe below:
It gives the following output, which I think shows that the concatenation doesn't work unfortunately because of the overlap between experiments.
I think the issue is in https://github.com/ESMValGroup/ESMValCore/blob/5d5ddddd633024dee741f54768d44d32023f3a73/esmvalcore/preprocessor/_io.py#L183 and its interaction with https://github.com/ESMValGroup/ESMValCore/blob/5d5ddddd633024dee741f54768d44d32023f3a73/esmvalcore/preprocessor/_io.py#L388
The result is that everything is joint starting from the oldest experiment (i.e. historical) and always picking the longer cube so we end up with ssp585 from 2019-2068 instead of G6solar (because the G6solar experiment comes in two files whereas ssp585 is all in one so the longer ssp585 file gets picked instead of the first, shorter G6solar one). To fix it in a robust way I think you'd have to specify that G6solar was the experiment you were actually interested in and use the branch points to pick the appropriate subset of the parent experiments as you walked back up the ancestry rather than the current concatenation algorithm. I'm not sure how such a fix is best implemented though (new pre-processor?).
We discussed some aspects of this recently in #1423. Basically, I agree with you. It would be great to have greater flexibility in specifying how a concatenation operation should deal with overlaps. Adding your example is a nice extension to the use-cases that were considered before.
Perhaps we can approach this best by discussing a nice interface that could work both in recipe yaml files and in the pure Python API? Considering that we expect more than one overlap (e.g. picontrol + historical + scenario) do you think it's enough to specify one priority? Perhaps an ordered list of datasets by priority would be good?
PS: If general enough, we might feed that back upstream to Iris.
My suggestion would be to have a function that looked something like the options below
def join_to_parents(in_file):
# 1. Load/download the file
# 2. Find parent
# 3. Load/download parent file, throwing away all data after the branch point
# 4. Find parent of parent
# 5. repeat until some stop condition is hit (e.g. parent is piControl or piControl-spinup)
# 6. Return concatenation in time (now with no overlap as we throw away data after branch points)
# Pros:
# - Only needs one file as input, everything flows from there
# - Hierarchy is well-defined (only keep parent files from before the branch point)
# Cons:
# - Requires downloading file in order to work out what is needed so might not play nice with
# ESMValTool's general idea of defining all datasets to be used up front and then only downloading
# once (or at least I think that's ESMValTool's general idea, maybe I misunderstand)
def join_to_parents(child, ancestry):
"""
Parameters
------------
child : str
Path to child data
ancestry : list[str]
List of files that make up ``child``'s ancestry. Order doesn't matter.
"""
# 1. Load child
# 2. Find parent
# 3. Load parent from ancestry, keeping only data from before branch point
# 4. Find parent of parent
# 5. repeat until some stop condition is hit (e.g. parent is piControl or piControl-spinup)
# 6. Return concatenation in time (now with no overlap as we throw away data after branch points)
# Pros:
# - Can be done as a pre-processing step after all files are downloaded in more general workflow
# - Hierarchy is well-defined (only keep parent files from before the branch point)
# Cons:
# - Requires knowing what files make up the ancestry in advance (could of course be automated)
just wanted to drop a line and thank @znicholls for this idea and prospective implementation, it sounds great to me! :beer:
Ok, I see where you are going. Given that we generally have a list of all the datasets, how about simply joining all cubes in a list by identifying the potential parent and performing the join if the parent is part of the list?
Ok great. Last question before I prototype: should this be a pre-processing option (would require users to somehow specify what the child dataset is in their recipe) or is it better as a post-processing utility script for use in diagnostics? I'm leaning to the latter because it seems much simpler. Perhaps once we have the latter, one of you can see how to make it useful in pre-processing too...
My intuition is to put it into the loading parts that we already have, but if more information by the user is needed, that may not be feasible.
Ok great thanks for the feedback
In any case, don't let it stop you. I am sure once the code is in place we can move it to the best location. Good luck and thanks for giving it a shot! :+1:
@zklaus and @valeriupredoi I've made a prototype locally. It is about 700 lines altogether (but will hopefully be shorter once cleaned up). I obviously don't want to make a 700 line PR (plus tests) so I will try and break it down. My thinking was that I would tackle it in the following way:
dev-add-ancestry
. That branch can be where all the PRs get added todev-add-ancestry
will have a couple of hundred lines of new code compared to main
. This could then all be merged into main
at once, comfortable in the knowledge that it has all been added and reviewed appropriately along the way.An alternative would be to merge all the PRs into main
as we go. I assume that's not a desirable route as you'd then have a bunch of stuff with NotImplementedError
floating around or have to deal with a massive PR, neither of which seem desirable.
Thanks for your thorough approach @znicholls. I don't think we've had such a well-thought out approach yet. We did have some massive PRs, however, and those tend to create delays and headaches. I am willing to try your approach, perhaps it will serve as a model for large developments in the future? Maybe you could open the dev-add-ancestry
PR right away, that way we have a good place to document the approach (in the description of the PR), and we know that all the code that appears in that PR has already passed earlier review. If you keep this super PR as a draft, we should be able to keep good track.
PS: Sorry I missed you at EGU. I saw you were around but, alas, too late.
I am willing to try your approach, perhaps it will serve as a model for large developments in the future?
Yes maybe. Although I should note I've never tried it like this either. I just know that massive PRs are basically unworkable and had heard about this sort of approach from others.
Maybe you could open the
dev-add-ancestry
PR right away, that way we have a good place to document the approach (in the description of the PR), and we know that all the code that appears in that PR has already passed earlier review. If you keep this super PR as a draft, we should be able to keep good track.
Done now: #1611
PS: Sorry I missed you at EGU. I saw you were around but, alas, too late.
No worries, will any of the team be at the ESM2025 meeting?
Do you want the dev-add-ancestry
branch to be in the main repository or in my fork? (If in the main repository, I'll close #1610 and make a new PR)
No worries, will any of the team be at the ESM2025 meeting?
Yes! I will be there and I think also @schlunma and @hb326.
Do you want the
dev-add-ancestry
branch to be in the main repository or in my fork? (If in the main repository, I'll close #1610 and make a new PR)
As long as the CI is working, I think the fork is fine.
As long as the CI is working, I think the fork is fine.
Ok so it looks like CI doesn't run in forks https://github.com/znicholls/ESMValCore/pull/1 (my understanding this is something that can be changed but not sure that's best practice). Can you create a dev-add-ancestry
in the ESMValGroup
fork so I can make PRs into it and have the CI run or do you want to go a different route?
Yes! I will be there and I think also @schlunma and @hb326.
Awesome
No, and we don't really want to change that - CI running in forks, that is, because that'd mean a lot of ownership/credentials headaches. What's ESM2025 - might go there meself 😁
No, and we don't really want to change that - CI running in forks, that is, because that'd mean a lot of ownership/credentials headaches
Very wise
Can you create a
dev-add-ancestry
in theESMValGroup
fork so I can make PRs into it and have the CI run or do you want to go a different route?
So can a branch be created?
No, and we don't really want to change that - CI running in forks, that is, because that'd mean a lot of ownership/credentials headaches. What's ESM2025 - might go there meself grin
Seems to be working fine for every other big open source project. Oh, you probably mean giving every contributor commit rights to the main repo is creating headaches. Yes, that seems to be the consensus. Anyways, that discussion is off topic here, I guess.
@znicholls, are you already a member of the ESMValGroup? If yes, you should be able to create the branch in the main repo. If not, let us know and we can add you.
Seems to be working fine for every other big open source project
My understanding was that if you create a PR to a branch in a fork it doesn't trigger CI for credentials reasons (but maybe let's not disappear down that rabbit hole here)
are you already a member of the ESMValGroup?
Ah yes I am, I did not realise. Thanks
Yeah that, Klaus! Sorry - on holidays so I was too lazy to write more stuff 🤣 Cheers muchly @znicholls and @zklaus for taking good care of this btw, guys 🍺
Alrighty all setup in #1611
Is your feature request related to a problem? Please describe.
Related to #1161, I have a question about esmvaltool's capacity to link back through a file's 'ancestry'. I've done work on this in another package (gitlab.com/znicholls/netcdf-scm, paper here https://rmets.onlinelibrary.wiley.com/doi/10.1002/gdj3.113) because esmvaltool wasn't mature enough at that stage to be a viable solution. Now it looks awesome (congrats) so I was wondering if porting some of the functionality in would be helpful.
To keep things simple, let's here just focus on traversing a file's ancestry. For example, I have a G6solar file, whose parent is ssp585, grandparent is historical and great-grandparent is piControl. In some cases, all the files will have the same ensemble member id (e.g. r1i1p1f1) but sometimes the parent might be from a different ensemble member (e.g. G6solar is r3i1p1f1 and ssp585 is r1i1p1f1).
EDIT: answered in docs (my bad)
First question: Is esmvaltool currently able to traverse such a history and create a continuous timeseries? My impression is that it can join experiments with the same ensemble member using a syntax like- {dataset: UK-ESM1-LL, project: CMIP6 exp: [historical, ssp585, G6solar], ensemble: r1i1p1f1}
but I'm guessing you can't do- {dataset: UK-ESM1-LL, project: CMIP6 exp: [historical, ssp585, G6solar], ensemble: [r1i1p1f1, r1i1p1f1, r3i1p1f1]}
?Second question: Is esmvaltool able to handle experiment overlap appropriately? For example, you need to use ssp585 to bridge the gap between the end of historical (2014) and the start of G6solar (2019 from memory) but a straight time concatenation doesn't work because you end up with both ssp585 and G6solar data (from e.g. 2019 to 2100).
I'm not sure what the best way to implement this would be. If I were doing it now, I would try something like:
Would you be able to help out?
Yes (the motivation for me is that it would be great to port this functionality into a maintained project that has all the other benefits of esmvaltool and its community rather than leaving them in a drifting package with a maintenance team of 2).