Tracing a file's ancestry

znicholls commented 2 years ago

Is your feature request related to a problem? Please describe.

Related to #1161, I have a question about esmvaltool's capacity to link back through a file's 'ancestry'. I've done work on this in another package (gitlab.com/znicholls/netcdf-scm, paper here https://rmets.onlinelibrary.wiley.com/doi/10.1002/gdj3.113) because esmvaltool wasn't mature enough at that stage to be a viable solution. Now it looks awesome (congrats) so I was wondering if porting some of the functionality in would be helpful.

To keep things simple, let's here just focus on traversing a file's ancestry. For example, I have a G6solar file, whose parent is ssp585, grandparent is historical and great-grandparent is piControl. In some cases, all the files will have the same ensemble member id (e.g. r1i1p1f1) but sometimes the parent might be from a different ensemble member (e.g. G6solar is r3i1p1f1 and ssp585 is r1i1p1f1).

EDIT: answered in docs (my bad) First question: Is esmvaltool currently able to traverse such a history and create a continuous timeseries? My impression is that it can join experiments with the same ensemble member using a syntax like - {dataset: UK-ESM1-LL, project: CMIP6 exp: [historical, ssp585, G6solar], ensemble: r1i1p1f1} but I'm guessing you can't do - {dataset: UK-ESM1-LL, project: CMIP6 exp: [historical, ssp585, G6solar], ensemble: [r1i1p1f1, r1i1p1f1, r3i1p1f1]} ?

Second question: Is esmvaltool able to handle experiment overlap appropriately? For example, you need to use ssp585 to bridge the gap between the end of historical (2014) and the start of G6solar (2019 from memory) but a straight time concatenation doesn't work because you end up with both ssp585 and G6solar data (from e.g. 2019 to 2100).

I'm not sure what the best way to implement this would be. If I were doing it now, I would try something like:

define my files of interest in a recipe
download them using esmvaltool
work out their parents using a custom script, writing a new recipe in the process
download the parents using esmvaltool
work out grandparents using a custom script, writing a new recipe in the process
download grandparents using esmvaltool
etc. until I get back to piControl
write a recipe based on all the child datasets with a diagnostic that joins them to their ancestry appropriately

Would you be able to help out?

Yes (the motivation for me is that it would be great to port this functionality into a maintained project that has all the other benefits of esmvaltool and its community rather than leaving them in a drifting package with a maintenance team of 2).

znicholls commented 2 years ago

fyi @lewisjared

znicholls commented 2 years ago

First question:

Ignore my first question, I've just read the docs properly

znicholls commented 2 years ago

Re the second question, I just tried the recipe below:

Test recipe

```yaml # ESMValTool # recipe_concatenate_exps.yml --- documentation: description: | Concatenate time series of different experiments at the preprocessor level. title: Recipe to test time concatenation in preprocesor. authors: - schlund_manuel maintainer: - schlund_manuel datasets: - {dataset: CESM2-WACCM, grid: gn, ensemble: r1i1p1f1, activity: [GeoMIP, ScenarioMIP, CMIP]} # - {dataset: CNRM-ESM2-1} # - {dataset: IPSL-CM6A-LR} # - {dataset: MPI-ESM1-2-LR} # - {dataset: UKESM1-0-LL} diagnostics: diag_concatenate_exps: description: Concatenate historical, SSP5-8.5 and G6solar data variables: tas: project: CMIP6 mip: Amon start_year: 1850 exp: [historical, ssp585, G6solar] end_year: 2100 scripts: null ```

It gives the following output, which I think shows that the concatenation doesn't work unfortunately because of the overlap between experiments.

Test recipe output once plotted (annual-mean)

I think the issue is in https://github.com/ESMValGroup/ESMValCore/blob/5d5ddddd633024dee741f54768d44d32023f3a73/esmvalcore/preprocessor/_io.py#L183 and its interaction with https://github.com/ESMValGroup/ESMValCore/blob/5d5ddddd633024dee741f54768d44d32023f3a73/esmvalcore/preprocessor/_io.py#L388

The result is that everything is joint starting from the oldest experiment (i.e. historical) and always picking the longer cube so we end up with ssp585 from 2019-2068 instead of G6solar (because the G6solar experiment comes in two files whereas ssp585 is all in one so the longer ssp585 file gets picked instead of the first, shorter G6solar one). To fix it in a robust way I think you'd have to specify that G6solar was the experiment you were actually interested in and use the branch points to pick the appropriate subset of the parent experiments as you walked back up the ancestry rather than the current concatenation algorithm. I'm not sure how such a fix is best implemented though (new pre-processor?).

zklaus commented 2 years ago

We discussed some aspects of this recently in #1423. Basically, I agree with you. It would be great to have greater flexibility in specifying how a concatenation operation should deal with overlaps. Adding your example is a nice extension to the use-cases that were considered before.

Perhaps we can approach this best by discussing a nice interface that could work both in recipe yaml files and in the pure Python API? Considering that we expect more than one overlap (e.g. picontrol + historical + scenario) do you think it's enough to specify one priority? Perhaps an ordered list of datasets by priority would be good?

PS: If general enough, we might feed that back upstream to Iris.

znicholls commented 2 years ago

My suggestion would be to have a function that looked something like the options below

def join_to_parents(in_file):
    # 1. Load/download the file
    # 2. Find parent
    # 3. Load/download parent file, throwing away all data after the branch point
    # 4. Find parent of parent
    # 5. repeat until some stop condition is hit (e.g. parent is piControl or piControl-spinup)
    # 6. Return concatenation in time (now with no overlap as we throw away data after branch points)

    # Pros:
    # - Only needs one file as input, everything flows from there
    # - Hierarchy is well-defined (only keep parent files from before the branch point)

    # Cons:
    # - Requires downloading file in order to work out what is needed so might not play nice with 
    #    ESMValTool's general idea of defining all datasets to be used up front and then only downloading
    #    once (or at least I think that's ESMValTool's general idea, maybe I misunderstand)

def join_to_parents(child, ancestry):
    """
    Parameters
    ------------
    child : str
        Path to child data

    ancestry : list[str]
        List of files that make up ``child``'s ancestry. Order doesn't matter.
    """
    # 1. Load child
    # 2. Find parent
    # 3. Load parent from ancestry, keeping only data from before branch point
    # 4. Find parent of parent
    # 5. repeat until some stop condition is hit (e.g. parent is piControl or piControl-spinup)
    # 6. Return concatenation in time (now with no overlap as we throw away data after branch points)

    # Pros:
    # - Can be done as a pre-processing step after all files are downloaded in more general workflow
    # - Hierarchy is well-defined (only keep parent files from before the branch point)

    # Cons:
    # - Requires knowing what files make up the ancestry in advance (could of course be automated)

valeriupredoi commented 2 years ago

just wanted to drop a line and thank @znicholls for this idea and prospective implementation, it sounds great to me! :beer:

zklaus commented 2 years ago

Ok, I see where you are going. Given that we generally have a list of all the datasets, how about simply joining all cubes in a list by identifying the potential parent and performing the join if the parent is part of the list?

znicholls commented 2 years ago

Ok great. Last question before I prototype: should this be a pre-processing option (would require users to somehow specify what the child dataset is in their recipe) or is it better as a post-processing utility script for use in diagnostics? I'm leaning to the latter because it seems much simpler. Perhaps once we have the latter, one of you can see how to make it useful in pre-processing too...

zklaus commented 2 years ago

My intuition is to put it into the loading parts that we already have, but if more information by the user is needed, that may not be feasible.

znicholls commented 2 years ago

Ok great thanks for the feedback

zklaus commented 2 years ago

In any case, don't let it stop you. I am sure once the code is in place we can move it to the best location. Good luck and thanks for giving it a shot! :+1:

znicholls commented 2 years ago

@zklaus and @valeriupredoi I've made a prototype locally. It is about 700 lines altogether (but will hopefully be shorter once cleaned up). I obviously don't want to make a 700 line PR (plus tests) so I will try and break it down. My thinking was that I would tackle it in the following way:

I'll create a new branch in my fork e.g. dev-add-ancestry. That branch can be where all the PRs get added to
The first PR will just outline the high-level functionality with documentation only (see #1604, I need to change the target branch but will do so once I've confirmed the right approach)
The next PR will outline key tests
The next PR then adds the implementation
Once all PRs are merged, the branch dev-add-ancestry will have a couple of hundred lines of new code compared to main. This could then all be merged into main at once, comfortable in the knowledge that it has all been added and reviewed appropriately along the way.

An alternative would be to merge all the PRs into main as we go. I assume that's not a desirable route as you'd then have a bunch of stuff with NotImplementedError floating around or have to deal with a massive PR, neither of which seem desirable.

zklaus commented 2 years ago

Thanks for your thorough approach @znicholls. I don't think we've had such a well-thought out approach yet. We did have some massive PRs, however, and those tend to create delays and headaches. I am willing to try your approach, perhaps it will serve as a model for large developments in the future? Maybe you could open the dev-add-ancestry PR right away, that way we have a good place to document the approach (in the description of the PR), and we know that all the code that appears in that PR has already passed earlier review. If you keep this super PR as a draft, we should be able to keep good track.

PS: Sorry I missed you at EGU. I saw you were around but, alas, too late.

znicholls commented 2 years ago

I am willing to try your approach, perhaps it will serve as a model for large developments in the future?

Yes maybe. Although I should note I've never tried it like this either. I just know that massive PRs are basically unworkable and had heard about this sort of approach from others.

Maybe you could open the dev-add-ancestry PR right away, that way we have a good place to document the approach (in the description of the PR), and we know that all the code that appears in that PR has already passed earlier review. If you keep this super PR as a draft, we should be able to keep good track.

Done now: #1611

PS: Sorry I missed you at EGU. I saw you were around but, alas, too late.

No worries, will any of the team be at the ESM2025 meeting?

znicholls commented 2 years ago

Do you want the dev-add-ancestry branch to be in the main repository or in my fork? (If in the main repository, I'll close #1610 and make a new PR)

zklaus commented 2 years ago

No worries, will any of the team be at the ESM2025 meeting?

Yes! I will be there and I think also @schlunma and @hb326.

Do you want the dev-add-ancestry branch to be in the main repository or in my fork? (If in the main repository, I'll close #1610 and make a new PR)

As long as the CI is working, I think the fork is fine.

znicholls commented 2 years ago

As long as the CI is working, I think the fork is fine.

Ok so it looks like CI doesn't run in forks https://github.com/znicholls/ESMValCore/pull/1 (my understanding this is something that can be changed but not sure that's best practice). Can you create a dev-add-ancestry in the ESMValGroup fork so I can make PRs into it and have the CI run or do you want to go a different route?

znicholls commented 2 years ago

Yes! I will be there and I think also @schlunma and @hb326.

Awesome

valeriupredoi commented 2 years ago

No, and we don't really want to change that - CI running in forks, that is, because that'd mean a lot of ownership/credentials headaches. What's ESM2025 - might go there meself 😁

znicholls commented 2 years ago

No, and we don't really want to change that - CI running in forks, that is, because that'd mean a lot of ownership/credentials headaches

Very wise

znicholls commented 2 years ago

Can you create a dev-add-ancestry in the ESMValGroup fork so I can make PRs into it and have the CI run or do you want to go a different route?

So can a branch be created?

zklaus commented 2 years ago

No, and we don't really want to change that - CI running in forks, that is, because that'd mean a lot of ownership/credentials headaches. What's ESM2025 - might go there meself grin

Seems to be working fine for every other big open source project. Oh, you probably mean giving every contributor commit rights to the main repo is creating headaches. Yes, that seems to be the consensus. Anyways, that discussion is off topic here, I guess.

@znicholls, are you already a member of the ESMValGroup? If yes, you should be able to create the branch in the main repo. If not, let us know and we can add you.

znicholls commented 2 years ago

Seems to be working fine for every other big open source project

My understanding was that if you create a PR to a branch in a fork it doesn't trigger CI for credentials reasons (but maybe let's not disappear down that rabbit hole here)

are you already a member of the ESMValGroup?

Ah yes I am, I did not realise. Thanks

valeriupredoi commented 2 years ago

Yeah that, Klaus! Sorry - on holidays so I was too lazy to write more stuff 🤣 Cheers muchly @znicholls and @zklaus for taking good care of this btw, guys 🍺

znicholls commented 2 years ago

Alrighty all setup in #1611

ESMValGroup / ESMValCore

Tracing a file's ancestry #1572