File naming convention for SNAP data

charliepascoe commented 3 years ago

Suggest we use a similar convention to ccmi-2022: <variable_id>_<table_id>_<source_id>_<experiment_id>_<variant_label>_<grid_label>[_<time_range>].nc, e.g. zmo3_monthly_HadGEM3-ES_refC1_r1i1p1_"gridLabel"_196001-196810.nc

aph42 commented 3 years ago

The protocol calls for four types of ensemble forecasts (free, control, nudged, and full; these differ with respect to what kind of nudging is imposed on the ensemble). We're asking for each of these ensembles to be initialized on 6 different dates. We could make each initialization date a different experiment, but I think it will be more flexible and understandable if we add the initialization date to the filename; something like

<variable_id>_<table_id>_<source_id>_<experiment_id>_<initialization_date>_<variant_label>_<grid_label>[_<time_range>].nc

On the other hand the runs will only be 45 days long, and ideally there will be only one file per variable per ensemble member. So <time_range> will need to be at least yyyymmdd; part of me wonders if it could even be omitted.

We are also targeting large ensembles, e.g. 50 members (we may receive 100 from some centers). I believe that there will be different strategies for initialization; in particular, some centers might initialize some ensemble members at slightly different times. I don't think that any centers are varying model physics, but I am not 100% sure of that. With regard to the variant_label field, I'm not sure whether all centers would use 'i' or 'r', for instance, to label ensemble members. Should we leave it up to the modeling centers to decide which is appropriate?

leknifrg commented 3 years ago

To some degree I don't see a substantive difference between Charlotte's and Peter's suggestions. We can have 24 different possible experiment IDs, corresponding to each of the 4 planned variations of nudging, 3 different SSW events, and 2 different initialization dates for each event. In this case an experiment ID would be something like NH2018_FREE_Jan25.
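To make the count of 24 concrete, here is a rough Python sketch that enumerates such IDs. The event names, the pairing of dates with events, and the exact label format are assumptions for illustration only, not an agreed list.

```python
# Hypothetical enumeration of the 4 x 3 x 2 = 24 combined experiment IDs.
# Event names and the date labels attached to each event are assumptions.
NUDGING = ["FREE", "CONTROL", "NUDGED", "FULL"]

EVENT_DATES = {
    "NH2018": ["Jan25", "Feb08"],
    "NH2019": ["Dec13", "Jan08"],
    "SH2019": ["Aug29", "Oct01"],
}

experiment_ids = [
    f"{event}_{nudging}_{init}"
    for nudging in NUDGING
    for event, inits in EVENT_DATES.items()
    for init in inits
]

print(len(experiment_ids))
```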

Peter's suggestion is that initialization date be its own item, but the net effect on the filename is identical, right? Chaim

aph42 commented 3 years ago

You're right; it doesn't make a difference to the filenames really.

Where it makes some difference is in the specification of the experiment ids in the CV.json table; I'm still piecing together how all of this is supposed to work, but I think my suggestion is to define our experiment_ids as FREE, CONTROL, NUDGE, and FULL. I think we could use the sub_experiment_ids to specify the initialization dates

(perhaps consistent with, e.g., slide 7 here)

So perhaps it should be

<variable_id>_<table_id>_<source_id>_<experiment_id>_<subexperiment_id>_<variant_label>_<grid_label>[_<time_range>].nc

This means the specification of the experiments in SNAPSI_CV.json will end up looking something like

    "sub_experiment_id": {
        "2018-01-25": "Forecasts initialized on or near 2018-01-25",
        "2018-02-08": "Forecasts initialized on or near 2018-02-08",
        "2018-12-13": "Forecasts initialized on or near 2018-12-13",
        "2019-01-08": "Forecasts initialized on or near 2019-01-08",
        "2019-08-29": "Forecasts initialized on or near 2019-08-29",
        "2019-10-01": "Forecasts initialized on or near 2019-10-01"
    },
    "experiment_id": {
        "free": {
            "activity_id": "SNAPSI",
            "experiment": "Free-running ensemble forecast",
            "experiment_id": "free",
            "parent_experiment_id": "no parent",
            "sub_experiment_id": [
                "2018-01-25", "2018-02-08", "2018-12-13", "2019-01-08", "2019-08-29", "2019-10-01"
            ]
        },
        "control": {
            "activity_id": "SNAPSI",
            "experiment": "Ensemble forecast with zonally symmetric component of stratosphere nudged to climatology",
            "experiment_id": "control",
            "parent_experiment_id": "no parent",
            "sub_experiment_id": [
                "2018-01-25", "2018-02-08", "2018-12-13", "2019-01-08", "2019-08-29", "2019-10-01"
            ]
        },
        "nudged": {
            "activity_id": "SNAPSI",
            "experiment": "Ensemble forecast with zonally symmetric component of stratosphere nudged to reanalyzed evolution",
            "experiment_id": "nudged",
            "parent_experiment_id": "no parent",
            "sub_experiment_id": [
                "2018-01-25", "2018-02-08", "2018-12-13", "2019-01-08", "2019-08-29", "2019-10-01"
            ]
        },
        "nudged-full": {
            "activity_id": "SNAPSI",
            "experiment": "Ensemble forecast with stratosphere nudged to reanalyzed evolution",
            "experiment_id": "nudged-full",
            "parent_experiment_id": "no parent",
            "sub_experiment_id": [
                "2018-01-25", "2018-02-08", "2018-12-13", "2019-01-08", "2019-08-29", "2019-10-01"
            ]
        }
    }

which is (arguably) cleaner than having 24 different experiment ids and could potentially allow us to add new initialization dates relatively easily.
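One consequence of this layout is that every date listed under an experiment has to be defined in the top-level sub_experiment_id table as well. A minimal Python sketch of that consistency check follows; the abbreviated CV text and the function name are hypothetical, and only the two keys shown in the fragment above are assumed.

```python
import json

# Abbreviated, hypothetical CV fragment with the same structure as above.
CV_TEXT = """
{
  "sub_experiment_id": {
    "2018-01-25": "Forecasts initialized on or near 2018-01-25",
    "2018-02-08": "Forecasts initialized on or near 2018-02-08"
  },
  "experiment_id": {
    "free": {"sub_experiment_id": ["2018-01-25", "2018-02-08"]},
    "control": {"sub_experiment_id": ["2018-01-25", "2018-02-08"]}
  }
}
"""


def undefined_sub_experiments(cv):
    """Map each experiment_id to the sub-experiments it lists but the CV never defines."""
    defined = set(cv["sub_experiment_id"])
    problems = {}
    for exp_id, exp in cv["experiment_id"].items():
        missing = [s for s in exp["sub_experiment_id"] if s not in defined]
        if missing:
            problems[exp_id] = missing
    return problems


cv = json.loads(CV_TEXT)
print(undefined_sub_experiments(cv))  # empty dict when the CV is consistent
```

Adding a new initialization date then means one new entry in the top-level table plus one new item in each experiment's list.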

leknifrg commented 3 years ago

Yes, Peter, your suggestion looks reasonable. We don't need to specify NH2018, NH2019, SH2019 if we already specify the initialization date.

martinjuckes commented 3 years ago

The CMIP6 approach to ensembles of experiments is to treat them as different sub-experiments, with sub_experiment_id of the form s1970 for a start in 1970 (in CMIP6 they only have to deal with a single start time each year for decadal simulations). The file names used are, e.g., tos_Omon_MPI-ESM1-2-HR_dcppA-hindcast_s1963-r1i1p1f1_gn_196311-197312.nc.

This is slightly different from Peter's suggestion:

<variable_id>_<table_id>_<source_id>_<experiment_id>_<subexperiment_id>-<variant_label>_<grid_label>[_<time_range>].nc

with a hyphen - between subexperiment_id and variant_label rather than underscore. Sticking with the CMIP6 approach will make things easier for us and, I believe, for users. @aph42 : would this be OK for you?

I would also suggest a second variation on the above: using syyyymmdd for the simulation start time in the filename rather than yyyy-mm-dd. This is for consistency with the approach used in the start and end date at the end of files and also consistency with the fact that the hyphen is used in CMIP file names as a separator between sub-elements of identifiers. Using it in the date confuses the syntax of the file name. Keeping the s at the start of the string is clearly redundant -- but that is the way it works for CMIP6 (the idea, if I remember rightly, was to future-proof the file naming convention so that there is redundancy that can be used to accommodate other variations in experiments which may come along in future years).

Finally, while the time_range may be redundant, keeping it in will keep consistency with CMIP which is an advantage for users. It also helps us with file handling.
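As an illustration only, the pattern with both variations applied (hyphen before the variant_label, syyyymmdd start date) could be assembled roughly as follows in Python. The function name is hypothetical, and the values in the example call (table, grid label, variant label, time range) are made up rather than taken from any agreed table.

```python
from datetime import date


def snap_filename(variable_id, table_id, source_id, experiment_id,
                  start_date, variant_label, grid_label, time_range=None):
    """Assemble a file name following the CMIP6-style pattern discussed above."""
    # Start date as syyyymmdd, joined to the variant label with a hyphen.
    member = f"s{start_date:%Y%m%d}-{variant_label}"
    parts = [variable_id, table_id, source_id, experiment_id, member, grid_label]
    # time_range is optional in the pattern: [_<time_range>]
    if time_range is not None:
        parts.append(time_range)
    return "_".join(parts) + ".nc"


# Example call with made-up values.
example = snap_filename("zmo3", "monthly", "HadGEM3-ES", "free",
                        date(2018, 1, 25), "r1i1p1f1", "gn",
                        "20180125-20180310")
print(example)
```

Because the underscore is the separator, none of the individual elements may contain one; hyphens inside an element (as in HadGEM3-ES) are fine.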

On the variant_label: the r values should be varied for multiple simulations at the same start time and same initialisation method; the i value should be varied for different initialisation methods. If the only variation is the start time, keep these fixed.

leknifrg commented 3 years ago

@martinjuckes Martin - All of your suggestions look reasonable to me, but I would appreciate your feedback regarding the last one.

We anticipate three different types of ensembles being submitted: (1) slightly different parameterizations, (2) slightly different initialization times, and (3) differences in initial conditions as specified by the models' data assimilation systems. Below is my guess at how we should request things from the modeling centers.

CMIP6 supports four options, r1i1p1f1, with

r: realisation (i.e. ensemble member): (2 - slightly different initialization times) goes here

i: initialisation method: (3 - changes in initial conditions due to DA) goes here

p: physics: (1 - slightly different parameterizations) goes here

f: forcing: unused as of now?

aph42 commented 3 years ago

@martinjuckes Martin: Yes, I am happy to take on board your suggestions to make our conventions more consistent. I'll update the names of the sub experiment ids.

martinjuckes commented 3 years ago

@leknifrg : sorry for the long delay in responding to your 22nd April comment.

Slightly different initialisation times: this depends on what you mean by "slightly". If you mean that the simulations all start with date and time at, say, 2021-06-01T12:00:00 but use initialisation data from slightly different times, then yes, this would be considered as an approach to generating random variations labelled by different r values. If you are changing the actual start time of the simulation, then it should be represented in the subexperiment_id element.

I agree with your interpretation of i and p.
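For anyone checking submitted files against this, a small Python sketch of splitting a variant_label into its four indices is given below. It assumes the full CMIP6 form r<N>i<N>p<N>f<N>; the class and function names are hypothetical.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    realization: int     # r: ensemble member
    initialization: int  # i: initialisation method
    physics: int         # p: physics version
    forcing: int         # f: forcing


# Assumes all four indices are present, as in r1i1p1f1.
_VARIANT_RE = re.compile(r"r(\d+)i(\d+)p(\d+)f(\d+)")


def parse_variant_label(label):
    """Split a label such as 'r12i2p1f1' into its integer indices."""
    match = _VARIANT_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"not a valid variant_label: {label!r}")
    return Variant(*(int(group) for group in match.groups()))


print(parse_variant_label("r12i2p1f1"))
```

A label without the f index (for example r1i1p1) is rejected by this sketch; whether such labels should be accepted is a separate decision.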
