PCMDI / cmor

Climate Model Output Rewriter
BSD 3-Clause "New" or "Revised" License

enhancement for forecast data #623

Closed taylor13 closed 2 years ago

taylor13 commented 3 years ago

The decadal prediction community have proposed an extension of CMIP6 standards to accommodate additional metadata when forecasts are stored by CMOR. Their proposal can be found at https://docs.google.com/document/d/1T-Xlkc07kzDbtyFp-JJiRQYmHO-OzFOgXjCvFGW5M7Y/edit .

From a brief look at that document, it appears that it will be a non-trivial enhancement that might include:

  1. Currently CMOR attaches a "coordinates" attribute to a variable for the following cases:
    • scalar coordinates (e.g. variable tas has a scalar coordinate "height" where the height of the surface air temperature is stored)
    • coordinates when text string labels are needed (e.g., for variables that are a function of some index, like the index used to distinguish between different ocean basins. In this case the ocean basin name is stored in an auxiliary coordinate associated with the variable through the "coordinates" attribute.)
    • latitude and longitude values for grids other than cartesian latitude x longitude grids.

(See https://goo.gl/neswPr for further description of the CMIP6 requirements.) For forecast data the "coordinates" attribute should include "reftime" and "leadtime" in its list. The "reftime" is a scalar coordinate and can probably be treated by CMOR just like the other scalar coordinates, except that it probably should not have a default value specified in the CMOR table. Also, CMOR may currently be limited to handling a single scalar coordinate for a given variable (not sure about this), so this might be an extension.

Storing the "leadtime" will likely require more work. This auxiliary coordinate variable must be the same size (length) as "time" and contains the time elapsed since the start of the forecast measured in days. It is equivalent to "time" except for the reference time is specified in the time coordinate's units attribute, whereas the reference time for "leadtime" is stored in "reftime". These two reference times will generally be different, so the "leadtime" will be offset from "time" by a constant amount. CMOR could generate the "leadtime" values, given the "time" coordinate values, by simply subtracting from "time" the difference between "reftime" and the reference time stored in the units attribute of "time". (Given this close and known relationship between leadtime and time, I wonder why "leadtime" is needed; wouldn't "reftime" be sufficient?)

  2. The format for "sub_experiment" has been extended to include the "day" as well as the year and month when the experiment was initiated. This seems like a minor change that should be easy to accommodate.

  3. There are additional global attributes called for that include free text supplied by the user. I think CMOR can already handle these.
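
To make item 1 concrete, here is a minimal arithmetic sketch of the subtraction described above; the variable names and values are made up for illustration and this is not CMOR code.

    import numpy as np
    import cftime

    time_units = "days since 1850-01-01"
    calendar = "gregorian"

    # "time" coordinate values as stored in the file (days since 1850-01-01)
    time_vals = np.array([54802.5, 54833.0, 54862.5])

    # the forecast reference time ("reftime"), here 2000-01-01, expressed in the
    # same units as the time coordinate
    reftime = cftime.date2num(cftime.DatetimeGregorian(2000, 1, 1),
                              time_units, calendar=calendar)

    # leadtime = time minus the offset between "reftime" and the basetime of
    # "time", i.e. days elapsed since the start of the forecast
    leadtime = time_vals - reftime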

I'll try to get Paco to look this over and see if I've covered the requirements. Also, perhaps he can explain the rationale for including "leadtime".

martinjuckes commented 3 years ago

Hello @taylor13 : thanks for raising this and mapping the requirements accurately into the CMOR framework. This looks good. It may be worth adding that the "reftime" coordinate should have standard_name=forecast_reference_time and "leadtime" should have "standard_name=forecast_period".

Can we make it a requirement that "reftime" has the same units string as the time coordinate, and also that the units of the "leadtime" should be the same apart from the absence of the time datum (i.e. there is no "since ..." in leadtime.units)?

I think "reftime" also needs a calendar, which should match the time coordinate.

The requirement for "leadtime" comes, I believe, from the need to work with tools which can aggregate data across multiple files based on coordinates that are explicitly represented, as "leadtime" will be, but which are not currently able to work with a "leadtime" that is defined implicitly. It occurs to me that this redundancy is avoided in vertical coordinates by using formula_terms rather than explicitly recording coordinates which can be easily computed. We don't currently have that option for "leadtime".

durack1 commented 3 years ago

It will be useful to engage @matthew-mizielinski, @mauzey1, and @piotrflorek on this issue.

matthew-mizielinski commented 3 years ago

Hi All,

After a little thought I think I understand this as a step towards datasets containing a single variable for an ensemble of initialised simulations with different start dates.

If we consider the case where we have a decadal forecasting ensemble with members started every year from 2000 to 2005, each running for 10 years, then we could imagine a single file with an ensemble dimension covering a set period, say 2005-2010, at which point the resulting files would look something like the following (ignoring bounds variables and global attributes):

netcdf tas_Amon_.... {
dimensions:
    time = UNLIMITED ; // (60 currently)
    lat = X ;
    lon = Y ;
    bnds = 2 ;
    realization = 6 ;
variables:
    double time(time) ;
        time:calendar = "gregorian" ;
        time:standard_name = "time" ;
        time:units = "days since 1850-01-01" ;
        ...
    double leadtime(time, realization);
        leadtime:standard_name = "forecast_period" ;
        leadtime:calendar = "gregorian" ;
        leadtime:units = "days since forecast_reference_time" ;
        ...
    double reftime(realization);
        reftime:standard_name = "forecast_reference_time" ;
        reftime:calendar = "gregorian" ;
        reftime:units = "days since 1850-01-01" ;
        ...
    int realization(realization) ;
        realization:standard_name = "realization" ;
        realization:units = 1 ;
        ...
    double height ;
        ...
    double lon(lon) ;
        ...
    double lat(lat) ;
        ...
    float tas(time, lat, lon, realization) ;
        tas:standard_name = "air_temperature" ;
        tas:units = "K" ;
        tas:coordinates = "height reftime leadtime" ;
        ...
}

In this case the reftime would contain the equivalent of [2000-01-01, 2001-01-01, ..., 2005-01-01] and the first time record of leadtime would be something like [5*365, 4*365, 3*365, 2*365, 365, 0] (ignoring leap days). The leadtime variable is a simple function of reftime and time, as @martinjuckes alludes to when referring to the formula_terms used in vertical coordinates such as hybrid height, but I believe it is crucial for work that is more closely aligned with initialised forecasts than with climate.
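
For illustration, the 2-D leadtime in the sketch above could be derived from the 1-D time and per-member reftime values along these lines (nominal numbers, ignoring leap days and exact month lengths):

    import numpy as np

    # nominal values in "days since 1850-01-01"
    reftime_vals = 54787.0 + 365.0 * np.arange(6)                    # starts 2000-01-01 ... 2005-01-01
    time_vals = reftime_vals[-1] + 15.5 + 30.4375 * np.arange(60)    # 60 monthly mid-points from 2005

    # leadtime(time, realization): days elapsed since each member's start date
    leadtime = time_vals[:, None] - reftime_vals[None, :]

    print(leadtime[0])   # roughly [5*365, 4*365, 3*365, 2*365, 365, 0], plus half a month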

Adding reftime and leadtime to the tas:coordinates attribute then makes clear that these are extra information, rather than being associated with a particular axis of the data variable.

It would also be possible to construct examples where there is a significant amount of missing data, e.g. if the file in question covered 2000-2005 with the same start dates for the realizations, then around half of the data variable would be missing data and the corresponding leadtime values would be negative (or should these be masked?). Some thought on appropriate chunking may be needed to ensure reasonable performance.

If I've interpreted this correctly, then the document that @taylor13 has linked to above is not recommending the above structure, but a step towards it, with only a single ensemble member per file; as a result the realization dimension gets squeezed out.

I think an argument could be made for retaining the realization as a dimension if only to allow CMOR to support the general case.

martinjuckes commented 3 years ago

@matthew-mizielinski : regarding your last two paragraphs: yes, this is a recommendation for including additional CF metadata about forecast_reference_time and forecast_period without departing from the established CMIP practice of having data from a single simulation in each data file. The decadal forecast community who are asking for this are used to working with data from multiple start times in a single file, but they will be able to adapt easily to the proposed approach if the information requested above is added (e.g. it will be possible to combine data with different start times into a single file using a simple CDO command; Pierre-Antoine Bretoniere, the second author, has tested this). Leon Hermanson from UKMO has also been involved in the discussions (there is a list of contributors in section 4).

taylor13 commented 3 years ago

In that case, I think we shouldn't try to extend CMOR to handle multiple simulations (start times) in a single file; that would require major changes.

matthew-mizielinski commented 3 years ago

Thanks @martinjuckes, I've had a chat with Leon about this (this was the initial motivation for us getting more involved in CMOR). Given that there is a tool for ensemble combination, and @taylor13's suggestion that major changes would be needed, we'll start with the single ensemble member case.

@taylor13, @durack1, if you are happy with the following I'll leave @piotr-florek-mohc and @mauzey1 to start discussing API details. We need to add the following functionality:

durack1 commented 3 years ago

@matthew-mizielinski an additional time axis is a fairly major change, but it will certainly aid the use of CMOR for forecast data production.

It would be nice to build out the CI testing as this work begins; I have a feeling that unless we have pretty good code coverage, some quirks could be introduced. I am sure that @taylor13 will have some insights.

taylor13 commented 3 years ago

@martinjuckes (cc @matthew-mizielinski @piotr-florek-mohc ) Could you please check with the Copernicus folks and ask them to clarify what their requirements are? Will any time-mean forecast fields be archived, or only synoptic fields? If they plan to archive time-mean forecast fields (or accumulated "sum" fields), should there be bounds on the leadtime auxiliary coordinate associated with these time means, or can the bounds be omitted (n.b. the bounds will be saved as part of the actual "time" coordinate)? Also, should the "leadtime" coordinate record the beginning, middle, or end of each time-averaging period? [Note: for CMIP, all time-mean fields have time-coordinate values that record the middle of a time-averaging period.]

taylor13 commented 3 years ago

I apologize that there may be some delay before I can check this out in detail. I also apologize that you have had to figure out how to do this without much background from the original code writers. Perhaps it is not too late to be helpful in that regard.

CMOR's purpose is to facilitate writing model output conforming to a project's data specifications. It is most useful for community modeling activities, similar to CMIP, where multiple groups wish to share model output that can be read in a common way. The idea is that many of the descriptive elements that comprise much of the CMIP metadata are the same for all models and can therefore be recorded in tables that CMOR can access. The user needs to tell CMOR which entry in the table is relevant, but otherwise should only need to provide CMOR with the actual model data being written and certain model-specific information (e.g., model name, model grid information, model calendar). By relying on tables provided by CMIP, data writers are prevented from introducing errors regarding, for example, the description of variables (standard_name, cell_methods, etc.) and certain global attributes. CMOR is also meant to check that the information the user must supply follows certain conformance rules (e.g., is of the correct data type and structure, is ordered correctly, is self-consistent, etc.). Finally, CMOR can transform input data in various ways to make it conform to the data specs (e.g., scale data to make it consistent with the requested units, restructure data so that the domain spanned and the direction of the coordinates are consistent with requirements, modify (if necessary in the case of time-mean data) the time coordinate values to conform to the CMIP rule that they should be at the middle of the averaging periods, etc.).
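
For readers less familiar with CMOR, the workflow described above looks roughly like this with the Python API (a hedged sketch: table and input-file names are placeholders and the data are made up):

    import numpy as np
    import cmor

    cmor.setup(inpath="Tables", netcdf_file_action=cmor.CMOR_REPLACE)
    cmor.dataset_json("cmor_input_example.json")   # experiment/source metadata supplied by the user
    cmor.load_table("CMIP6_Amon.json")             # project-provided variable table

    lat = np.arange(-88.75, 90.0, 2.5)
    lon = np.arange(1.25, 360.0, 2.5)
    lat_bnds = np.column_stack([lat - 1.25, lat + 1.25])
    lon_bnds = np.column_stack([lon - 1.25, lon + 1.25])
    time_vals = np.array([15.5])                   # one monthly mean
    time_bnds = np.array([[0.0, 31.0]])

    ilat = cmor.axis("latitude", units="degrees_north", coord_vals=lat, cell_bounds=lat_bnds)
    ilon = cmor.axis("longitude", units="degrees_east", coord_vals=lon, cell_bounds=lon_bnds)
    itime = cmor.axis("time", units="days since 2000-01-01", coord_vals=time_vals, cell_bounds=time_bnds)

    varid = cmor.variable("tas", units="K", axis_ids=[itime, ilat, ilon])
    data = np.full((1, lat.size, lon.size), 288.0, dtype=np.float32)

    cmor.write(varid, data)    # CMOR checks, transforms, and writes the conforming file
    cmor.close()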

In its original release, CMOR performed its checks only before writing the data and informed the user when it encountered any problems/errors. In later releases, a CMOR conformance checker was built that could perform some (but not all) of CMOR checks on data that had been written without CMOR. I am less familiar with the "checker" than I am with CMOR (original).

As we extend CMOR to handle forecast data, we could simply enable users to add the additional "reftime" global attribute and the "leadtime" time coordinate (with appropriate variable attributes). As I understand it, this has now been accomplished (although perhaps not fully tested yet). If that is all that has been done, that may be sufficient, but I can envision CMOR being more helpful both to data writers and to the projects they contribute to if CMOR were more ambitious. [Again, I have not yet looked in detail at the changes already implemented, so some of the following may already be implemented.] Resources permitting, ideally:

  1. For each entry in a CMOR variable table, a flag would be included indicating whether that variable represents a forecast (e.g. "forecast=[n]", where [n] is an integer set to "0" if not a forecast variable (for all current CMIP data forecast=0), "1" if a forecast for a single lead time should be stored, and, for a possible future extension of CMOR, "2" if the file contains forecasts for multiple lead times).
  2. If a variable has been identified as a forecast, the user would be required to define the "reftime" and its units. [I think we should require its units to be the same as the units for the time coordinate in the file, in which case CMOR could obtain the units for reftime from the time coordinate.]
  3. CMOR would automatically generate all the supplemental information requested by Copernicus for forecast data, including the vector of leadtimes.

To be more specific (a rough sketch of such table entries follows this list):

  1. For every variable in a CMOR variable table, a descriptor would be included (by the project using the table) specifying whether or not the variable is considered a forecast. (For backwards compatibility, CMOR would assume that if this entry were missing, then the data being processed must not be a forecast.) For forecast variables, the "dimensions" descriptor for the variable would include the CMOR coordinate table entry i.d.'s for "reftime" and "leadtime". For example, if the entry i.d. for reftime were "reftime1" and the entry i.d. for leadtime were "leadtime2", then both reftime1 and leadtime2 would appear in the list of "dimensions" given for that variable.
  2. If a project includes "forecast variables", then in the CMOR coordinate variable table the project would include one or more entries for "leadtime". Two entries would be required, for example, if some of the forecast variables were "snapshots" (synoptic fields) and others were time-mean fields. (Both would have the same variable name in the output file ("leadtime"), but to distinguish them within CMOR one might have a table entry i.d. of "leadtime1" and the other "leadtime2". The difference between the two is that for a synoptic field the must_have_bnds designator would be "no", whereas for a time-mean field must_have_bnds="yes".)
  3. If a project includes "forecast variables", then in the CMOR coordinate variable table the project would include a single entry for "reftime".
  4. If for a variable "forecast = 1":
    a. the user would be required to define "reftime" (perhaps referenced under the variable table entry i.d. "reftime1")
    b. CMOR would include reftime in the "coordinates" attribute list, and store it in the file under the variable name "reftime"
    c. CMOR would attach the units defined for "time" to "reftime"
    d. CMOR would include with reftime its standard_name and long_name as obtained from the CMOR coordinate table
    e. CMOR would generate and store the vector: leadtime = time - reftime
    f. CMOR would include "leadtime" in the "coordinates" attribute list
    g. CMOR would include as attributes of "leadtime" its standard_name and long_name as obtained from the CMOR coordinate table, and also its units taken from the variable's time coordinate units but omitting the basetime (i.e. the text in the units attribute beginning with "since ...")
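
To illustrate, here is a hypothetical sketch of such table entries, written as Python dicts standing in for the JSON fragments; the key names and entry i.d.'s below are either taken from the discussion above or invented for illustration, and are not necessarily the ones eventually adopted by CMOR.

    # hypothetical variable-table entry for a forecast variable
    variable_entry = {
        "tas": {
            "forecast": 1,                        # 0 = not a forecast (assumed default if absent)
            "dimensions": "longitude latitude time height2m reftime1 leadtime2",
            "units": "K",
            "cell_methods": "area: time: mean",
        }
    }

    # hypothetical coordinate-table entries
    coordinate_entries = {
        "reftime1": {
            "out_name": "reftime",
            "standard_name": "forecast_reference_time",
            "long_name": "Start date of the forecast",
            # units and calendar would be taken from the variable's time coordinate
        },
        "leadtime2": {
            "out_name": "leadtime",
            "standard_name": "forecast_period",
            "long_name": "Time elapsed since the start of the forecast",
            "must_have_bnds": "yes",              # "yes" for time means, "no" for synoptic fields
        },
    }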

We could also consider explicitly including the calendar attribute for reftime, but I would recommend that we simply let it be known that the calendar for reftime must be the same as the calendar for the time coordinate.

CMOR should perform a number of checks (a short illustrative sketch follows this list), including alerting the user when:

  1. The user has unnecessarily defined "reftime" for a variable for which "forecast = 0".
  2. The user has failed to define "reftime" for a variable for which "forecast = 1".
  3. The "reftime" precedes the first time-slice stored. (If it doesn't, then clearly the user has incorrectly defined either time or reftime.)

Note, there is an inquiry at https://github.com/PCMDI/cmor/issues/623#issuecomment-939554698 asking for clarification of what the requirements are for forecast time.

I realize that making CMOR easy for others to use and helping guarantee that forecast data requirements are met may not be possible given resource constraints, but I hope it is. Otherwise, what has already been implemented in this regard will have to do (and perhaps that is more than I am aware of).

taylor13 commented 3 years ago

Just realized, I left out the requirement that the file name also should include the reference date, so this too could be generated automatically by CMOR. Take the above as trying to cover the bases and make it easier for multiple groups to use cmor for forecast data. It's possible that a quasi-independent call to a single "subroutine" that is pretty much independent of CMOR could accomplish this (without getting into the guts of CMOR), which would be easier to code and easy enough for users, if not fully "integrated" with the workings of CMOR. I'm sure you've already considered these things. I still need to better understand what you've actually implemented.

taylor13 commented 3 years ago

Also, it would be nice if PrePARE (the CMOR checker) could verify that these new forecast requirements had been met.

piotr-florek-mohc commented 3 years ago

Hi Karl, thank you for your thoughtful comments. I agree that the "quick and dirty" approach implemented in #623 might be a bit too much on the "dirty" side, and it would be much better to align this modification with the standard CMOR API, using MIP tables to enable the new time coordinates instead of relying on a call to my bespoke function. My C is now slightly better than it used to be, and I have learned a bit more about CMOR's structure, so I'll take what I had done as a starting point and will try to edge it towards a more elegant solution.

taylor13 commented 3 years ago

I should say that I appreciate your strategy of implementing changes in a way that would likely minimize the chance of causing some subtle conflicts with the rest of CMOR. That was sensible and smart. So the idea of not trying to integrate them with the rest of the structure might still be the best way to go. But thanks for thinking about some other options.

wachsylon commented 3 years ago

Hi all, I also liked @taylor13's detailed description very much. I am running it as a thought experiment for a hypothetical CMIP7, and I do not understand how a flag can be used for additional dimensions. Wouldn't we need two variable entries in one table, e.g., Amon:

tas
dimensions: time
...
tasForecast
dimensions: time reftime leadtime

? I fear I am making things more complicated, but if there could be an approach similar to the alevel dimension, where more than one coordinate is allowed for the dimension of the corresponding variable, we would save the additional variable entries. It could look like this:

tas
dimensions : tlev

and in coordinate.json we would have

reftime
            "generic_level_name": "tlev"
taylor13 commented 3 years ago

I hadn't thought about the extra variables needed. We could, of course, define two tables (e.g., "Amon" and "Amonforecast"), which would be identical except for the value of the "forecast" flag. In Amon it would be forecast=0 for all variables, whereas in Amonforecast it would invariably be forecast=1.

By the way, it is hoped that in CMIP7 we'll avoid identifying datasets by the CMOR table names, so perhaps having twice as many tables won't be too disruptive.

Another idea would be to use the same table for forecasts and non-forecasts, but have the user trigger different CMOR behavior for a variable by simply defining reftime for that variable. Once the user defined reftime, the forecast flag would be changed from 0 to 1 and everything would proceed as before.
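
A toy sketch of that triggering idea (pseudo-logic only, not CMOR internals; the helper name is hypothetical):

    def treat_as_forecast(variable_entry, user_defined_reftime):
        """A variable is handled as a forecast either because its table entry
        says so or because the user has defined reftime for it."""
        return variable_entry.get("forecast", 0) == 1 or user_defined_reftime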

matthew-mizielinski commented 3 years ago

@piotr-florek-mohc and I have had a think about this and have come to the conclusion that the presence of reftime and leadtime in the dimensions for a variable would be a suitable substitute for a forecast flag, which would leave a neater MIP table and potentially be easier to implement in PrePARE. We hope to have the changes ready next week.

@taylor13, are you happy with this, or would you prefer that the forecast flag be included to highlight the difference between variables with and without the extra coordinates?

The changes are still going to be a little rough around the edges within CMOR, as we are trying to avoid being too disruptive to the existing code, since that could easily take a significant amount of time for limited benefit. I am open to revising the approach once we understand the implications of the changes, i.e. feed our first approach into the next version of CMOR and then revise in the new year following experience of its use.

When we start looking forward to CMIP7, I'm not 100% sure what the impacts are likely to be if we have some data with these forecast time coordinates (e.g. DCPP-like) and some without (e.g. the DECK and Scenarios), so we might need to consult about this functionality being used within CMIP in the future. I could see us effectively having one set of tables with the forecast coordinates and another without, at which point we'd need to think about how best to maintain this. We could use separate entries in the same table, as @wachsylon has suggested, but I am not fond of the situation we have at the moment with multiple variables pointing at the same outname (e.g. hus and hus27 in Emon).

taylor13 commented 3 years ago

As you say, a "forecast" flag isn't strictly needed in the CMOR table, because it can be inferred from the dimensions specification, and I too prefer a separate table for variables that are requested for forecasts.

Eliminating the "forecast" flag from the table does, as you say, not change its structure, so that's a bit less disruptive. Two questions though: 1) within CMOR do you set an internal forecast flag, or do you interrogate the "dimensions" string each time you need to know whether you are dealing with a forecast? 2) Do you think that in the future we will confront a request where multiple forecast start times (reftimes) might need to be accommodated by CMOR? If so, it might be nice to be able to indicate that with a forecast flag value different from both "no forecast" and "a single forecast". [I guess we could always add it if needed later.]

matthew-mizielinski commented 3 years ago

Hi @taylor13

Regarding 1: @piotr-florek-mohc, could you comment here?

Regarding 2: I can imagine needing something like this if we attempt to combine the data from multiple forecasts within an ensemble into a single object. I recall seeing something where an ensemble of datasets was concatenated along a dimension with an associated ensemble coordinate. In this case we'd be looking at a non-scalar reftime coordinate and a 2D leadtime coordinate to capture the required information for ensemble members initialised at different times.

This feels like quite a big step to take with CMOR, and my instinct is that this would be for a downstream tool to handle rather than attempting to push all data through CMOR at once. From a workflow perspective it would be far quicker to push 10 datasets through CMOR separately in parallel and then carefully combine the results rather than push all data through a serial process. This isn't to say it couldn't be done, but I think until we have a requirement or clear motivation for this I would be tempted to wait.

piotr-florek-mohc commented 3 years ago

Hi Karl,

  1. within CMOR do you set an internal forecast flag, or do you interrogate the "dimensions" string each time you need to know whether you are dealing with a forecast?

The latter (although I did it in a different way in the original implementation - I was checking for the presence of the reference time variable within the netCDF file).

The main problem I have is that CMOR requires some preparatory work to set up variables (and it is during this step that MIP tables are interrogated and metadata pulled and ingested), and this preparatory work might or might not include the leadtime variable, depending on how much of this we would like to automate.

(Also, if "leadtime" is mentioned in the "dimensions" string, CMOR would initially still expect it to be a real, non-auxiliary coordinate, increasing the number of temporal dimensions; so this needs to be tweaked to make both "reftime" and "leadtime" a special case for which this behaviour would be disabled.)

taylor13 commented 3 years ago

thanks to both of you. I'm comfortable with your approach (not that you should care too much about that).

If I understand what you say, @piotr-florek-mohc, for CMOR not to become confused, we must reserve "reftime" and "leadtime" for use as coordinate variables for the special case we're considering. I think that constraint on their use is acceptable. Or did you have something else in mind that you wanted me to comment on?

piotr-florek-mohc commented 3 years ago

Hi @taylor13 ,

If I understand what you say, @piotr-florek-mohc, for CMOR not to become confused, we must reserve "reftime" and "leadtime" for use as coordinate variables for the special case we're considering. I think that constraint on their use is acceptable. Or did you have something else in mind that you wanted me to comment on?

After a short discussion with @matthew-mizielinski, we came up with a more elegant solution: marking reftime and leadtime with special flags in the CMIPx_coordinates.json file, so that CMOR would be able to recognise them correctly as scalar/auxiliary temporal coordinates without hardcoding their standard names.
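
As a hedged illustration of that idea (the flag name below is invented for illustration; the actual key merged into CMOR may well differ):

    coordinate_entries = {
        "reftime": {
            "standard_name": "forecast_reference_time",
            "forecast_coordinate": "scalar",      # hypothetical flag: scalar reference time
        },
        "leadtime": {
            "standard_name": "forecast_period",
            "forecast_coordinate": "auxiliary",   # hypothetical flag: auxiliary, same length as time
        },
    }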

taylor13 commented 3 years ago

Nice! Would this be backward compatible with old coordinate tables that would have no such flag entries? (i.e., in the coordinate tables, would the flag be optional or omitted for all other types of coordinates?)

piotr-florek-mohc commented 3 years ago

Hi @taylor13, apologies for the late reply; I wanted to make sure that I could make it backward compatible, and while that is the case, it's not the most elegant solution and I'm not entirely happy with the way I implemented it.

One of the complications is that reftime was already implemented in 3.2.4 (most of it in 4676b7d82bd3d6c9b63cd560b67bad6a1a440237), and it complicated CMOR's axis handling a bit. I don't quite follow what happens there and couldn't make my code work without breaking that bit, so in the end I call the leadtime-creating function as a special case after everything else is done.

mauzey1 commented 2 years ago

@piotr-florek-mohc

Looking over the new features added to CMOR in #634, I've noticed that there is a function in cmor_func_def.h that is not defined anywhere.

https://github.com/PCMDI/cmor/blob/847179fbf3763b47bd73d57ff177edafbf18e34d/include/cmor_func_def.h#L111

Is the function calculate_leadtime_coord meant to be cmor_calculate_leadtime_coord?

I've also noticed that there is a Python interface for the function but not for Fortran. Should we have one?

mauzey1 commented 2 years ago

I'm guessing this issue can be closed since the lead time coordinate has already been added to CMOR via #634. Please reopen if further work is needed.