Support for years with 5 digits

wachsylon commented 2 years ago

Hi, for paleo simulations, we have simulation runs that go beyond 10k years. CMOR only writes 4 digits for the year, which may lead to parsing problems when the simulation time goes beyond 10000 years.

Maybe, CMOR could support a parameter 'DIGITS_YEARS'?

Best, Fabi

wachsylon commented 2 years ago

OK, that might be more complicated than I first thought. With 6 digits it is unclear whether the string is a 6-digit year or a 4-digit year followed by a month. But 5 should be OK, or am I missing something?

durack1 commented 2 years ago

@wachsylon interesting suggestion. I wonder if @taylor13 has some insights about how this has been dealt with within PMIP (and the ISMIP6 experiment offshoots noted below), which consider very long timescales that are not well represented in modern calendars.

I just took a peek, and ism-lig127k-std is the only CMIP6 experiment that requests more than 9999 years:

        "ism-lig127k-std":{
            "activity_id":[
                "ISMIP6"
            ],
            "additional_allowed_model_components":[
                ""
            ],
            "description":"Last interglacial simulation of ice sheet evolution driven by PMIP lig127k",
            "end_year":"",
            "experiment":"offline ice sheet forced by ISMIP6-specified AGCM last interglacial output",
            "experiment_id":"ism-lig127k-std",
            "min_number_yrs_per_sim":"20000",
            "parent_activity_id":[
                "no parent"
            ],
            "parent_experiment_id":[
                "no parent"
            ],
            "required_model_components":[
                "ISM"
            ],
            "start_year":"",
            "sub_experiment_id":[
                "none"
            ],
            "tier":"3"
        },

It's worth pulling @jypeter into this discussion too

taylor13 commented 2 years ago

The CMIP6 specifications for the "time_range" appearing in the filenames are:

The <time_range> is a string generated consistent with the following:
If frequency = “fx” then
                  <time_range>=””
else
                <time_range> = N1-N2 where N1 and N2 are integers of the form
                                      ‘yyyy[MM[dd[hh[mm[ss]]]]][<suffix>]’ (expressed as a string, 
                                      where ‘yyyy’, ‘MM’, ‘dd’, ‘hh’, ‘mm’ and ‘ss’ are 
                                      integer year, month, day, hour, minute, and second, 
                                      respectively)
endif

where <suffix> is defined as follows:
if the variable identified by variable_id has a time dimension with a “climatology” 
          attribute then
                   suffix = “-clim”
else
                   suffix = “”
endif

and where the precision of the time_range strings is determined by the “frequency” 
global attribute as specified in Table 2.

see https://goo.gl/v1drZl

So as @wachsylon has noted, if we allow 6 digits for year, unambiguous interpretation of the date is impossible without also determining the frequency. Since all current options have an even number of digits for the dates, we could allow year to be either 4 or 5 digits without knowledge of the frequency. The template would become [Y]YYYY[MM[dd[...
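A minimal sketch of how a parser could apply the [Y]YYYY rule, assuming only the date precisions in the template above (every field after the year is two digits wide), so an odd total length can only mean a 5-digit year. The names `split_date` and `parse_time_range` are hypothetical and are not part of CMOR or any CMIP tool.

```python
import re

# Two-digit fields that may follow the year, in template order.
TAIL_FIELDS = ("month", "day", "hour", "minute", "second")


def split_date(stamp):
    """Split one '[Y]YYYY[MM[dd[hh[mm[ss]]]]]' string into integer fields."""
    if not re.fullmatch(r"[0-9]{4,15}", stamp):
        raise ValueError(f"not a valid date string: {stamp!r}")
    # Every field after the year is two digits wide, so an odd total
    # length can only come from a 5-digit year.
    year_width = 5 if len(stamp) % 2 else 4
    tail = stamp[year_width:]
    fields = {"year": int(stamp[:year_width])}
    for name, start in zip(TAIL_FIELDS, range(0, len(tail), 2)):
        fields[name] = int(tail[start:start + 2])
    return fields


def parse_time_range(time_range):
    """Return (start_fields, end_fields, is_climatology) for 'N1-N2[-clim]'."""
    climatology = time_range.endswith("-clim")
    if climatology:
        time_range = time_range[: -len("-clim")]
    start, end = time_range.split("-")
    return split_date(start), split_date(end), climatology
```

With this rule, `parse_time_range("1190001-1190012")` reads both dates as year 11900 with months 1 and 12, while `20220215` is still read as 15 February 2022. A 6-digit year would break the parity trick, which is the ambiguity described above.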

Is that a good idea? I don't think modifying CMOR would be a problem, but folks trying to parse the date with a 5-digit year might have problems. Does anyone (@durack1 @mauzey1 @matthew-mizielinski @jypeter @mjuckes @davidhassell @martinjuckes) know of any CMIP infrastructure software that parses the dates in the CMIP6 file names?

durack1 commented 2 years ago

@MartinaSt pinging you here

wachsylon commented 2 years ago

If we allow [Y]YYYY, that would mean allowing different numbers of digits within atomic datasets. E.g. starting from 0001 and going up to 99999 would look awkward; however, I cannot think of an issue any software would have with it. As a comparable example, variant_label also begins with r1 instead of r01/r001 when there are more than 9/99 realizations.

For ism-lig127k-std, it could be that the request only includes yearly frequencies so that there will be no ambiguities for that experiment. I learned that in our paleo project PalMod2, we have experiments going beyond 100 k AND monthly frequency output to be published.

A solution might be to use sub_experiment_id. The experiment then can be split up into parts registered and published as sub experiments.

wachsylon commented 2 years ago

> For ism-lig127k-std, it could be that the request only includes yearly frequencies so that there will be no ambiguities for that experiment.

Never mind! Even daily output is requested :)

matthew-mizielinski commented 2 years ago

For this edge case I don't have a big problem with extending the format to allow for one extra digit to cover years 10k-99k, but as Karl notes going to a 6 digit year will make interpretation of the date numbering with the current naming scheme tricky. We need to have a think about whether there are some sensible tweaks to the naming convention we use for the future to explicitly include frequency, without introducing too much in the way of disruption for users.

I wouldn't be surprised if some downstream tools will struggle to interpret the new date strings as and when they come across data formatted in this way, but as noted above this is the only experiment within CMIP6 that has this extent.

As an experiment I've just run a test and have managed to produce a file for an existing CMIP6 simulation with a 5 digit year; tas_Amon_HadGEM3-GC31-MM_amip_r1i1p1f3_gn_1190001-1190012.nc. No changes to CMOR were required here (although I had to adjust my tools slightly), and PrePARE passed this fine.

The next question I would pose is whether the ESGF publisher and associated systems will be happy with this (@sashakames, any thoughts?).

taylor13 commented 2 years ago

One clarification. I wrote the template as [Y]YYYY because we want to make it generally backward compatible. For runs that might be expected to have values larger than 9999, we might recommend or insist that all 5 digits be included for all years, so, for example, "02022", not "2022" would designate this year in such runs.
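A small sketch of the padding idea, assuming a per-run year width is chosen up front. The helper `year_token` and its `digits` argument are hypothetical (in the spirit of the 'DIGITS_YEARS' parameter suggested in the original post), not an existing CMOR option.

```python
def year_token(year, digits=4):
    """Zero-pad a non-negative year to a fixed number of digits."""
    if year < 0 or year >= 10 ** digits:
        raise ValueError(f"year {year} does not fit in {digits} digits")
    return f"{year:0{digits}d}"


years = [9999, 10000, 2022]

# Mixed widths: a plain string sort no longer matches time order.
mixed = sorted(str(y) for y in years)

# Fixed 5-digit width: string order and time order agree.
padded = sorted(year_token(y, digits=5) for y in years)

print(mixed)   # ['10000', '2022', '9999']
print(padded)  # ['02022', '09999', '10000']
```

One practical argument for insisting on all 5 digits within such a run is that file listings sorted by name then stay in chronological order.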

durack1 commented 2 years ago

Haven't thought this through, but the time format could be tweaked from 20220215-20220216 to 2022-02-15-2022-02-16; this would then naturally allow any number of year digits, e.g. 100000 in the case of PMIP. Of course, we are adding 4 characters (-), but that does provide flexibility. I haven't thought about extending this to sub-daily (including hour info).

taylor13 commented 2 years ago

Yes, for a future DRS version, we could alter it as you suggest (although the hyphen separating the two dates would be more difficult to identify; I guess you could require the year to be at least 3 digits and search for the hyphen that precedes a string segment with more than 2 characters and no hyphen, but that is a bit complicated). The new template would not be backward compatible with the current DRS, so probably not a good option for immediate adoption.
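The hyphen search described above could be sketched as follows. This assumes years of at least 3 digits, exactly 2 digits for every other field, and daily or coarser precision; `split_hyphenated_range` is a hypothetical name, not part of any DRS tool.

```python
def split_hyphenated_range(time_range):
    """Split 'Y-MM-DD-Y-MM-DD' style ranges into (start, end) strings.

    The separating hyphen is the one that precedes the first segment,
    after the leading year, that is longer than 2 characters.
    """
    segments = time_range.split("-")
    for index in range(1, len(segments)):
        if len(segments[index]) > 2:
            start = "-".join(segments[:index])
            end = "-".join(segments[index:])
            return start, end
    raise ValueError(f"no second year found in {time_range!r}")


print(split_hyphenated_range("2022-02-15-2022-02-16"))
print(split_hyphenated_range("100000-01-200000-12"))
```

The rule works for these cases but silently depends on the 3-digit minimum for years, which illustrates why it is "a bit complicated" as a specification.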

matthew-mizielinski commented 2 years ago

> Haven't thought this through, but the time format could be tweaked from 20220215-20220216 to 2022-02-15-2022-02-16; this would then naturally allow any number of year digits, e.g. 100000 in the case of PMIP. Of course, we are adding 4 characters (-), but that does provide flexibility. I haven't thought about extending this to sub-daily (including hour info).

If we take this route we could go with a double dash as the separator, e.g. 2022-02-15--2022-02-16 and 2022-02--2052-03, but as Karl notes this is one for the future. There is a whole ISO standard on dates and times that we could use; for sub-daily frequencies we could use 2022-02-15T0000--2022-02-16T0000, for example. ISO 8601 appears to use / to separate the start and end dates of a period, but I think that would just be too confusing here.
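A rough sketch of how the double-dash form could be parsed, assuming non-negative years of any width, hyphen-separated date fields, and an optional `Thhmm` clock part as in the examples above. `parse_stamp` and `parse_double_dash_range` are hypothetical names.

```python
DATE_FIELDS = ("year", "month", "day")


def parse_stamp(stamp):
    """Parse 'Y[-MM[-DD]][Thhmm]' into a dict of integer fields."""
    date_part, _, clock = stamp.partition("T")
    pieces = date_part.split("-")
    if not 1 <= len(pieces) <= 3 or not all(p.isdigit() for p in pieces):
        raise ValueError(f"bad date part: {date_part!r}")
    fields = dict(zip(DATE_FIELDS, (int(p) for p in pieces)))
    if clock:
        if len(clock) != 4 or not clock.isdigit():
            raise ValueError(f"bad clock part: {clock!r}")
        fields["hour"] = int(clock[:2])
        fields["minute"] = int(clock[2:])
    return fields


def parse_double_dash_range(time_range):
    """Split on the first '--' and parse both ends."""
    start, separator, end = time_range.partition("--")
    if not separator or not start or not end:
        raise ValueError(f"expected 'start--end', got {time_range!r}")
    return parse_stamp(start), parse_stamp(end)


print(parse_double_dash_range("100000-01-01--100000-12-31"))
print(parse_double_dash_range("2022-02-15T0000--2022-02-16T0000"))
```

Because the year is delimited by a hyphen rather than by position, its width never has to be guessed from the frequency.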

matthew-mizielinski commented 2 years ago

> One clarification. I wrote the template as [Y]YYYY because we want to make it generally backward compatible. For runs that might be expected to have values larger than 9999, we might recommend or insist that all 5 digits be included for all years, so, for example, "02022", not "2022" would designate this year in such runs.

Just thinking aloud; to have 5 digits used for years within an experiment we'd need to have the start / end dates or number of years included in the CMIP6_CV.json file, and then alter the behaviour of CMOR based on that value. However, there are also some experiments with a minimum number of years, but no maximum (e.g. piControl), which could (in theory) cross the 10k year boundary*. Trying to consistently handle this could get messy.

*A suitably fast model and commitment from the scientists running it would be required.

durack1 commented 2 years ago

> Just thinking aloud; to have 5 digits used for years within an experiment we'd need to have the start / end dates or number of years included in the CMIP6_CV.json file, and then alter the behaviour of CMOR based on that value. However, there are also some experiments with a minimum number of years, but no maximum (e.g. piControl), which could (in theory) cross the 10k year boundary*. Trying to consistently handle this could get messy.
>
> *A suitably fast model and commitment from the scientists running it would be required.

Exactly, I don't see a path forward that doesn't break the existing YYYYMMDD DRS-defined format that is expected by CMIP6, but maybe I am missing something?

sashakames commented 2 years ago

As far as publishing goes, the first concern is ensuring that Python can parse the "days since YYYY[Y]-MM-DD" units string. We have been tripped up by several atypically formatted years with leading zeros. The second is whether Python's timedelta supports such long intervals in order to give the full range. I'm not sure to what extent those are tested.

To clarify, publishing is unaffected by the file naming scheme.
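For reference, the standard-library side of the two parsing concerns above can be checked directly. This sketch uses only Python's built-in `datetime` module; the helper `stdlib_can_represent` is a hypothetical name introduced for illustration.

```python
import datetime


def stdlib_can_represent(year):
    """Return True if datetime.date accepts January 1 of the given year."""
    try:
        datetime.date(year, 1, 1)
    except ValueError:
        return False
    return True


# datetime.date and datetime.datetime stop at year 9999.
print(datetime.MAXYEAR)             # 9999
print(stdlib_can_represent(9999))   # True
print(stdlib_can_represent(10000))  # False

# timedelta itself is not the bottleneck: it holds very long intervals.
span = datetime.timedelta(days=100_000 * 365)
print(span.days)                    # 36500000
print(datetime.timedelta.max.days)  # 999999999
```

So a 5-digit reference year cannot be held in a standard `datetime` object at all, whereas the interval length on its own is within `timedelta`'s range. Handling such dates needs a calendar-aware library rather than the standard library.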

taylor13 commented 2 years ago

You raise a good (different) point. If the usual Python codes can't handle the "units" attribute when the year exceeds "9999", or if they can't calculate elapsed time for those units, we're in trouble. Does anyone know of limitations in cdtime and similar modules?

durack1 commented 2 years ago

@sashakames that was where my mind had started to wander too. Within CF there are no examples that deviate from the "days since YYYY-MM-DD HH:MM:SS.x -x.xx" form, such as their example "seconds since 1992-10-8 15:15:42.5 -6:00".

They also include a paleoclimate calendar example:

double time(time) ;
  time:long_name = "time" ;
  time:units = "days since 1-1-1 0:0:0" ;
  time:calendar = "126 kyr B.P." ;
  time:month_lengths = 34, 31, 32, 30, 29, 27, 28, 28, 28, 32, 32, 34 ;

Details are from https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch04s04.html.

I agree that testing whatever we work toward through software packages is a key test.

durack1 commented 2 years ago

Ok and that answers that:

In [5]: import cdtime

In [6]: cdtime.relativetime(31,"".join(['days since 10000-01-01 0:0:0.0']))
Out[6]: 31.000000 days since 10000-01-01 0:0:0.0

In [7]: cdtime.relativetime(31,"".join(['days since 100000-01-01 0:0:0.0']))
Out[7]: 31.000000 days since 100000-01-01 0:0:0.0

In [8]: a = cdtime.relativetime(31,"".join(['days since 100000-01-01 0:0:0.0']))

In [9]: a
Out[9]: 31.000000 days since 100000-01-01 0:0:0.0

In [10]: a.torel('days since 1-1-1')
Out[10]: 36523917.000000 days since 1-1-1

In [11]: a.torel('days since 1-1-1 12:12:12.5 -8.0')
Out[11]: 36523916.491522 days since 1-1-1 12:12:12.5 -8.0

Looks like cdtime can deal with arbitrary stuff easily, I wonder how other packages work?
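The first `torel` result can be cross-checked by hand. This is a sketch under one assumption: that cdtime's default calendar is the mixed Julian/Gregorian one, in which Julian 1-1-1 falls two days before proleptic Gregorian 0001-01-01. The function name is hypothetical.

```python
def gregorian_days_before_year(year):
    """Days from proleptic Gregorian 0001-01-01 to <year>-01-01."""
    elapsed = year - 1
    leap_days = elapsed // 4 - elapsed // 100 + elapsed // 400
    return elapsed * 365 + leap_days


# Assumption: the epoch "1-1-1" is interpreted in the Julian calendar,
# which is two days earlier than proleptic Gregorian 0001-01-01.
JULIAN_EPOCH_OFFSET = 2

days = gregorian_days_before_year(100000) + 31 + JULIAN_EPOCH_OFFSET
print(days)  # 36523917
```

The total matches the 36523917 days reported in Out[10], which suggests the 6-digit reference year is being handled as ordinary calendar arithmetic rather than truncated.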

sashakames commented 2 years ago

@durack1 Good to know cdtime appears rather flexible, so it is a potential solution if there are problems with timedelta.

durack1 commented 2 years ago

It would be useful to pick up this thread with the experience that @tomvothecoder and @pochedls have been gaining from using xcdat with cftime.

davidhassell commented 2 years ago

Hi,

I only have experience with cftime, which certainly handles years with > 4 digits, but can only parse ISO 8601-style dates, e.g. with hyphen separators between the year, month and day.

>>> import cftime
>>> t1 = cftime._dateparse('days since 1234567-12-01 12:00', 'Gregorian')
>>> t2 = cftime._dateparse('days since 1234568-01-01 12:00', 'Gregorian')
>>> t1
cftime.datetime(1234567, 12, 1, 12, 0, 0, 0, calendar='standard', has_year_zero=False)
>>> t2
cftime.datetime(1234568, 1, 1, 12, 0, 0, 0, calendar='standard', has_year_zero=False)
>>> t2 - t1
datetime.timedelta(days=31)

I presume that moving to ISO 8601-style dates would be too harmful to backwards compatibility?


-- David Hassell National Centre for Atmospheric Science Department of Meteorology, University of Reading, Earley Gate, PO Box 243, Reading RG6 6BB http://www.met.reading.ac.uk/

taylor13 commented 2 years ago

For paleoclimate simulations (or simulations initiated in very early historic time, sometime Before the Common Era), a negative year might appear (although this would rule out use of both the "standard" and "julian" calendars). Perhaps we should think about how that would be handled too.

Perhaps insert a special character before the year? (e.g., "B" for BCE, or "M" for minus, or "N" for negative)
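Purely as an illustration of the marker idea, and not something defined by the DRS or implemented in CMOR, a prefix character could be encoded and decoded as below. The names `encode_year` and `decode_year`, the default marker "N", and the 5-digit width are all assumptions.

```python
def encode_year(year, digits=5, marker="N"):
    """Render a year as a fixed-width token, prefixing negatives with a marker."""
    if abs(year) >= 10 ** digits:
        raise ValueError(f"year {year} does not fit in {digits} digits")
    body = f"{abs(year):0{digits}d}"
    return marker + body if year < 0 else body


def decode_year(token, marker="N"):
    """Invert encode_year: a leading marker means a negative year."""
    if token.startswith(marker):
        return -int(token[len(marker):])
    return int(token)


print(encode_year(-12000))   # N12000
print(decode_year("N12000")) # -12000
```

A letter avoids a second meaning for the hyphen, which already separates the two ends of the time range, but it also means the date token is no longer purely numeric.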

durack1 commented 2 years ago

I've just marked this as a CMOR 4.0 item, as it would be great to catch this and other tweaks as we spec out a next-gen roadmap

taylor13 commented 4 months ago

As I read the above, we haven't really come to a consensus on how to proceed with this.

PCMDI / cmor

Support for years with 5 digits #648