cf-json / cf-json.github.io

Apache License 2.0

Time coordinate values as strings #14

Open aportagain opened 5 years ago

aportagain commented 5 years ago

Split out from https://github.com/cf-json/cf-json.github.io/issues/10 , where the discussion had gotten this far:

On 2019-03-29T05:52:10Z, @aportagain said:

Like @ChrisBarker-NOAA and @kwilcox have mentioned, I too wish that for our cf-json purposes we could just use some variation of nco-json as the actual format with a global Conventions attribute of CF-X.Y... Maybe not impossible, but right now one deal breaker remaining seems to be time coordinate variables: the current CF working draft to my understanding still only allows numerical values, but we really want to keep strings as an option for this, so technically cf-json "breaks" CF compliance in this respect. I'm not entirely across the whole udunits "coupling"... CF still seems to do this, but maybe the Common Data Model version 4 actually doesn't? I found a related trac ticket from over a decade ago: https://cf-trac.llnl.gov/trac/ticket/14 , and funny enough this has recently come up again in a calendar issue: cf-convention/cf-conventions#148 so maybe there's hope?

On 2019-03-29T15:45:45Z, @ChrisBarker-NOAA said:

Maybe not impossible, but right now one deal breaker remaining seems to be time coordinate variables: the current CF working draft to my understanding still only allows numerical values, but we really want to keep strings as an option for this, so technically cf-json "breaks" CF compliance in this respect.

No more than storing time with a different encoding in netCDF breaks CF. In order to work with datetimes in any environment, you need a decent datetime library — expecting that in a consumer of the data is quite reasonable.

Ideally, we have one spec that maps the netCDF data model to JSON, and one that maps CF to JSON (which is probably almost "just do the same thing as you do in netCDF"). We should not extend CF just to make it a little easier for readers without the right tools. (And most datetime software, I think in JavaScript as well, works with C-style seconds-since-an-epoch under the hood anyway.)

I'm not entirely across the whole udunits "coupling"... CF still seems to do this, but maybe the Common Data Model version 4 actually doesn't?

I hope not — I hate that too. But again, that’s an argument for CF — it’s a bad idea to change/extend CF itself just for JSON. And much as I don’t like the units coupling, if you see it as "unit handling and definitions for CF are specified in this other doc", it’s not so bad.

And funny enough this has recently come up again in a calendar issue: cf-convention/cf-conventions#148 so maybe there's hope?

Maybe — but that’s kinda stalled out.

On 2019-03-29T17:07:33Z, @BobSimons said:

Well, anyone can do whatever they want. But there are advantages to sticking to the standard and (in this case) not writing String times as dimension values in files that are supposed to be CF compliant (until the CF standard says it's okay and how to do it): While the format/meaning of the time strings may be obvious to you or a human and may even follow a standard like ISO 8601:2004(E), there is no standard way in CF to specify the format of a time string (e.g., Java's "yyyy-MM-dd'T'HH:mm:ssZ"), so there is no way for software written to follow the CF specification to deal with String dimension values and know what the format is (how to parse them). There are literally 1000's of time formats in use in scientific data files. Some of them can't even be deciphered by humans because 1 or 2-digit year values make the values ambiguous. Let's avoid this problem or deal with it properly (in CF). One of the big advantages of following a standard is that software can work with the files automatically. Otherwise, everyone has to write custom software to deal with each of the non-standard file variants.
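To make that ambiguity concrete, here is a minimal Python sketch (the sample string and formats are illustrative): the same non-standard value parses to three different dates depending on the assumed format, while an ISO 8601 timestamp has exactly one reading.

```python
from datetime import datetime

raw = "01/02/03"  # ambiguous: which field is the year?

# Three plausible readings of the same string:
us  = datetime.strptime(raw, "%m/%d/%y")  # US order: 2003-01-02
eu  = datetime.strptime(raw, "%d/%m/%y")  # European order: 2003-02-01
ymd = datetime.strptime(raw, "%y/%m/%d")  # year first: 2001-02-03

# An ISO 8601 timestamp leaves no room for interpretation:
iso = datetime.strptime("2003-01-02T00:00:00Z", "%Y-%m-%dT%H:%M:%SZ")
```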

ChrisBarker-NOAA commented 5 years ago

While the format/meaning of the time strings may be obvious to you or a human and may even follow a standard like ISO 8601:2004(E), there is no standard way in CF to specify the format of a time string (e.g., Java's "yyyy-MM-dd'T'HH:mm:ssZ")

Yes, there is: in the “timedelta units since an_epoch” string, the format of the epoch is specified—I’m pretty sure it’s ISO 8601.

The problem is that CF requires time coordinates to be stored in that “encoding”, rather than as an array of datetime strings.
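As a rough illustration of that encoding, here is a toy Python decoder using only the standard library (a real consumer would use something like cftime; this sketch ignores calendars and supports only a few unit names):

```python
from datetime import datetime, timedelta

# CF-style time coordinate: numeric offsets plus a units attribute.
units = "hours since 1980-01-01T00:00:00Z"   # the epoch itself is an ISO 8601 string
values = [0, 6, 12, 18]

# Split "<unit> since <epoch>" and parse the epoch string.
unit_name, _, epoch_str = units.partition(" since ")
epoch = datetime.strptime(epoch_str, "%Y-%m-%dT%H:%M:%SZ")

seconds_per = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}
times = [epoch + timedelta(seconds=v * seconds_per[unit_name]) for v in values]
# times[1] → 1980-01-01 06:00:00
```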

That was proposed and rejected a couple of years ago, though there is a whole active discussion about time that MAY re-open that discussion.

But while datetime strings may seem more JSON-friendly — I think the real driver is use cases: the CF way of describing time is a good one if you want to work with the time axis numerically (computing rates of change, etc.).

The string representation on the other hand is better for things like timestamps when a measurement was taken.

But these considerations really aren’t any different for JSON than netCDF.

Either way, someone is going to need a decent datetime library for working with time.

-CHB


aportagain commented 5 years ago

But while datetime strings may seem more JSON-friendly — I think the real driver is use cases: the CF way of describing time is a good one if you want to work with the time axis numerically (computing rates of change, etc.). The string representation on the other hand is better for things like timestamps when a measurement was taken. But these considerations really aren’t any different for JSON than netCDF.

I totally agree that the "real driver" is, or should be, common use cases, and I think that's where JSON and netCDF have at least one major difference: JSON is human-readable (or rather can be, and in the vast majority of use cases I've come across is), netCDF isn't. So allowing human-readable date / time strings (ideally probably a very small subset of ISO 8601, or maybe even some derivative of RFC3339 with some small modifications?) preserves this characteristic of JSON, which I think is very valuable (for development, especially across languages and related software ecosystems, manual manipulation in interactive environments of interpreted languages, logging, debugging, network inspection, in-browser inspection, database support, etc.).
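As a sketch of what that could look like (the JSON structure below is purely illustrative, not the actual cf-json spec), the string values are readable as-is in any inspector, and any ISO 8601 parser can consume them:

```python
import json
from datetime import datetime

# Hypothetical cf-json-like fragment with a string-valued time coordinate:
doc = json.loads("""
{
  "dimensions": {"time": 3},
  "variables": {
    "time": {
      "shape": ["time"],
      "attributes": {"standard_name": "time"},
      "data": ["2019-03-29T00:00:00", "2019-03-29T06:00:00", "2019-03-29T12:00:00"]
    }
  }
}
""")

# The values need no units attribute to be human-readable or machine-parseable:
stamps = [datetime.fromisoformat(s) for s in doc["variables"]["time"]["data"]]
```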

aportagain commented 5 years ago

Either way, someone is going to need a decent datetime library for working with time.

Good point. In environments where the common format for the storage or transport of this kind of data is netCDF, I think there's a high chance that you'll also be able to directly or indirectly use the official netCDF (C/C++/Fortran; from memory I think the Java one can't write netCDF4, right?) and/or udunits (C) libraries. JSON, on the other hand, is used in a lot of environments where that is not the case (e.g., in-browser, in-database, serverless cloud functions-as-a-service), so providing functionality to mimic udunits behaviour is tricky... while functionality to deal with ISO 8601 strings (or at least some subset) is widely available.

ChrisBarker-NOAA commented 5 years ago

JSON is human-readable ... netCDF isn't. So allowing human-readable date / time strings ... preserves this characteristic of JSON, which I think is very valuable (for development)

Good point -- I'll mention that the fact that the C netCDF lib's ncdump supports

"[-t] Output time data as date-time strings"

was used as an argument for why a string encoding for datetime was not necessary in netCDF.

ChrisBarker-NOAA commented 5 years ago

Either way, someone is going to need a decent datetime library for working with time.

... and/or udunits (C) libraries.

udunits is almost irrelevant here -- yes, it does handle translation of seconds to hours, etc. (and annoyingly defines "month" and "year" for such translations), but it is not a full-featured datetime lib. From the docs:

"You should use a true calendar package rather than the UDUNITS-2 package to handle time. "

To do real work, you need something more complete -- e.g. Python's datetime module.

JSON, on the other hand, is used in a lot of environments where that is not the case (e.g., in-browser, in-database, serverless cloud functions-as-a-service), so providing functionality to mimic udunits behaviour is tricky... while functionality to deal with ISO 8601 strings (or at least some subset) is widely available.

The string parsing is only a small part of what you need -- and it's the easy part (and ironically, not handled by Python's datetime :-) ). My point is that if you want to do things like compute how much time has passed between two timestamps, you need something beyond string parsing.
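A minimal Python illustration of that split (the timestamps are arbitrary; `fromisoformat` is the stdlib ISO parser that was only added in Python 3.7, which is the gap the aside above refers to):

```python
from datetime import datetime

# Parsing the ISO 8601 strings is the easy part...
t0 = datetime.fromisoformat("2019-03-29T05:52:10")
t1 = datetime.fromisoformat("2019-03-29T15:45:45")

# ...the datetime library earns its keep in the calendar-aware arithmetic:
elapsed = t1 - t0
# elapsed.total_seconds() → 35615.0 (9 h 53 m 35 s)
```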

JSON is by no means only used in JavaScript environments, but there have GOT to be datetime libs available.

aportagain commented 5 years ago

JSON is human-readable ... netCDF isn't. So allowing human-readable date / time strings ... preserves this characteristic of JSON, which I think is very valuable (for development)

Good point -- I'll mention that the fact that the C netCDF lib's ncdump supports

"[-t] Output time data as date-time strings"

was used as an argument for why a string encoding for datetime was not necessary in netCDF.

Right, I vaguely remember reading that right at the end of that old trac ticket... So can we say that the fact that we're now having this discussion in the context of human-readable JSON rather than (only) binary netCDF is a new argument, and in favour of reopening that discussion for CF?

ChrisBarker-NOAA commented 5 years ago

So can we say that the fact that we're now having this discussion in the context of human-readable JSON rather than (only) binary netCDF is a new argument, and in favour of reopening that discussion for CF?

Well, I'm not sure where the community is on the idea of de-coupling CF from netcdf at this point -- so I have no idea if that's an argument that will "fly".

But NOTE: IF one allows datetime strings in cf-json, then you are going to run deep into the whole calendar question:

What does "gregorian" mean? Does cf-json allow other calendars? (If so, how to convert to datetime strings?) What about UTC and leap seconds?

None of this is easy :-) -- but at least if you stick with the time encoding in CF, the problems are the same everywhere :-)

aportagain commented 5 years ago

Don't current CF time coordinate conventions already not only allow but actually require a datetime string "encoding", namely in the reference time in the units attribute? And therefore have all the associated calendar issues anyway? How does the encoding (as in "numerical value in single given time unit since reference date and time" vs. "year/month/day/hour/minute/second/fraction/offset" with an implied since-calendar-origin) of the actual time coordinate variable values change that?

I think I'm roughly aware of at least some of these calendar issues, but obviously not an expert :) Do I need to read all the way through that "Add calendars gregorian_tai and gregorian_utc" CF issue to understand this, or is there a simpler explanation that I'm missing?

I agree that for further computational / numerical processing, having the numerical values directly available is more efficient. But once I've decided to work with JSON, computational / numerical efficiency is really not the first thing on my mind... more like the last :)

aportagain commented 5 years ago

@BobSimons , I agree with pretty much everything you said, I'm just honestly not sure whether we're reaching slightly different conclusions from the same premises, or we're both preaching to the choir and (I?) just haven't realised it yet... :)

But there are advantages to sticking to the standard and (in this case) not writing String times as dimension values in files that are supposed to be CF compliant (until the CF standard says it's okay and how to do it):

Absolutely, I'll make every effort not to claim compliance unless it is truly compliant. Which is one of the reasons why I think we can't quite (yet) fully ditch cf-json and just use some new level of nco-json with a global conventions attribute of CF-X.Y. Sounds to me like we may either need to try to get one or two changes in CF (maybe it's really just time coordinate values as strings?), or keep cf-json as a separate, but very thin spec (convention?) that says use nco-json as the actual format, and adhere to CF-X.Y, but with a very short list of modifications.

While the format/meaning of the time strings may be obvious to you or a human and may even follow a standard like ISO 8601:2004(E), there is no standard way in CF to specify the format of a time string (e.g., Java's "yyyy-MM-dd'T'HH:mm:ssZ"), so there is no way for software written to follow the CF specification to deal with String dimension values and know what the format is (how to parse them). There are literally 1000's of time formats in use in scientific data files. Some of them can't even be deciphered by humans because 1 or 2-digit year values make the values ambiguous. Let's avoid this problem or deal with it properly (in CF).

Are these ambiguities in ISO8601 that you're thinking of? I haven't encountered any, but I've only ever dealt with certain subsets of the ISO8601 options. And I very much hope that any CF or cf-json reference to ISO8601 would be restricted to a small and well-defined subset or derivation (which I think is kind of also what THREDDS does? https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/CDM/CalendarDateTime.html#ISO_String). I really wish RFC3339 wasn't restricted to the "current era", but as far as I know, it neither allows more than four digits nor a sign for years, right? Still, works fine for many, many use cases, and maps directly to JSON Schema and consequently OpenAPI (string data type "date-time" format modifier; https://json-schema.org/latest/json-schema-validation.html#rfc.section.7.3.1 and https://swagger.io/docs/specification/data-models/data-types/#string) so might still be worth considering...

One of the big advantages of following a standard is that software can work with the files automatically. Otherwise, everyone has to write custom software to deal with each of the non-standard file variants.

Yup, exactly that last bit is one of the reasons why I like something like a subset of ISO8601, just that I'm probably thinking of a different environment / software ecosystem: if we only allow the current CF / udunits way, the consumers that I'm thinking of would be forced to reimplement a subset of the udunits functionality, instead of being able to plug-and-play one of the many existing implementations of ISO8601 parsing / generating functionality.

ChrisBarker-NOAA commented 5 years ago

Don't current CF time coordinate conventions already not only allow but actually require a datetime string "encoding", namely in the reference time in the units attribute?

No -- I'm not sure why there is discussion about that; parsing the string is NOT the problem.

And therefore have all the associated calendar issues anyway? How does the encoding (as in "numerical value in single given time unit since reference date and time" vs. "year/month/day/hour/minute/second/fraction/offset" with an implied since-calendar-origin) of the actual time coordinate variable values change that?

I was thinking primarily of converting a CF netcdf file to JSON, but the same applies if you have, e.g., model data that is naturally in seconds since the start of the model run.

Do I need to read all the way through that "Add calendars gregorian_tai and gregorian_utc" CF issue to understand this, or is there a simpler explanation that I'm missing?

Well, that's not a very efficient way to get the info -- but yes, kinda :-(

The key issue is that there are two primary use cases. The one that CF was originally developed for, often model data, is naturally in some timedelta since a timestamp -- that is, you started the model at some datetime, and it produces output at every timestep.

A very different use case is things like measurements taken by an instrument, with a timestamp provided when the measurement was taken -- these are discrete events, each with a timestamp, so a string timestamp is natural here. (Though there are still issues with calendars, and leap seconds, and timezones...)
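A small Python sketch of re-encoding such instrument timestamps the CF way (timestamp values are illustrative; standard library only, leap seconds ignored):

```python
from datetime import datetime

# Instrument readings, each tagged with an ISO 8601 timestamp:
stamps = ["1985-01-20T10:43:00", "1985-01-20T10:44:30", "1985-01-20T10:46:00"]
parsed = [datetime.fromisoformat(s) for s in stamps]

# The same axis re-encoded CF-style, relative to the first reading
# (i.e. units would be "seconds since 1985-01-20T10:43:00"):
epoch = parsed[0]
offsets = [(t - epoch).total_seconds() for t in parsed]
# offsets → [0.0, 90.0, 180.0]
```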

But in the end, this is a big mess -- but if cf-json follows CF, at least it's the SAME mess :-)

I agree that for further computational / numerical processing, having the numerical values directly available is more efficient. But once I've decided to work with JSON, computational / numerical efficiency is really not the first thing on my mind... more like the last :)

It's not about efficiency, it's about computability:

How many seconds have passed between Jan 20, 1985 at 10:43 and March 3, 1985 at 12:13?

You need a datetime lib to do that, whereas if you want to know the duration between two points on a time axis encoded as "seconds since 1980-01-01T00:00", it's a trivial computation.
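A Python sketch of that contrast, using the example above (the axis values are illustrative, computed on a leap-second-free proleptic Gregorian calendar):

```python
from datetime import datetime

# The string route: parse both timestamps, then do calendar-aware subtraction.
a = datetime(1985, 1, 20, 10, 43)
b = datetime(1985, 3, 3, 12, 13)
via_strings = (b - a).total_seconds()

# The CF route: the file already stores values on a numeric axis, e.g.
# time:units = "seconds since 1980-01-01T00:00:00" (values illustrative):
axis_a = 159_532_980.0   # a, encoded on that axis
axis_b = 163_167_180.0   # b, encoded on that axis
via_axis = axis_b - axis_a   # trivial: no calendar logic needed

# Both give 3634200.0 seconds (42 days + 90 minutes).
```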

And once you have that time lib, then working with the standard CF approach is not hard.

And no, we aren't having to implement a subset of udunits -- more like a superset, when it comes to time processing -- udunits does not provide much.

BobSimons commented 5 years ago

Please see my comments below...

@BobSimons , I agree with pretty much everything you said, I'm just honestly not sure whether we're reaching slightly different conclusions from the same premises, or we're both preaching to the choir and (I?) just haven't realised it yet... :)

Yes.

But there are advantages to sticking to the standard and (in this case) not writing String times as dimension values in files that are supposed to be CF compliant (until the CF standard says it's okay and how to do it):

Absolutely, I'll make every effort not to claim compliance unless it is truly compliant. Which is one of the reasons why I think we can't quite (yet) fully ditch cf-json and just use some new level of nco-json with a global conventions attribute of CF-X.Y. Sounds to me like we may either need to try to get one or two changes in CF (maybe it's really just time coordinate values as strings?), or keep cf-json as a separate, but very thin spec (convention?) that says use nco-json as the actual format, and adhere to CF-X.Y, but with a very short list of modifications.

I agree very much about making a few small changes to CF. I wrote up 6 basic proposals and tried to get CF to make progress on the first two. But it was horrible. It just got bogged down in endless discussion where people were talking past each other. I gave up on trying to make changes to CF. Clearly, some people are more suited to that process and have the time and patience for it. Perhaps you will have better luck than I did.

While the format/meaning of the time strings may be obvious to you or a human and may even follow a standard like ISO 8601:2004(E), there is no standard way in CF to specify the format of a time string (e.g., Java's "yyyy-MM-dd'T'HH:mm:ssZ"), so there is no way for software written to follow the CF specification to deal with String dimension values and know what the format is (how to parse them). There are literally 1000's of time formats in use in scientific data files. Some of them can't even be deciphered by humans because 1 or 2-digit year values make the values ambiguous. Let's avoid this problem or deal with it properly (in CF).

Are these ambiguities in ISO8601 that you're thinking of? I haven't encountered any,

Yes. There was a 1988 version of ISO 8601 that listed a larger number of formats. The 2004 version of ISO deprecated many of those in favor of a few formats (one for each of several different purposes). I advocate that people use that ISO 8601:2004(E) standard format, for example, 1985-01-02T00:00:00Z (but also with more or less precision, and with different time offset formats also allowed). There is a 2019 version of ISO 8601 -- I haven't read it yet. See [https://en.wikipedia.org/wiki/ISO_8601]
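As a sketch of what a narrow profile like that could look like as a validation pattern (this particular regex is purely illustrative, not any official CF, cf-json, or ERDDAP rule): it accepts only YYYY-MM-DDTHH:MM:SS with optional fractional seconds and a Z or ±HH:MM offset.

```python
import re

# Hypothetical narrow ISO 8601:2004(E)-style profile:
PROFILE = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)

ok1 = PROFILE.match("1985-01-02T00:00:00Z")         # accepted
ok2 = PROFILE.match("1985-01-02T00:00:00.5+09:30")  # accepted
bad1 = PROFILE.match("01/02/85")                    # ambiguous legacy format: rejected
bad2 = PROFILE.match("1985-01-02")                  # date-only: outside this profile
```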

but I've only ever dealt with certain subsets of the ISO8601 options. And I very much hope that any CF or cf-json reference to ISO8601 would be restricted to a small and well-defined subset or derivation (which I think is kind of also what THREDDS does? https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/CDM/CalendarDateTime.html#ISO_String). I really wish RFC3339 wasn't restricted to the "current era", but as far as I know, it neither allows more than four digits nor a sign for years, right?

RFC3339 is similar to ISO 8601:2004(E) (same format) but less suitable because it is just for the Gregorian calendar and so doesn't deal well with dates before 1582 (the switch from Julian to Gregorian) or with other calendars (360 day? 365 day? used in models). (It's complicated.) RFC3339 is really just intended for a very limited scope: recent dates and common usage (e.g., dates on web documents).

But ISO 8601 also doesn't deal with other eras. ISO 8601 says groups can extend use of ISO 8601 to BCE years for use in their group by following an agreed upon convention. For this, I advocate (and use in ERDDAP) using Astronomical Year Numbers (2 CE is year 2 in astronomical years, 1 CE is year 1, 1 BCE is year 0, 2 BCE is year -1, etc), not eras. Astronomical Year Numbers have a lot of advantages. See [https://en.wikipedia.org/wiki/Astronomical_year_numbering]
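As a sketch of that convention (a hypothetical helper, not code from ERDDAP): astronomical year numbering just shifts BCE years by one so the number line is continuous through year 0.

```python
def astronomical_year(year: int, era: str) -> int:
    """Convert an era-style year to astronomical year numbering:
    CE years are unchanged; 1 BCE is year 0, 2 BCE is year -1, etc."""
    if era == "CE":
        return year
    if era == "BCE":
        return 1 - year
    raise ValueError(f"unknown era: {era!r}")

# astronomical_year(2, "CE")  → 2
# astronomical_year(1, "BCE") → 0
# astronomical_year(2, "BCE") → -1
```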

Still, works fine for many, many use cases, and maps directly to JSON Schema and consequently OpenAPI (string data type "date-time" format modifier; https://json-schema.org/latest/json-schema-validation.html#rfc.section.7.3.1 and https://swagger.io/docs/specification/data-models/data-types/#string) so might still be worth considering...

One of the big advantages of following a standard is that software can work with the files automatically. Otherwise, everyone has to write custom software to deal with each of the non-standard file variants.

Yup, exactly that last bit is one of the reasons why I like something like a subset of ISO8601, just that I'm probably thinking of a different environment / software ecosystem: if we only allow the current CF / udunits way, the consumers that I'm thinking of would be forced to reimplement a subset of the udunits functionality, instead of being able to plug-and-play one of the many existing implementations of ISO8601 parsing / generating functionality.

I understand. That is why I said "Well, anyone can do what they want" within their community. Yes, dealing with CF times is a pain, but CF has answers for many of the complications related to time, and it is (reasonably) easy to work with numerically given a good date time library. Although standardizing on ISO 8601 (hopefully :2004(E)) works well for humans, you still need a library to deal with those values if you want to compare them, manipulate them, analyze related data, etc.

The real problem is that human dealings with time are horribly complex. The more you get into it, the more complex you see it is. There are no easy solutions (unless you limit the scope). (E.g. how does a given system deal with leap seconds? [http://leapsecond.com/java/gpsclock.htm] ). I am all for picking as few standards as possible, and sticking to those in order to minimize the complexity and maximize the size of the community that it works for. But again, your community should feel free to do what is best for your community -- but consciously understand that in doing so you are walking away from other communities, their software tools, and their ability to work with your data (without extra effort). (I think you understand that.)

Good luck. Best wishes.