Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0
48 stars 15 forks source link

Clarify temporal intervals #331 #394

Closed m-mohr closed 1 year ago

m-mohr commented 1 year ago

It seems the implementations tend more towards the approach that the two elements in the intervals should not be the same, so I created a PR that proposes this breaking change for v2.0.0.

Fixes #331

m-mohr commented 1 year ago

@clausmichele Good question. I had to think quite a bit about it and I can't remember the exact reasons we had back then, but it clearly originates from ISO8601, but it seems they are also struggling with it:

An amendment was published in October 2022 featuring minor technical clarifications and attempts to remove ambiguities in definitions. The most significant change, however, was the reintroduction of the 24:00:00 format to refer to the instant at the end of a calendar day.

So it was there in the beginning, got removed and then added back again. I assume their reasons for allowing 24 also apply here.

Without knowing their reasoning (ISO standards must be bought 🤮), I found some edge cases where it may matter and surprise users:

Has someone access to the ISO documents? I'd like to read about the 24:00:00 changes because I agree that ideally we could get rid of this special handling.

clausmichele commented 1 year ago

@m-mohr were you able to access the ISO documents?

m-mohr commented 1 year ago

Yes! Thanks to @mkadunc for pointing me into the right direction and also providing his point of view.

24 as the hour

T24:00:00 is valid and behaves the same as the next day's T00:00:00 in ISO8601.

But: I just realized that JSON Schema for the format date-time explicitly mentiones RFC3339 compliance, which disallows 24. So some schema validators may fail on T24:00:00. So we'd either need to remove the format from the schema (and loose automatic validation) or comply and disallowe 24.

Incl./excl. upper boundary

The definition of intervals is including the upper boundary, but other parts of the standard allow excluding the upper boundary, so here the standard is not very stringent and helpful. To avoid breaking everyones implementations, we should probably keep excluding it. Otherwise, users could get different results as now there's an additional day included.

Missing parts of a date-time (e.g. a date)

In ISO8601 it seems that omitted time components should simply be replaced by zero (2022-12-14 is basically 2022-12-14T00:00:00).

Conclusion

Well, it's hard to come to a conclusion here as there's so many different aspects to consider, but we likely need to "break" someones implementation as VITO seems to be contraty to all other implementations. So to come up with a precise definition, someone needs to bite the bullet anyway.

We could go the following route:

clausmichele commented 1 year ago

For me it would also be fine to change the API and make the upper boundary included, since it caused many headaches in the past and it would be easier to understand. However, we need to clarify it in the best way.

If we change the definition I see mostly two cases: Case 1: I provide a complete range like ["2020-01-01T00:00:00Z","2020-01-02T00:00:00Z""] no surprises, I should know what I am asking for and I will also get data at 2020-01-02T00:00:00Z if there's some. -> all good Case 2. I provide just the date ["2020-01-01","2020-01-02"]: I would expect to get data for the 2nd of January as well if we say that the upper boundary is included, but instead the range gets converted with zeros and therefore the data on the 2nd is excluded. -> we have to clarify this to the users

m-mohr commented 1 year ago

For me it would also be fine to change the API and make the upper boundary included, since it caused many headaches in the past and it would be easier to understand. However, we need to clarify it in the best way.

As said above, I don't think this is a good idea. Changing this would make data unreproducible as the now included day would intefere. Results would change without users noticing it.

clausmichele commented 1 year ago

Well, all the users using the VITO back-end wouldn't notice any difference actually, since they already return the upper bound as well. Maybe it's something to discuss in the next dev telco?

mkadunc commented 1 year ago
* Clarify that omitting a component is equivalent to setting it to 0

Is it possible to define a union type for the date-time, date and year, where we could then describe the relationships? Or would we add this clarification to all the process parameters where this union type is used?

* We may also help in client implementations by just allowing to pass a single string (e.g. "2020-01-01") or a corresponding "Date object", which gets transformed internally into a `["2020-01-01T00:00:00Z","2020-01-02T00:00:00Z""]`.

I agree — either overload the process to allow a single parameter, or provide a helper function such as single_day_to_interval("2020-01-01").

* In `aggregate_temporal` we allow just times (without dates) and there it seems impossible to specify the full range as 23:59:59 would be excluded and 24:00:00 is not allowed. We should point users at `null`, which allows `["00:00:00", null]` to cover the full day.

Another option would be to speficy "the full range" with two identical instants - [00:00:00, 00:00:00] would be 24 hours midnight-to-midnight and [06:00:00, 06:00:00] would be 24 hours 6am to 6am. The second example wouldn't be fixed by allowing 24 as the hour. But this is assuming that we want to allowing the time intervals to span more than one calendar day (and I don't know enough about aggregate_temporal use-cases using only time, so I don't know whether this would be needed).

m-mohr commented 1 year ago

@clausmichele But there's not only the VITO implementation. We can change a lot here, but I think the exclusive upper bounds are somewhat set in stone for now due to the reproducibility issue...

Is it possible to define a union type for the date-time, date and year, where we could then describe the relationships? Or would we add this clarification to all the process parameters where this union type is used?

@mkadunc We could, but it would change the schemas quite a bit and make them more complex so I'd try to avoid that. But we have the temporal-interval subtype which we can use and then it's "only" the comparison operators which may need a clarifying word in the parameter description, I think. Although, should eq("2018", "2018-01-01T00:00:00Z") return true or false?

Another option would be to speficy "the full range" with two identical instants - [00:00:00, 00:00:00]

Hmm... something to look into, the aggregate_temporal spec is quite old and I think there's no actual user (and implementation?) for the time part right now.

m-mohr commented 1 year ago

The more you go through the processes to make changes, the worse it gets. I'm looking at between right now, which defines min and max parameters, which can be date-time, date or time. min and max are inclusive. So if I specify between("2010", "2020") right now I'd expect that everything from 2010-01-01T00:00:00Z to 2020-12-31T23:59:59Z (both inclusive) is returning true, right? If I now add the "missing components are 0 (time)/1 (date)" part, this gets ugly as the "included" part of 2020 is really just the "first instance" (~ the first millisecond) in 2020.

mkadunc commented 1 year ago

Although, should eq("2018", "2018-01-01T00:00:00Z") return true or false?

If we want to stick to JSON as the only true context for openEO types, then we have to go with the fact that these are all JSON strings and therefore return false.

If we want, we could define date and time types for openEO in more detail, in which case we could either go with "date- or time-like JSON strings represent intervals", like this:

... or we could specify that "all date- or time-like JSON strings represent single instants" (in line with ISO 8601 interpretation), in which case we have:

Intuitively, the first option would be more correct, but it does add some complexity to the whole system (all date- and time-like parameters would be intervals rather than single scalars)

m-mohr commented 1 year ago

There's a proposal up to simplify the comparison processes in PR #399 so that we (hopefully) don't need to bother about them in this issue/PR any longer.

m-mohr commented 1 year ago

Ready for review!

m-mohr commented 1 year ago

Has anyone the capacity to review this? @dthiex @soxofaan @clausmichele @mkadunc @LukeWeidenwalker Thanks!