Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0
48 stars 15 forks source link

order: clarify ordering of ties #409

Closed LukeWeidenwalker closed 1 year ago

LukeWeidenwalker commented 1 year ago

Process ID: order

Describe the issue: The simplest implementation to support the asc parameter is to simply flip the input array before running argsort if descending order is desired. However, the order process currently requires that Ties will be left in their original ordering, regardless of whether asc=True or asc=False. This means that we cannot simply flip the ~input array~ array with the ordered indices, because ties would be flipped too. This makes the implementation more unwieldy than it has to be.

Proposed solution: Change this requirement such that ties will be in the original ordering if asc=True and flipped otherwise.

Additional context: See here for initial discussion

m-mohr commented 1 year ago

@soxofaan @dthiex Do you know how your implementations handle this?

soxofaan commented 1 year ago

we currently use this implementation from the legacy openeo-processes-python repo: https://github.com/Open-EO/openeo-processes-python/blob/f57c663e8708061a8222851a8aec2762ac46b0bb/src/openeo_processes/arrays.py#L745-L817

which behaves like the current spec if I understand correctly:

import numpy as np

data = np.array([1, 2, 2, 3])
print(data)
# ---> [1 2 2 3]

# Ascending
print(numpy.argsort(data, kind="mergesort"))
# ---> [0 1 2 3]

# Descending
print(numpy.argsort(-data, kind="mergesort"))
# ---> [3 1 2 0]
soxofaan commented 1 year ago

The simplest implementation to support the asc parameter is to simply flip the input array before running argsort if descending order is desired. However, the order process currently requires that Ties will be left in their original ordering, regardless of whether asc=True or asc=False. This means that we cannot simply flip the input array, because ties would be flipped too.

maybe I'm misunderstanding, but if you mean flipping the sign of the input data with "flipping the array", I think the descending mode works just fine with numpy.argsort as illustrated above

LukeWeidenwalker commented 1 year ago

Apologies, typo in my description:
I mean after applying argsort, flipping (aka reversing) the array with the ordered indices with np.flip to get the opposite ordering.

Flipping the sign only works on numerical arrays, but will fail on string/datetime arrays.

m-mohr commented 1 year ago

If it works with argsort in the old implementation, why do we need a change here? Can you give some examples, @LukeWeidenwalker?

LukeWeidenwalker commented 1 year ago

Adding the minus-sign like in the old implementation only works on numerical arrays: -data

You cannot do this on e.g. string arrays, it will error out. For those other arrays, you can still get the sorting indices with argsort and flip that array if you want it descending. However, then the order of ties is flipped too - hence this issue!

Depending on your answer to this comment: Other than changing this in the spec, we can of course just not support anything other than numerical arrays in our order process and throw descriptive errors on why this isn't possible. Don't think that's the best option for users though, because that will only be reportable at runtime.

m-mohr commented 1 year ago

Order allows number|null|date-time:string|date:string|time:string - All the strings are temporal, so I'd assume internally they could be timestamps (i.e. numerical)? Can you use the - thing on some kind of date/time objects in Python?

Alternatively, just remove the strings from the schema (or whatever you don't support). This is reasonable, I think.

dthiex commented 1 year ago

Do you know how your implementations handle this?

In our backend we only support number, null or a valid ISO date string as inputs. We then use the javascript sort function which handles ties as described in the process documentation (= Stefaan's example).

LukeWeidenwalker commented 1 year ago

Order allows number|null|date-time:string|date:string|time:string - All the strings are temporal, so I'd assume internally they could be timestamps (i.e. numerical)? Can you use the - thing on some kind of date/time objects in Python?

No, just tried again, this doesn't work!

we currently use this implementation from the legacy openeo-processes-python repo

@soxofaan so how does your process handle datetime strings then?

soxofaan commented 1 year ago

with strings we don't do date interpretation, but that's usually not a problem with ISO8601

A string array is eventually passed to np.argsort and:

A mixed array (strings and numbers) also fails in ascending mode with:

[500] Internal: Server error: TypeError("'<' not supported between instances of 'str' and 'int'")

I don't think we ever had someone complaining about this and requiring proper string support for the order process, so it's not high on our priority list currently

m-mohr commented 1 year ago

It seems it's hard to really come to a conclusion here with the variability of implementations. There are a couple of options:

  1. Implementation: Throw an error if not supported
  2. Implementation: Remove string-types from the JSON Schema
  3. Implementation: Remove the ask parameter from the JSON Schema
  4. Specification: Drop the "asc" parameter in "order" and add a "array_reverse" process in 2.0. I assume the ties issue is not really a thing in "sort"?
  5. Specification: Clarify that "order" always sorts in ascending order and the reverse is only applied afterwards if the "asc" parameter is set to false. This is likely the most interoperable (but slowest) solution as it's a two-step approach that I assume everyone could follow.

The simpliest part would probably be 5, but this would probably lead to a change in SH, right?

with strings we don't do date interpretation, but that's usually not a problem with ISO8601

@soxofaan fyi: The spec says: "Temporal strings can not be compared based on their string representation due to the time zone/time-offset representations."

soxofaan commented 1 year ago

with strings we don't do date interpretation, but that's usually not a problem with ISO8601 @soxofaan fyi: The spec says: "Temporal strings can not be compared based on their string representation due to the time zone/time-offset representations."

yes I know, I just mean that is practically not a problem in VITO backend because we usually only work at date resolution (and if there is a time component we'd always use UTC)

m-mohr commented 1 year ago

Proposal: Replace "Ties will be left in their original ordering." with "The ordering of ties is implementation-dependent."

m-mohr commented 1 year ago

Changed in https://github.com/Open-EO/openeo-processes/commit/dcd3b8a34f68a9539600dc533158cef0c86f4452