Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0
48 stars 15 forks source link

array_concat: difficulty representing heterogenous arrays in numpy #407

Closed LukeWeidenwalker closed 1 year ago

LukeWeidenwalker commented 1 year ago

Process ID: array_concat

Describe the issue: The array_concat process currently allows concatenating arrays with arbitrary datatypes. For the Numpy/Dask implementation, this is weird, because ndarrays are expected to be homogenouous (or incur heavy performance degradation if using the object type). This means that effectively there's always implicit type casting going on under the hood, which might not be what the user expects. E.g. a string array and an int64 array, when concatenated together will be cast to a Unicode type, rather than staying heterogenuous as required by the spec.

Proposed solution: Disallow heterogenuous arrays from being created at all with array_concat and throw an error if this is attempted. Attempt casting between numerical types where it's safe and throw an error if it's not.

Additional context: Illustration in numpy:

>>> import numpy as np
>>> array1 = np.array(["a", "b"]) # dtype: `<U1` (Unicode with 1 char)
>>> array2 = np.array([1, 2]) # dtype: `int64`

>>> result = np.concatenate([array1, array2])
array(['a', 'b', '1', '2'], dtype='<U21')

cc @ValentinaHutter

edzer commented 1 year ago

I think the openEO data cube assumes cell values are always numbers, while leaving implicit how they are actually implemented (uint8, int64 etc or floats). It may make sense at some stage to formalise this: certain operations should result in floats rather than ints in order to not loose too much precision (sin, log and such, but also divisions).

Another issue is of course how to deal with categorical variables, and in particular the category labels. The ["a", "b"] array should be an array with values (assuming 0-based) [0, 1] and be accompanied by category labels ["a", "b"]. Concatenating such an array with an array with no or different category labels should result in an error or be done with special care.

m-mohr commented 1 year ago

The process is in principle fine as it is, but may need a bit of a clarification, I think. For this it would be great to hear more implementors.

For numpy you could implement different levels based on the data types, right?

  1. data types are the same: great, proceed
  2. data types are not the same but can be converted easily/without loss (e.g. uint8 to int16): convert to int16
  3. data types are not the same but can not be converted easily/without loss: Throw a warning and use the object type or throw an error and abort.

I think the openEO data cube assumes cell values are always numbers

No, openEO data cubes can contain any (scalar) data type. How that's translated internally is up to the implementation, but the user can expect to not worry about this.

LukeWeidenwalker commented 1 year ago

For numpy you could implement different levels based on the data types, right?

  1. data types are the same: great, proceed
  2. data types are the not the same but can be converted easily/without loss (e.g. uint8 to int16): convert to int16
  3. data types are not the same but can not be converted easily/without loss: Throw a warning and use the object type or throw an error and abort.

Thanks for the input - yes, that sounds reasonable! Agree that this probably wants to be clarified in the process description!

jdries commented 1 year ago

The geotrellis backend also implements automatic conversion to wider data types when needed.

m-mohr commented 1 year ago

@LukeWeidenwalker What exactly should we clarify here? If it's just the data type handling, it doesn't belong into the process and I could just add it to the implementation guidelines.

LukeWeidenwalker commented 1 year ago

I was looking at https://processes.openeo.org/#array_concat:

m-mohr commented 1 year ago
  • Remove the example that concats an integer and a string array

It is allowed in openEO, you can remove it of course from your implementation if not supported.

  • In the description mention that numeric types will be cast if safe and an error will be thrown if that's not possible

openEO itself doesn't have different numeric types so it doesn't make sense in the process description. Throwing an error depends on the implementation. But I'll add something to the implementation guide.

LukeWeidenwalker commented 1 year ago

It is allowed in openEO, you can remove it of course from your implementation if not supported.

Oh, so this would be an appropriate thing to do? I was under the impression that if we expose a process, we'd need to commit to implement all facets of it? If that's not the case, we'll just start doing that wherever we run into limitations with the underlying technologies we're using, and log appropriate errors to inform users.

m-mohr commented 1 year ago

openEO itself is indeed flexible and we have these schema-based processes so that you can customize them if needed. This works best if the change is in the schematic part and not in the descriptions so that clients can read and handle them.

In this case, it's just an example, which means it's less "prominent" to users, but it doesn't necessarily imply to a client/user that mixed types are (not) allowed. You can only mention in the description that mixed types are not allowed.

The other topic is of course what openEO Platform requires in the federation contract and here it might be more strict, especially if at some point a test suite comes into play.

Still, I'll make a small change to the process examples so that we don't just have a mixed type example.

m-mohr commented 1 year ago

@LukeWeidenwalker Here's an updated version: https://github.com/Open-EO/openeo-processes/commit/db242a887c313e7ba9b28440da98d390e6f986a9 I'd propose you try to convert on a best-effort basis and otherwise throw an error. But it's still good to bring up implementation issues whenever you run into issues as we may solve some of them!