Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0
48 stars 15 forks source link

Missing data handling in boolean processes #410

Closed LukeWeidenwalker closed 1 year ago

LukeWeidenwalker commented 1 year ago

I've already brought this up in the devtelco a while ago and I said I would summarise the problem in a github issue, but was interrupted due to being on sick leave for a while - so here it goes:


Summary of the problem

The logic processes as defined by OpenEO can return True/False/null.

However, in numpy there is no sentinel value for missing data in boolean arrays (like there is with NaN for float). See this blog for a good overview to missing data handling. Also see NEP-26 for the state of this issue in the python community.

To implement and as specified, instead of this:

# Example A: What you'd expect from an implementation for a logical `and` process

logical_and = np.logical_and(x, y)

we have to do this:

# Example B: Workaround we have to use to account for null values are returned by logic processes

nan_x = np.isnan(x)
nan_y = np.isnan(y)
xy = np.logical_and(x, y)
nan_mask = np.logical_and(nan_x, xy)
xy = np.where(~nan_mask, xy, np.nan)
nan_mask = np.logical_and(nan_y, xy)
xy = np.where(~nan_mask, xy, np.nan)
return xy

Note that the output array of Example A has the dtype bool_, whereas the output array of Example B has dtype float64. This is because the missing data value np.nan is only defined for dtype float64 and thus the entire array needs to be upcast to float64.

See these numpy snippets to confirm this:

>>> np.array([True, False])
array([ True, False], dtype='bool')  # boolean array
>>> np.array([True, False]).itemsize
1  # one byte per item
>>> np.array([True, False, np.nan])
array([ 1., 0., nan], dtype='float64')  # casts to float64
>>> np.array([True, False, np.nan]).itemsize
8

Apart from looking really awkward, this also has direct performance implications, as the memory footprint of this array is multiplied by 8 and operations with float64 arrays will also be slower than on pure boolean arrays.

Here's a list of all processes where we've found this to be problematic:

Sorry for the wall of text here, but I'm extremely reluctant to deploy an 8-line implementation of a boolean primitive function that should be 1 line, where it's already clear that it causes a clear performance problem.

Proposed solutions: We cannot use numpy masked arrays to do this, because numpy ufuncs that are applied to an ndarray cannot transform their parent to a masked array. We don't want to transform all parent ndarrays to masked arrays preemptively for performance reasons.

Therefore I'd propose the following:

Looking forward to hear what people think about this!

For reference, this is the PR in which I first raised these issues: https://github.com/Open-EO/openeo-processes-dask/pull/40

cc @ValentinaHutter @danielFlemstrom @clausmichele

m-mohr commented 1 year ago

The changes required here would be relatively huge across a lot of processes, not only the boolean ones (e.g. #408). As such we need more evidence that this is not just a numpy limitation that may eventually go away, e.g. how is it handled in other implementations (e.g. WCPS, GrassGIS, Java, Scala, Julia, JS, rust, ...). Right now it seems we have:

Supported: R, geopyspark Not supported: numpy

PS: What's the conclusion from the NEP? Will missing data eventually land in Python or not? @LukeWeidenwalker

clausmichele commented 1 year ago

The numpy implementation is based on C, so I doubt the behaviour will change. I found an interesting discussion on this topic here: https://github.com/numpy/numpy/issues/2316

Anyway @LukeWeidenwalker, even if the Python code needs more lines of code to do the same as R or geopyspark, this doesn't mean it will be slower or require more memory, since probably they also do the same under the hood.

LukeWeidenwalker commented 1 year ago

Anyway @LukeWeidenwalker, even if the Python code needs more lines of code to do the same as R or geopyspark, this doesn't mean it will be slower or require more memory, since probably they also do the same under the hood.

No, it definitely does mean that, memory is allocated depending on the dtype. You can check this by firing up a clean python container with numpy, run the following examples and monitor what happens to your container memory with docker stats:

a = np.arange(0, 10e7, dtype=np.int8)
print(a.shape)
print(a)
# This takes up about 90 MB
a = np.arange(0, 10e7, dtype=np.int64)
print(a.shape)
print(a)
# This takes about 800 MB
a = np.full(10e7, True)
print(a.shape)
print(a)
# This takes about 90 MB

the int8 and bool arrays take exactly the same amount of memory (because both dtypes take 8 bytes), while the int64 array takes 8 times as much. The only difference is the dtype.

clausmichele commented 1 year ago

Yes I do not doubt that different data types have different memory footprints. What I mean is: could it be that other programming languages are also doing an intermediate step converting between types implicitly? Personally I don't know.

soxofaan commented 1 year ago

maybe I'm misunderstanding, but you seem to suggest that the problem is with the process definitions, but isn't the problem related to the input data itself?

# Example A
logical_and = np.logical_and(x, y)

if x and y are "boolean" arrays (e.g. dtype int8), there is no NaN in play anyway as there is no standardized representation for it, so nothing to worry about for the implementation of logical and.

if x and y are NaN-aware that implies that they already have dtype like float64, so again there is no need for conversion. The implementation is indeed uglier probably (ex B), but at least there is no conversion penalty.

LukeWeidenwalker commented 1 year ago

Anyway @LukeWeidenwalker, even if the Python code needs more lines of code to do the same as R or geopyspark, this doesn't mean it will be slower or require more memory, since probably they also do the same under the hood.

Oh, apologies @clausmichele - you meant that it doesn't mean that it is comparatively slower than R or geopyspark, because these might have to do similar conversions under the hood? Yes, totally agree with that!


@m-mohr

PS: What's the conclusion from the NEP? Will missing data eventually land in Python or not? @LukeWeidenwalker

The conclusion is that this NEP is 11 years old and nothing has happened yet, and I was unable to identify any currently active efforts to resolve this. So I think the probability of this problem "going away" anytime soon is rather small.

The changes required here would be relatively huge across a lot of processes, not only the boolean ones (e.g. https://github.com/Open-EO/openeo-processes/issues/408). As such we need more evidence that this is not just a numpy limitation that may eventually go away, e.g. how is it handled in other implementations (e.g. WCPS, GrassGIS, Java, Scala, Julia, JS, rust, ...). Right now it seems we have:

Supported: R, geopyspark Not supported: numpy

I get where you're coming from with this comment, but I don't think that this is the right angle to approach this from in the first place. We don't have the relevant expertise across all these tech stacks, and even if we did, I really don't think this is necessary to meaningfully resolve this question.

Three arguments for why:

LukeWeidenwalker commented 1 year ago

maybe I'm misunderstanding, but you seem to suggest that the problem is with the process definitions, but isn't the problem related to the input data itself?

# Example A
logical_and = np.logical_and(x, y)

if x and y are "boolean" arrays (e.g. dtype int8), there is no NaN in play anyway as there is no standardized representation for it, so nothing to worry about for the implementation of logical and.

if x and y are NaN-aware that implies that they already have dtype like float64, so again there is no need for conversion. The implementation is indeed uglier probably (ex B), but at least there is no conversion penalty.

Fair point, but I think that's only true for the narrow example of logical_and. Something like gt is always able to produce a boolean array with missing values, so will cause the upcasting directly.

m-mohr commented 1 year ago

Discussed in the telco today: Next step is to try internal workarounds such as MA as in EO missing values are important.