Missing data handling in boolean processes

LukeWeidenwalker commented 1 year ago

I've already brought this up in the devtelco a while ago and I said I would summarise the problem in a github issue, but was interrupted due to being on sick leave for a while - so here it goes:

Summary of the problem

The logic processes as defined by OpenEO can return True/False/null.

This is fine in R, because the NA value is deeply integrated throughout the language
According to @jdries here it's fine in geotrellis, because there's missing data values for all dtypes

However, in numpy there is no sentinel value for missing data in boolean arrays (like there is with NaN for float). See this blog for a good overview to missing data handling. Also see NEP-26 for the state of this issue in the python community.

To implement and as specified, instead of this:

# Example A: What you'd expect from an implementation for a logical `and` process

logical_and = np.logical_and(x, y)

we have to do this:

# Example B: Workaround we have to use to account for null values are returned by logic processes

nan_x = np.isnan(x)
nan_y = np.isnan(y)
xy = np.logical_and(x, y)
nan_mask = np.logical_and(nan_x, xy)
xy = np.where(~nan_mask, xy, np.nan)
nan_mask = np.logical_and(nan_y, xy)
xy = np.where(~nan_mask, xy, np.nan)
return xy

Note that the output array of Example A has the dtype bool_, whereas the output array of Example B has dtype float64. This is because the missing data value np.nan is only defined for dtype float64 and thus the entire array needs to be upcast to float64.

See these numpy snippets to confirm this:

>>> np.array([True, False])
array([ True, False], dtype='bool')  # boolean array
>>> np.array([True, False]).itemsize
1  # one byte per item

>>> np.array([True, False, np.nan])
array([ 1., 0., nan], dtype='float64')  # casts to float64
>>> np.array([True, False, np.nan]).itemsize
8

Apart from looking really awkward, this also has direct performance implications, as the memory footprint of this array is multiplied by 8 and operations with float64 arrays will also be slower than on pure boolean arrays.

Here's a list of all processes where we've found this to be problematic:

and/not/or/xor
any/all
eq/neq
between
gt/gte/lt/lte
text_begins/text_contains/text_ends

Sorry for the wall of text here, but I'm extremely reluctant to deploy an 8-line implementation of a boolean primitive function that should be 1 line, where it's already clear that it causes a clear performance problem.

Proposed solutions: We cannot use numpy masked arrays to do this, because numpy ufuncs that are applied to an ndarray cannot transform their parent to a masked array. We don't want to transform all parent ndarrays to masked arrays preemptively for performance reasons.

Therefore I'd propose the following:

The processes should not implicitly assume that nodata is supported for all datatypes.
Boolean functions should return True or False, anything else isn't a boolean function. Nodata can be treated as false. On a sidenote, this is already what the if process does according to example 2.

Looking forward to hear what people think about this!

For reference, this is the PR in which I first raised these issues: https://github.com/Open-EO/openeo-processes-dask/pull/40

cc @ValentinaHutter @danielFlemstrom @clausmichele

m-mohr commented 1 year ago

The changes required here would be relatively huge across a lot of processes, not only the boolean ones (e.g. #408). As such we need more evidence that this is not just a numpy limitation that may eventually go away, e.g. how is it handled in other implementations (e.g. WCPS, GrassGIS, Java, Scala, Julia, JS, rust, ...). Right now it seems we have:

Supported: R, geopyspark Not supported: numpy

PS: What's the conclusion from the NEP? Will missing data eventually land in Python or not? @LukeWeidenwalker

clausmichele commented 1 year ago

The numpy implementation is based on C, so I doubt the behaviour will change. I found an interesting discussion on this topic here: https://github.com/numpy/numpy/issues/2316

Anyway @LukeWeidenwalker, even if the Python code needs more lines of code to do the same as R or geopyspark, this doesn't mean it will be slower or require more memory, since probably they also do the same under the hood.

LukeWeidenwalker commented 1 year ago

Anyway @LukeWeidenwalker, even if the Python code needs more lines of code to do the same as R or geopyspark, this doesn't mean it will be slower or require more memory, since probably they also do the same under the hood.

No, it definitely does mean that, memory is allocated depending on the dtype. You can check this by firing up a clean python container with numpy, run the following examples and monitor what happens to your container memory with docker stats:

a = np.arange(0, 10e7, dtype=np.int8)
print(a.shape)
print(a)
# This takes up about 90 MB

a = np.arange(0, 10e7, dtype=np.int64)
print(a.shape)
print(a)
# This takes about 800 MB

a = np.full(10e7, True)
print(a.shape)
print(a)
# This takes about 90 MB

the int8 and bool arrays take exactly the same amount of memory (because both dtypes take 8 bytes), while the int64 array takes 8 times as much. The only difference is the dtype.

clausmichele commented 1 year ago

Yes I do not doubt that different data types have different memory footprints. What I mean is: could it be that other programming languages are also doing an intermediate step converting between types implicitly? Personally I don't know.

soxofaan commented 1 year ago

maybe I'm misunderstanding, but you seem to suggest that the problem is with the process definitions, but isn't the problem related to the input data itself?

# Example A
logical_and = np.logical_and(x, y)

if x and y are "boolean" arrays (e.g. dtype int8), there is no NaN in play anyway as there is no standardized representation for it, so nothing to worry about for the implementation of logical and.

if x and y are NaN-aware that implies that they already have dtype like float64, so again there is no need for conversion. The implementation is indeed uglier probably (ex B), but at least there is no conversion penalty.

LukeWeidenwalker commented 1 year ago

Anyway @LukeWeidenwalker, even if the Python code needs more lines of code to do the same as R or geopyspark, this doesn't mean it will be slower or require more memory, since probably they also do the same under the hood.

Oh, apologies @clausmichele - you meant that it doesn't mean that it is comparatively slower than R or geopyspark, because these might have to do similar conversions under the hood? Yes, totally agree with that!

@m-mohr

PS: What's the conclusion from the NEP? Will missing data eventually land in Python or not? @LukeWeidenwalker

The conclusion is that this NEP is 11 years old and nothing has happened yet, and I was unable to identify any currently active efforts to resolve this. So I think the probability of this problem "going away" anytime soon is rather small.

The changes required here would be relatively huge across a lot of processes, not only the boolean ones (e.g. https://github.com/Open-EO/openeo-processes/issues/408). As such we need more evidence that this is not just a numpy limitation that may eventually go away, e.g. how is it handled in other implementations (e.g. WCPS, GrassGIS, Java, Scala, Julia, JS, rust, ...). Right now it seems we have:

Supported: R, geopyspark Not supported: numpy

I get where you're coming from with this comment, but I don't think that this is the right angle to approach this from in the first place. We don't have the relevant expertise across all these tech stacks, and even if we did, I really don't think this is necessary to meaningfully resolve this question.

Three arguments for why:

Numpy is not fringe: I don't think it's a stretch to claim that numpy is one of the most (perhaps even the most) widely used reference implementation of ndarrays for scientific computing even across ecosystems. If this is not in numpy, that's already strong evidence that missing values for any datatype (especially booleans) isn't a feature that you can just expect to be supported in any given ndarray library.
- Note: This argument could be refuted by someone with deep understanding of numpy internals saying "No, this is just numpy being bad in this regard for legacy reasons".
Performance: We can also look at this from a theoretical perspective (again, see this blog for a summary). How would you implement a representation of missing values in arrays from primitive datatypes? You can either designate a sentinel value, i.e. a specific bit pattern (e.g. -9999, np.nan), to signify that a value is missing. Afaict this doesn't make sense for booleans (because which bitpattern would you use?) and might just not be implemented in a given stack. The second option is to build a mask over the original array that keeps track of which elements were missing. This must have worse performance characteristics than working with the raw arrays, because the mask both takes up memory and needs to be checked against when doing computations. So allowing missing values for any datatype
- Note: This argument could be refuted by someone pointing out that I've misunderstood how this works under the hood.
Is this even useful? What does it even mean for my boolean function to return anything other than True or False? I (and everyone I've spoken to about it) totally did not expect this behaviour when learning about it. I assume that the main usage pattern of these functions is to create masks. I can't think of a case where I'd want to have a mask where some values are undefined. I want my mask to say for every pixel whether it's in our out. The other usage pattern is using these processes in the if process. In if, nodata is treated as false, so this information is ignored again anyways.
- Note: This argument could be refuted by providing a real-life example of a usage pattern I haven't considered.

LukeWeidenwalker commented 1 year ago

maybe I'm misunderstanding, but you seem to suggest that the problem is with the process definitions, but isn't the problem related to the input data itself?
# Example A
logical_and = np.logical_and(x, y)
if x and y are "boolean" arrays (e.g. dtype int8), there is no NaN in play anyway as there is no standardized representation for it, so nothing to worry about for the implementation of logical and.

if x and y are NaN-aware that implies that they already have dtype like float64, so again there is no need for conversion. The implementation is indeed uglier probably (ex B), but at least there is no conversion penalty.

Fair point, but I think that's only true for the narrow example of logical_and. Something like gt is always able to produce a boolean array with missing values, so will cause the upcasting directly.

m-mohr commented 1 year ago

Discussed in the telco today: Next step is to try internal workarounds such as MA as in EO missing values are important.

Open-EO / openeo-processes

Missing data handling in boolean processes #410