Closed LukeWeidenwalker closed 1 year ago
The changes required here would be relatively huge across a lot of processes, not only the boolean ones (e.g. #408). As such we need more evidence that this is not just a numpy limitation that may eventually go away, e.g. how is it handled in other implementations (e.g. WCPS, GrassGIS, Java, Scala, Julia, JS, rust, ...). Right now it seems we have:
Supported: R, geopyspark. Not supported: numpy.
PS: What's the conclusion from the NEP? Will missing data eventually land in Python or not? @LukeWeidenwalker
The numpy implementation is based on C, so I doubt the behaviour will change. I found an interesting discussion on this topic here: https://github.com/numpy/numpy/issues/2316
Anyway @LukeWeidenwalker, even if the Python code needs more lines of code to do the same as R or geopyspark, this doesn't mean it will be slower or require more memory, since probably they also do the same under the hood.
No, it definitely does mean that, memory is allocated depending on the dtype.
You can check this by firing up a clean Python container with numpy, running the following examples, and monitoring what happens to your container memory with `docker stats`:

```python
import numpy as np

a = np.arange(0, 10**8, dtype=np.int8)
print(a.shape)
print(a)
# This takes up about 90 MB

a = np.arange(0, 10**8, dtype=np.int64)
print(a.shape)
print(a)
# This takes about 800 MB

a = np.full(10**8, True)  # np.full needs an integer shape; a float like 10e7 raises a TypeError
print(a.shape)
print(a)
# This takes about 90 MB
```
The `int8` and `bool` arrays take exactly the same amount of memory (because both dtypes take 1 byte, i.e. 8 bits, per element), while the `int64` array takes 8 times as much. The only difference is the dtype.
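As a quicker check that doesn't require monitoring a container, `ndarray.nbytes` reports the size of the underlying buffer directly:

```python
import numpy as np

# ndarray.nbytes is the exact size of the data buffer in bytes
print(np.arange(0, 10**8, dtype=np.int8).nbytes)   # 100000000 (~95 MiB)
print(np.arange(0, 10**8, dtype=np.int64).nbytes)  # 800000000 (~763 MiB)
print(np.full(10**8, True).nbytes)                 # 100000000 (~95 MiB)
```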
Yes, I don't doubt that different data types have different memory footprints. What I mean is: could it be that other programming languages also do an intermediate step, converting between types implicitly? Personally, I don't know.
Maybe I'm misunderstanding, but you seem to suggest that the problem is with the process definitions. Isn't the problem related to the input data itself?
```python
# Example A
logical_and = np.logical_and(x, y)
```

If `x` and `y` are "boolean" arrays (e.g. dtype `int8`), there is no NaN in play anyway, as there is no standardized representation for it, so there is nothing to worry about for the implementation of `logical_and`. If `x` and `y` are NaN-aware, that implies that they already have a dtype like `float64`, so again there is no need for conversion. The implementation is probably uglier (ex B), but at least there is no conversion penalty.
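To illustrate why a null-aware implementation has to differ from the plain ufunc: `np.logical_and` treats NaN as truthy and returns a plain bool array, so any null information in float inputs is silently lost:

```python
import numpy as np

x = np.array([1.0, np.nan, 0.0])
y = np.array([1.0, 1.0, np.nan])
res = np.logical_and(x, y)
print(res)        # [ True  True False] -- NaN counted as True, nulls are gone
print(res.dtype)  # bool
```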
> Anyway @LukeWeidenwalker, even if the Python code needs more lines of code to do the same as R or geopyspark, this doesn't mean it will be slower or require more memory, since probably they also do the same under the hood.
Oh, apologies @clausmichele - you meant that it doesn't mean that it is comparatively slower than R or geopyspark, because these might have to do similar conversions under the hood? Yes, totally agree with that!
@m-mohr

> PS: What's the conclusion from the NEP? Will missing data eventually land in Python or not? @LukeWeidenwalker
The conclusion is that this NEP is 11 years old and nothing has happened yet, and I was unable to identify any currently active efforts to resolve this. So I think the probability of this problem "going away" anytime soon is rather small.
> The changes required here would be relatively huge across a lot of processes, not only the boolean ones (e.g. https://github.com/Open-EO/openeo-processes/issues/408). As such we need more evidence that this is not just a numpy limitation that may eventually go away, e.g. how is it handled in other implementations (e.g. WCPS, GrassGIS, Java, Scala, Julia, JS, rust, ...). Right now it seems we have:
>
> Supported: R, geopyspark. Not supported: numpy.
I get where you're coming from with this comment, but I don't think that this is the right angle to approach this from in the first place. We don't have the relevant expertise across all these tech stacks, and even if we did, I really don't think this is necessary to meaningfully resolve this question.
Three arguments for why:
One option is to use a sentinel value (e.g. `-9999`, `np.nan`) to signify that a value is missing. Afaict this doesn't make sense for booleans (because which bitpattern would you use?) and might just not be implemented in a given stack. The second option is to build a mask over the original array that keeps track of which elements were missing. This must have worse performance characteristics than working with the raw arrays, because the mask both takes up memory and needs to be checked against when doing computations. So allowing missing values for any datatype comes at a cost either way.

Consider also the `if` process: in `if`, nodata is treated as false, so this information is ignored again anyways.
> Maybe I'm misunderstanding, but you seem to suggest that the problem is with the process definitions. Isn't the problem related to the input data itself?
>
> ```python
> # Example A
> logical_and = np.logical_and(x, y)
> ```
>
> If `x` and `y` are "boolean" arrays (e.g. dtype `int8`), there is no NaN in play anyway, as there is no standardized representation for it, so there is nothing to worry about for the implementation of `logical_and`. If `x` and `y` are NaN-aware, that implies that they already have a dtype like `float64`, so again there is no need for conversion. The implementation is probably uglier (ex B), but at least there is no conversion penalty.
Fair point, but I think that's only true for the narrow example of `logical_and`. Something like `gt` is always able to produce a boolean array with missing values, so it will cause the upcasting directly.
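As a sketch of the `gt` case (the helper below is hypothetical, assuming the openEO convention that a comparison on null returns null): the plain numpy comparison yields a bool array in which NaN simply compares as False, so restoring the nulls forces an upcast to `float64`.

```python
import numpy as np

def gt(x, y):
    """Hypothetical sketch: element-wise x > y that returns NaN where x is null."""
    x = np.asarray(x, dtype=np.float64)
    res = (x > y).astype(np.float64)  # bool result, NaN positions come out False
    res[np.isnan(x)] = np.nan         # restore nulls -- requires float64
    return res

print(gt(np.array([1.0, np.nan, 3.0]), 2.0))  # [ 0. nan  1.]
```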
Discussed in the telco today: next step is to try internal workarounds such as masked arrays (MA), since missing values are important in EO.
I've already brought this up in the devtelco a while ago and I said I would summarise the problem in a github issue, but was interrupted due to being on sick leave for a while - so here it goes:
Summary of the problem
The logic processes as defined by OpenEO can return True/False/null. In R, for instance, the `NA` value is deeply integrated throughout the language. However, in numpy there is no sentinel value for missing data in boolean arrays (like there is with NaN for float). See this blog for a good overview of missing data handling. Also see NEP-26 for the state of this issue in the Python community.
To implement `and` as specified, instead of the one-line version (Example A), we have to use the longer null-aware version (Example B).
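The original snippets did not survive this thread export; below is a minimal sketch of what the null-aware Example B could look like. The function name and exact form are assumptions, following the openEO `and` rule that false wins over null while `True and null` stays null:

```python
import numpy as np

# Example A: one line, but not null-aware
# logical_and = np.logical_and(x, y)

def nan_logical_and(x, y):
    """Sketch of a null-aware AND (three-valued logic):
    False AND anything -> False; True AND null -> null."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    res = np.logical_and(x == 1, y == 1).astype(np.float64)
    # the result is unknown if either side is null and the other is not False
    unknown = (np.isnan(x) & (y != 0)) | (np.isnan(y) & (x != 0))
    res[unknown] = np.nan
    return res

print(nan_logical_and([1.0, 0.0, np.nan, 1.0], [1.0, np.nan, np.nan, np.nan]))
# [ 1.  0. nan nan]
```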
Note that the output array of Example A has the dtype `bool_`, whereas the output array of Example B has dtype `float64`. This is because the missing data value `np.nan` is only defined for dtype `float64`, and thus the entire array needs to be upcast to `float64`. See these numpy snippets to confirm this:
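The referenced snippets were lost in this export; a minimal check along the same lines:

```python
import numpy as np

print(np.array([True, False]).dtype)          # bool
print(np.array([True, False, np.nan]).dtype)  # float64 -- one NaN forces the upcast
print(np.bool_(True).nbytes, np.float64(1.0).nbytes)  # 1 8 -- per-element sizes
```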
Apart from looking really awkward, this also has direct performance implications, as the memory footprint of this array is multiplied by 8, and operations on `float64` arrays will also be slower than on pure boolean arrays.

Here's a list of all processes where we've found this to be problematic:
Sorry for the wall of text here, but I'm extremely reluctant to deploy an 8-line implementation of a boolean primitive function that should be 1 line, when it's already clear that it causes a performance problem.
Proposed solutions: We cannot use numpy masked arrays to do this, because numpy ufuncs that are applied to an ndarray cannot transform their parent to a masked array. We don't want to transform all parent ndarrays to masked arrays preemptively for performance reasons.
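For context on the masked-array option (a sketch): masks keep the bool dtype and carry the null information separately, and they propagate through ufuncs, but only when the operands are already `MaskedArray`s, which is why a preemptive conversion would otherwise be needed:

```python
import numpy as np

x = np.ma.masked_array([True, False, True], mask=[False, False, True])
y = np.ma.masked_array([True, True, True],  mask=[False, False, False])
res = np.logical_and(x, y)
print(res)        # [True False --]   -- the mask carries the null
print(res.dtype)  # bool              -- no upcast to float64
```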
Therefore I'd propose the following: … the `if` process does according to example 2.

Looking forward to hearing what people think about this!
For reference, this is the PR in which I first raised these issues: https://github.com/Open-EO/openeo-processes-dask/pull/40
cc @ValentinaHutter @danielFlemstrom @clausmichele