Unequal treatment of np.NaNs with operators

fkroeber commented 2 months ago

Description

When evaluating the operators, np.NaNs are treated unevenly, depending on whether they occur in the array x or in the comparison array y. This is due to the use of np.where(pd.notnull(x)...) without the complementary np.where(pd.notnull(y)...) in the operators.py, where one-sided NaNs are retained if they occur in x, but not if they occur in y. The consequence of this is, for example, a violation of the commutativity of Boolean operators (see MRE below).

Reproducible example

import json
import geopandas as gpd
import semantique as sq
from semantique.processor.core import QueryProcessor

# define recipe
recipe = sq.QueryRecipe()

recipe["res_a"] = sq.reflectance("s2_band02").\
    filter_time("before", sq.time_instant("2019-12-31")).\
    evaluate("or", sq.reflectance("s2_band02"))

recipe["res_b"] = sq.reflectance("s2_band02").\
    evaluate("or", sq.reflectance("s2_band02").filter_time("before", sq.time_instant("2019-12-31")))

# create context
with open("files/mapping.json", "r") as file:
    mapping = sq.mapping.Semantique(json.load(file))

with open("files/layout_gtiff.json", "r") as file:
    dc = sq.datacube.GeotiffArchive(json.load(file), src = "files/layers_gtiff.zip")

space = sq.SpatialExtent(gpd.read_file("files/footprint.geojson"))
time = sq.TemporalExtent("2019-01-01", "2020-12-31")

context = {
    "datacube": dc, 
    "mapping": mapping,
    "space": space,
    "time": time,
    "crs": 3035, 
    "tz": "UTC", 
    "spatial_resolution": [-10, 10],
    "track_types": False
}

# execute recipe
fp = QueryProcessor.parse(recipe, **context)
response = fp.optimize().execute()

# evaluate equivalance of both results -> leads to False
response["res_a"].equals(response["res_b"])

Expected behavior

The correctness of the operator algebra should be ensured by consistent handling of NaNs regardless of whether they occur in x or y.

Proposed solution

Replacing the following operator defintions

def f(x, y):
   y = utils.null_as_zero(y)
   return np.where(pd.notnull(x), np.logical_or(x, y), np.nan)

with

def f(x, y):
  return np.where(pd.notnull(x) & pd.notnull(y), np.logical_or(x, y), np.nan)

luukvdmeer commented 2 months ago

If I remember well, this comes down to a more abstract discussion of what nan values really mean and how they should be treated in comparisons. See also: https://github.com/ZGIS/semantique/discussions/15.

The operators are applied to the observations of the first array (x), and may, as is the case with the boolean operators, involve a comparison to the same observation location (in space and time) in a second array (y). So this is designed with some hierarchy in mind. The function is applied to each existing observation in x, not only to each spatio-temporal location that has an existing observation in both x and y (as you would imply when using np.where(pd.notnull(x) & pd.notnull(y), ...). So basically semantique follows the "left-join logic" and not the "inner-join logic".

Now what to do when an observation in x has no corresponding observation in y? Then you could argue that x AND y is by definition false, because there is simply no observation at that location in y (but theoretically there could have been!), hence they are not both true. This is what semantique does, by converting each nan in y to False (do note here that for some reason numpy treats nan values as True). The benefit of this is that the shape of the output array is guaranteed to be equal to the shape of the input array (which is x). Of course the "hierarchy" that semantique uses means that x AND y is not necessarily equal to y AND x, because the operator is applied to each observation in the first array, and the locations of the observations in x and y may not match. Which for programmers I can understand seems unacceptable.

This was designed years ago so I am very open to any change to be made to it, given that it is well thought through what the benefits and implications are :) I would say that we should look at this considering what makes sense for the user that semantique targets, which may not be necessarily the same as what makes sense for a programmer.

(PS It may also be I remembered everything wrong and this all does not make sense :laughing: )

luukvdmeer commented 2 months ago

Sorry, I replied a bit to fast saying a lot of stuff that does not make sense at all ;) As semantique treats it now, it seems indeed to be a bug in any case. We do have the issue with nan values having different meanings, see https://github.com/ZGIS/semantique/discussions/15. If a True observations in x has no corresponding observation in y, the result of the AND operator should intuitively be something as "unknown". But we do not have a way yet to distinguish between things like "the value of this observation is unknown" and "there was no observation at all done at this location". Thoughts?

fkroeber commented 2 months ago

Even though I do see the issue of having multiple semantics for NaN values (meaning that they could represent missing, unknown or invalid figures), I'm not sure if this is related to the current problem or should be seen as something that is rather independent. I don't see how the unequal treatment of NaN depending if they occur in x or y is currently contributing to resolving the different NaN semantics. I also doubt that there is any potential way of leveraging the NaN treatment within the operators module to resolve the NaN semantics issue even if we would restructure it. Two considerations that justify my point of view:

1) If we wanted to use the different NaN treatment to express the semantic differences, we would first have to consider which NaN semantics should result in which operator behaviour. In concrete terms for the current handling of NaN values: What meaning must NaN values in the arrays x and y have so that pulling the NaN values from array x and possibly suppressing the NaN values from y as currently practised corresponds to the meaning of the NaN values and their combination. So NaN values in array y should always be unknown values and are therefore overwritten? If a coherent consensus were to be reached on such or other NaN evaluation behaviour, this semantics-specific handling of NaN values would have to be clearly documented and communicated to the user.

2) Building on point 1: Even if there is a consensus on the meaning of the NaN values in the arrays x and y, the current implementation and any changes to this implementation are unsuitable for correctly implementing such a consensus.

Since NaN values are always evaluated as Boolean true in y, they inevitably have a different semantic meaning depending on the operator (cf. examples in the context of the Boolean bivariate operators or and xor). A semantically desired evaluation behaviour such as ‘ignore the values in y’ can therefore not be achieved with a blanket np.where(pd.notnull(x), <operator>, np.nan).
Across multiple operations, meaning sovereignty over the NaN values cannot be consistently maintained in any case. Specifically: After the application of a certain operator, the result array contains NaN values with different meanings (e.g. missing vs. invalid numbers). For all subsequent operator applications in the recipe using this result array, the clear assignment of NaN meanings is therefore no longer possible. A semantic specification of the NaN definition for the arrays x and y is therefore only meaningful and consistent if different NaN representations (sth along the NA, NAN representations in R) can be introduced.

I therefore regard the current implementation as a bug in any case and also believe that the problematic semantics of the NaN cannot be solved with this - at least not in a simple way ;)

For the given case, I suggest modifying the code so that the correct operator algebra is guarenteed. This is already the case for the algebraic bivariate operators (add, subtract, multiply, ...), as NaN in x and y are transferred equally to the result. The groups of functions that would have to be changed are...

a) relational bivariate operators (less, greater, equal, ...) b) boolean bivariate operators (and, or, xor) and c) membership bivariate operators (in, not_in)

The modification would be a replacement by np.where(pd.notnull(x) & pd.notnull(y), <operator>, np.nan) as mentioned at the beginning. This would mean that NaN values are consistently pulled along, regardless of the operator and whether they occur in x or y, which in my opinion corresponds to the intuitive expectation (also from the user's point of view). I am of course open to hearing your views on this consideration :)

ZGIS / semantique