null vs NaN - Githubissues

m-mohr commented 8 months ago

Historically, the openEO processes have used null to encode no-data values because JSON can't encode NaN (or +/-Infinity). This was always meant only as a placeholder for communication through the API. Internally, it can be anything: for some data it might be 0 or 255, for some it might be NaN or null, depending on the underlying implementation. The actual no-data values were meant to be encoded in the collection and output metadata. So the intention was always that the processes can pass around NaN and +/-Infinity internally.
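To illustrate the JSON limitation, here is a minimal Python sketch. The helper `to_json_safe` is a hypothetical name invented for this example, not part of any openEO library; it only shows how non-finite floats could be swapped for the null placeholder at the API boundary.

```python
import json
import math


def to_json_safe(value):
    """Replace non-finite floats with None so strict JSON can carry them.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, list):
        return [to_json_safe(v) for v in value]
    return value


data = [1.5, float("nan"), float("inf"), 3.0]

# Strict JSON has no literal for NaN or Infinity, so this raises ValueError.
try:
    json.dumps(data, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

# After substitution, the non-finite values travel as JSON null.
encoded = json.dumps(to_json_safe(data), allow_nan=False)
```

Note that the substitution is lossy: once encoded, NaN, +Infinity and -Infinity are indistinguishable, which is why the actual no-data value has to come from metadata.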

Now that we are starting on the test suite for individual processes, it turns out that we write tests that expect null to be handled in internal interfaces. Also, the process definitions get a bit awkward if we define behavior for both null and NaN, e.g. in #479. Sometimes the behavior is even different (i.e. our null definition derives from what IEEE 754 defines for NaN). And is NaN covered by the ignore_nodata parameters?
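The ignore_nodata question can be made concrete with a small Python sketch. The `mean` function and its `nan_is_nodata` flag are hypothetical and exist only to contrast the two possible readings; they are not taken from the openEO process definitions. Python's None stands in for the JSON null placeholder.

```python
import math

nan = float("nan")

# IEEE 754 behavior: NaN is unequal to everything, including itself,
# and it propagates through arithmetic.
nan_equals_itself = (nan == nan)
nan_propagates = math.isnan(nan + 1.0)


def mean(values, ignore_nodata=True, nan_is_nodata=False):
    """Hypothetical mean; None plays the role of the null placeholder."""

    def is_nodata(v):
        if v is None:
            return True
        return nan_is_nodata and isinstance(v, float) and math.isnan(v)

    if ignore_nodata:
        values = [v for v in values if not is_nodata(v)]
    elif any(is_nodata(v) for v in values):
        return None
    if not values:
        return None
    return sum(values) / len(values)


values = [1.0, None, nan, 3.0]

# Reading 1: NaN is ordinary data, so it survives the filter and poisons the sum.
separate = mean(values)

# Reading 2: NaN is treated as nodata and skipped together with None.
merged = mean(values, nan_is_nodata=True)
```

Under the first reading the result is NaN; under the second it is 2.0. A test suite has to pick one of these, which is the ambiguity raised above.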

I'd like to discuss how to handle this.

soxofaan commented 8 months ago

Note that we already have the nan process under proposal (https://github.com/Open-EO/openeo-processes/blob/master/proposals/nan.json), which can be used to express a NaN value in a (JSON compatible) process graph.
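As a rough sketch of what that could look like, the Python snippet below builds a tiny process graph in which a `nan` node feeds an `add` node, then serializes it as strict JSON. The node identifiers `nan1` and `add1` are arbitrary names chosen for this example, and the graph is an illustration rather than something taken from the proposal.

```python
import json

# A process graph is plain JSON: the NaN value is never written literally,
# it is produced by a node that calls the `nan` process.
process_graph = {
    "nan1": {
        "process_id": "nan",
        "arguments": {},
    },
    "add1": {
        "process_id": "add",
        # Reference the output of the `nan` node instead of embedding NaN.
        "arguments": {"x": 1, "y": {"from_node": "nan1"}},
        "result": True,
    },
}

# allow_nan=False proves that no non-JSON literal sneaks into the payload.
encoded = json.dumps(process_graph, allow_nan=False)
decoded = json.loads(encoded)
```

The graph round-trips through strict JSON unchanged, so a back-end is free to materialize the NaN in whatever internal representation it uses.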

m-mohr commented 7 months ago

Conclusion from the openEO community call:

- Keep nodata value and NaN separate unless NaN is the nodata value
- Clarify processes; also don't use null so much to refer to no-data values, define this more on the schema level

soxofaan commented 7 months ago

(I've been pondering a bit more about this after the openEO community call, and wanted to dump some thoughts here)

This is indeed quite confusing, but it helps to be aware of, or explicit about, which environment or representation level you are talking about. Compare that with the representation of (Unicode) characters, e.g. the German letter ß: it has Unicode codepoint U+00DF, in UTF-8 it's encoded with two bytes \xC3\x9F, in latin1 encoding it's just one byte \xDF, in HTML you can encode it with &szlig;, in (classic) ASCII it's impossible to represent directly (unless you coerce it to ss), etc...
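The character analogy is easy to check in Python with only the standard library. This is just a demonstration of the encodings listed above, nothing openEO specific:

```python
import html

char = "\u00df"  # the letter ß, Unicode codepoint U+00DF

utf8_bytes = char.encode("utf-8")      # two bytes
latin1_bytes = char.encode("latin-1")  # one byte
from_entity = html.unescape("&szlig;")  # HTML named entity

# Classic ASCII has no representation for this character at all.
try:
    char.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

One abstract character, several concrete encodings, and one context where it cannot be expressed at all: the same pattern as nodata across JSON, floats and integer rasters.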

Likewise, "nodata" is a more symbolic concept that has different representations in different contexts: in pure JSON null seems to be the most sensible option, in IEEE-style floats there is a specific NaN "code point" that is commonly used to encode nodata, in Python you typically use None, in geotrellis (as used in the VITO backend) you can define a custom nodata value regardless of the datatype (int, float, ...) of the data tile you're working with, in numpy it depends and requires more DIY hacks (float arrays can use the IEEE-style NaN, but for other dtypes you have to use masked arrays, or object dtype with None), in a spreadsheet you can leave cells empty, in C/C++/Java you can have null pointers, ...
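A small pure-Python sketch can show how several of these representations collapse into the same symbolic "nodata" once you know which convention applies. The function `nodata_mask` and the sample tiles are hypothetical, and plain lists are used instead of numpy or geotrellis structures to keep the example dependency-free.

```python
import math


def nodata_mask(values, nodata=None):
    """Return a list of booleans, True where a value counts as nodata.

    nodata=None       -> Python None marks missing values
    nodata=float NaN  -> IEEE NaN marks missing values
    nodata=<number>   -> a sentinel such as 255 marks missing values
    """
    if nodata is None:
        return [v is None for v in values]
    if isinstance(nodata, float) and math.isnan(nodata):
        # NaN never compares equal, so it needs an explicit isnan check.
        return [isinstance(v, float) and math.isnan(v) for v in values]
    return [v == nodata for v in values]


float_tile = [0.2, float("nan"), 0.7]   # IEEE-style NaN as nodata
byte_tile = [12, 255, 40]               # sentinel value as nodata
object_tile = [12, None, 40]            # None as nodata

masks = [
    nodata_mask(float_tile, nodata=float("nan")),
    nodata_mask(byte_tile, nodata=255),
    nodata_mask(object_tile),
]
```

All three tiles yield the same mask even though the stored values differ, which is the sense in which nodata is symbolic and the concrete value is an encoding detail.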

The problem at the level of the openEO process specs is that they are written in JSON (schema), so you can only use null as the representation of "nodata". In some places we try to talk about NaN/not a number, but that gets confusing without the proper context. For example, openEO processes defines the processes is_nan(x) and nan(), but they have subtly different "overloaded" interpretations of "not a number": nan produces the IEEE-float NaN value, which only exists in an IEEE float context, while is_nan accepts anything in a broader JSON context: a string or array is also "not a number".
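The two interpretations can be sketched in Python as follows. These functions mirror the reading given in this comment and are not the normative process definitions; in particular, treating None and booleans as "not a number" is an assumption made for this example.

```python
import math


def nan():
    """IEEE-float reading: produce the floating-point NaN value."""
    return float("nan")


def is_nan(x):
    """Broader JSON reading: anything that is not a numeric value.

    Assumption: None and booleans count as "not a number" here.
    bool is checked first because it is a subclass of int in Python.
    """
    if isinstance(x, bool) or not isinstance(x, (int, float)):
        return True
    return math.isnan(x)


results = {
    "nan()": is_nan(nan()),      # IEEE NaN
    "string": is_nan("abc"),     # not a number in the JSON sense
    "array": is_nan([1, 2]),     # not a number in the JSON sense
    "integer": is_nan(1),
    "float": is_nan(1.5),
}
```

Only the first case involves an actual IEEE NaN; the string and array cases return true purely because of their type, which is the overloading described above.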

At the moment I don't see anything that needs fundamental fixing, it's probably a matter of being more explicit about some details and assumptions in the descriptions and docs.

m-mohr commented 6 months ago

Agreed, see PR #490

Open-EO / openeo-processes

null vs NaN #480