apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Should NaN comparison return false or NaN/NA? #29038

Open asfimport opened 3 years ago

asfimport commented 3 years ago

In working on ARROW-12964 we ran into some corner behaviors with NaN that don't match our (and R's) expectations. It appears that (any?) comparison with NaN results in false:


> Scalar$create(NaN) > 5
Scalar
false

though at least in R this would result in an NA value:


> NaN > 5
[1] NA

The current behavior does match numpy's behavior:


>>> np.nan > 5
False

Reporter: Jonathan Keane / @jonkeane

Note: This issue was originally created as ARROW-13364. Please see the migration documentation for further details.

asfimport commented 3 years ago

Eduardo Ponce / @edponce: EqualOptions has a nans_equal member to control the behavior of comparisons between NaNs. I assume this was included to satisfy the behavior of different tools.

IEEE 754 specifies that every comparison involving a NaN evaluates to false, except for NaN != x, which evaluates to true. For a discussion on this topic, refer to the first answer of this Stack Overflow question and the Wikipedia NaN page.
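To make the IEEE 754 rule concrete, here is a small sketch using only plain Python floats (not Arrow), which follow IEEE 754 comparison semantics. The `results` dictionary is just an illustrative name.

```python
import math

nan = math.nan  # a quiet NaN stored in a regular Python float

# Ordered and equality comparisons that involve NaN evaluate to False,
# while "not equal" is the one predicate that evaluates to True.
results = {
    "nan > 5": nan > 5,
    "nan < 5": nan < 5,
    "nan >= 5": nan >= 5,
    "nan <= 5": nan <= 5,
    "nan == 5": nan == 5,
    "nan == nan": nan == nan,
    "nan != 5": nan != 5,
    "nan != nan": nan != nan,
}

for expr, value in results.items():
    print(f"{expr}: {value}")
```

Only the two `!=` entries print `True`; this is the same behavior the issue reports for numpy and for Arrow's `Scalar$create(NaN) > 5`.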

My opinion for Arrow is that all logical comparisons with a NaN value should return false except for:

asfimport commented 3 years ago

Eduardo Ponce / @edponce: R uses NA to represent a missing value, equivalent to having a NULL bit set in Arrow.

Coercing NaN to logical or integer type gives an NA of the appropriate type, but coercion to character gives the string "NaN". NaN values are incomparable so tests of equality or collation involving NaN will result in NA.

w.r.t. R's behavior for


> NaN > 5
[1] NA

it does not seem to conform strictly to IEEE 754. My speculation is that internally the result is NaN, but when coerced to a logical type it becomes NA.
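For contrast, the R-style result can be emulated in a few lines. This is only a sketch of the observable behavior, not of how R implements it internally: `greater_na` is a hypothetical helper, and Python's `None` stands in for NA (an Arrow null).

```python
import math
from typing import Optional


def greater_na(x: Optional[float], y: Optional[float]) -> Optional[bool]:
    """Hypothetical helper: `x > y` with R-like NA propagation.

    None plays the role of NA. A missing or NaN operand makes the
    result missing instead of False.
    """
    if x is None or y is None:
        return None  # NA in, NA out
    if math.isnan(x) or math.isnan(y):
        return None  # NaN is incomparable, so the result is NA
    return x > y


print(greater_na(math.nan, 5.0))  # R-like: missing
print(math.nan > 5.0)             # IEEE 754-like: False
print(greater_na(6.0, 5.0))       # ordinary comparison
```

The difference between the two conventions is therefore only in how an incomparable pair is reported: as a definite `False`, or as a missing value.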

 

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche:

"EqualOptions has a nans_equal member to control the behavior of comparisons between NaNs. I assume this was included to satisfy the behavior of different tools."

Note this is for a different operation: a "full array, data-structure equality" check (arr1.equals(arr2) returns True or False). The option is there mainly for convenience: for full-array equality you often want NaNs in the same location to count as equal, and writing this out manually is rather verbose, i.e. something like ((a == b) | (a.isnan() & b.isnan())).all().

We don't have such an option for element-wise comparisons (which is the type of equality/comparison that is discussed in this issue).
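The distinction between the two operations can be sketched with plain Python lists of floats. Both functions below are hypothetical illustrations, not Arrow's API; the `nans_equal` parameter is only modelled on the option described above, and null handling is left out.

```python
import math
from typing import List, Sequence


def elementwise_equal(a: Sequence[float], b: Sequence[float]) -> List[bool]:
    """Element-wise comparison: one boolean per position, NaN == NaN is False."""
    return [x == y for x, y in zip(a, b)]


def arrays_equal(a: Sequence[float], b: Sequence[float],
                 nans_equal: bool = False) -> bool:
    """Whole-array equality: a single boolean for the full structure."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x == y:
            continue
        # Optionally treat NaNs in the same position as equal.
        if nans_equal and math.isnan(x) and math.isnan(y):
            continue
        return False
    return True


a = [1.0, math.nan]
b = [1.0, math.nan]
print(elementwise_equal(a, b))              # per-element result
print(arrays_equal(a, b))                   # NaNs not treated as equal
print(arrays_equal(a, b, nans_equal=True))  # NaNs in the same slot match
```

With `nans_equal=True` the whole-array check accepts the two lists, while the element-wise result for the NaN position stays `False` in either case.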