apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R] Arrays containing -2147483648 are converted to NA in R #40194

Open Wainberg opened 7 months ago

Wainberg commented 7 months ago

Describe the bug, including details regarding any error messages, version, and platform.

R uses -2147483648 (int32_min) to represent missing integer values (NA). When Arrow arrays are brought into R through the C API and then converted to R vectors with as.vector, arrays containing values of -2147483649 or below are converted to bit64::integer64. If the minimum value of the array is exactly -2147483648, however, every -2147483648 is converted to NA. This is an edge case, but an important one, because int32_min is often used as a special sentinel value.
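To make the collision concrete, here is a minimal standalone C++ sketch. It does not use the R or Arrow headers: `kRNaInteger`, `IsRNa`, and `CountReadAsNa` are hypothetical names, and the constant simply mirrors R's definition of NA_INTEGER as INT_MIN. It shows that copying int32 values verbatim leaves any int32_min element indistinguishable from NA on the R side.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Assumption: mirrors R's NA_INTEGER (defined as INT_MIN) without including
// R headers, so that the sketch compiles on its own.
constexpr std::int32_t kRNaInteger = std::numeric_limits<std::int32_t>::min();

// Hypothetical helper: how an R-side consumer classifies a single element.
inline bool IsRNa(std::int32_t value) { return value == kRNaInteger; }

// Hypothetical helper: copy int32 values unchanged into an R-style buffer and
// count how many slots R would then read as NA.
inline int CountReadAsNa(const std::vector<std::int32_t>& arrow_values) {
  std::vector<std::int32_t> r_buffer(arrow_values);  // values copied verbatim
  int n_na = 0;
  for (std::int32_t v : r_buffer) {
    if (IsRNa(v)) ++n_na;
  }
  return n_na;
}
```

For example, `CountReadAsNa({INT32_MIN, 0, 5})` reports one NA even though the input held no nulls.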

Arrow should update the out-of-range check that decides whether to convert to bit64::integer64 so that it uses -2147483647, rather than -2147483648, as the minimum valid int32.
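The proposed bound could be sketched as follows. This is not the check in the Arrow R package: `FitsInRInteger` and the two constants are hypothetical names, and the function assumes the caller has already computed the minimum and maximum of the array.

```cpp
#include <cstdint>

// R integers are 32-bit, but INT32_MIN is reserved for NA_integer_, so the
// smallest non-missing value an R integer vector can hold is INT32_MIN + 1.
constexpr std::int64_t kMinRInteger = -2147483647LL;
constexpr std::int64_t kMaxRInteger = 2147483647LL;

// Hypothetical check: true when every value in [min_value, max_value] can be
// stored in an R integer vector without colliding with NA. When it returns
// false, the caller would fall back to a wider type such as bit64::integer64.
inline bool FitsInRInteger(std::int64_t min_value, std::int64_t max_value) {
  return min_value >= kMinRInteger && max_value <= kMaxRInteger;
}
```

With this bound, an array whose minimum is exactly -2147483648 is treated as out of range, just like one whose minimum is -2147483649.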

Component(s)

R

paleolimbot commented 7 months ago

Reprex:

library(arrow, warn.conflicts = FALSE)

arrow_array("-2147483648")$cast(int32()) |> as.vector()
#> [1] NA

This conversion happens in one of a few places, depending on whether options(arrow.use_altrep = TRUE) is set and on how the caller of the R C API consumes the ALTREP array (via INTEGER_ELT(), INTEGER_GET_REGION(), or DATAPTR_RO()):

https://github.com/apache/arrow/blob/214378b522a36fbf6010e3d4f5470abaca7bf92e/r/src/array_to_vector.cpp#L193

https://github.com/apache/arrow/blob/214378b522a36fbf6010e3d4f5470abaca7bf92e/r/src/altrep.cpp#L282-L283

https://github.com/apache/arrow/blob/214378b522a36fbf6010e3d4f5470abaca7bf92e/r/src/altrep.cpp#L310-L311

https://github.com/apache/arrow/blob/214378b522a36fbf6010e3d4f5470abaca7bf92e/r/src/altrep.cpp#L259-L262

Checking for specific int32 values is safer but potentially expensive in the ALTREP scenario. Technically, the same problem exists for int64 conversions to R's integer64 class.
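As a rough illustration of where the cost comes from, a value check has to visit every element before the array can be exposed to R. The sketch below is an assumption-laden stand-in, not Arrow code: `HasNonNullInt32Min` is a hypothetical name, and a `std::vector<bool>` replaces Arrow's validity bitmap for simplicity.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: true if any non-null element equals INT32_MIN.
// `validity` is a simplified stand-in for a validity bitmap (one bool per
// element, true = valid); an empty vector is taken to mean "no nulls".
inline bool HasNonNullInt32Min(const std::vector<std::int32_t>& values,
                               const std::vector<bool>& validity = {}) {
  constexpr std::int32_t kSentinel = std::numeric_limits<std::int32_t>::min();
  // Linear scan: in the worst case every element is inspected.
  for (std::size_t i = 0; i < values.size(); ++i) {
    const bool is_valid = validity.empty() || validity[i];
    if (is_valid && values[i] == kSentinel) return true;
  }
  return false;
}
```

The scan is O(n) in the array length, which works against the zero-copy benefit that ALTREP is meant to provide.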

It looks like nanoarrow has an identical problem here:

library(arrow, warn.conflicts = FALSE)

arrow_array("-2147483648")$cast(int32()) |>
  nanoarrow::as_nanoarrow_array() |> 
  as.vector()
#> [1] NA