apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] altrep data type slows down evaluation #39004

Open mikerspencer opened 11 months ago

mikerspencer commented 11 months ago

Describe the bug, including details regarding any error messages, version, and platform.

When evaluating operations such as checking for NA, the ALTREP data type slows the calculation by roughly a factor of four. Tested with arrow 10, 12 and 14 on Ubuntu.

library(arrow)
library(dplyr)

# generate data
x = runif(29500000) * 10
d = data.frame(cv = x)
write_dataset(d, "/tmp/data.arrow")
# then read back
df = open_dataset("/tmp/data.arrow/") %>% select(cv) %>% collect()
x = df$cv
y = x + 0

identical(x, y)
microbenchmark::microbenchmark(x={sum(is.na(x))}, y={sum(is.na(y))})

Results:

Unit: milliseconds
 expr   min    lq  mean median    uq   max neval
    x 291.8 302.2 348.8  310.2 348.8 754.8   100
    y  85.3  87.2 108.8   89.3 133.4 225.4   100

With thanks to Barry for the reprex https://mastodon.scot/@geospacedman@mastodon.social/111450704657241188

Component(s)

R

paleolimbot commented 11 months ago

Thanks for opening the issue, and thanks for the reprex!

It is true that ALTREP objects generally perform more slowly than non-ALTREP objects, although I wouldn't have expected this particular operation to be that much slower.

I will dig into this, but in the meantime, you can turn ALTREP off using options(arrow.use_altrep = FALSE):

library(arrow)
library(dplyr)
options(arrow.use_altrep = FALSE)

# generate data
x = runif(29500000) * 10
d = data.frame(cv = x)
write_dataset(d, "/tmp/data.arrow")
# then read back
df = open_dataset("/tmp/data.arrow/") %>% select(cv) %>% collect()
x = df$cv
y = x + 0

identical(x, y)
#> [1] TRUE
microbenchmark::microbenchmark(x={sum(is.na(x))}, y={sum(is.na(y))})
#> Warning in microbenchmark::microbenchmark(x = {: less accurate nanosecond times
#> to avoid potential integer overflows
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq       max neval
#>     x 41.99819 42.61737 46.73451 46.07767 46.77581 106.51480   100
#>     y 41.97875 42.59027 46.16944 46.06932 46.63008  67.24804   100

Created on 2023-11-30 with reprex v2.0.2

mikerspencer commented 11 months ago

That's great, thanks! I get a slightly quicker response now from the arrow var:

Unit: milliseconds
 expr      min        lq     mean   median       uq      max neval
    x 87.89557  96.76203 121.5635 117.6893 133.8556 252.2526   100
    y 88.43664 101.78672 129.5208 120.7926 151.2396 281.5310   100

Getting into hardware, I suspect you're on Apple silicon with those times. It's interesting you don't see a difference between the two methods, but on my AMD machine it's now quicker with the arrow var.

paleolimbot commented 10 months ago

Slightly more minimal reprex:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

x <- runif(29500000) * 10
x_altrep <- as.vector(as_chunked_array(x))

bench::mark(
  is.na(x),
  is.na(x_altrep)
)
#> # A tibble: 2 × 6
#>   expression           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 is.na(x)          23.7ms   24.1ms     41.3      113MB     15.9
#> 2 is.na(x_altrep)  244.4ms  244.5ms      4.09     113MB      0

Created on 2024-01-03 with reprex v2.0.2

paleolimbot commented 10 months ago

I checked to make sure that nothing unexpected is happening (e.g., we had an issue before where we were materializing the entire array by accident for each call to Elt()), and nothing seems to be amiss: the underlying implementation is calling ISNAN(REAL_ELT(x)) (or similar, I didn't check) a lot of times for ALTREP objects. For us, that's very slow.

A better implementation might call REAL_GET_REGION(). If it did, the ALTREP implementation would be slower but not nearly as bad as extracting each element individually.

library(arrow, warn.conflicts = FALSE)

x <- runif(29500000) * 10
x_altrep <- as.vector(as_chunked_array(x))
.Internal(inspect(x_altrep))
#> @1064c4180 14 REALSXP g0c0 [REF(65535)] arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000

# Probably a better implementation than base R's
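cpp11::cpp_function("
cpp11::logicals is_na2(cpp11::doubles x) {
    // Number of elements copied out of x per REAL_GET_REGION() call
    int region_size = 1024;
    R_xlen_t n = x.size();
    cpp11::writable::logicals out(n);
    // Scratch buffer held in an R vector so that it stays protected
    cpp11::writable::doubles buf_shelter(region_size);
    double* buf = REAL(buf_shelter);
    for (R_xlen_t i = 0; i < n; i++) {
      // At the start of each block, refill the buffer with one region call
      if ((i % region_size) == 0) {
        REAL_GET_REGION(x, i, region_size, buf);
      }
      out[i] = ISNAN(buf[i % region_size]);
    }
    return out;
}
")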

bench::mark(
  is.na(x),
  is.na(x_altrep),
  is_na2(x),
  is_na2(x_altrep)
)
#> # A tibble: 4 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 is.na(x)           23.8ms   24.1ms     40.6      113MB    14.5 
#> 2 is.na(x_altrep)   315.7ms  317.6ms      3.15     113MB     0   
#> 3 is_na2(x)          60.8ms   61.4ms     16.3      113MB     5.44
#> 4 is_na2(x_altrep)     62ms   62.3ms     16.0      113MB     5.35

# Make sure we didn't materialize
.Internal(inspect(x_altrep))
#> @1064c4180 14 REALSXP g1c0 [MARK,REF(65535)] arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000

Created on 2024-01-03 with reprex v2.0.2
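
For readers who want to see the two access patterns side by side without R in the picture, here is a minimal standalone C++ sketch. It is a model only: `DoubleSource`, `VectorSource`, `count_nan_elementwise` and `count_nan_by_region` are hypothetical names invented for illustration, not part of R's C API or of arrow. `elt()` stands in for a `REAL_ELT()`-style call and `get_region()` for a `REAL_GET_REGION()`-style call, on the assumption that each call into the source costs one dispatch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an ALTREP-backed double vector: callers cannot
// see the storage and must go through virtual accessors.
struct DoubleSource {
    virtual ~DoubleSource() = default;
    virtual std::size_t size() const = 0;
    // Plays the role of REAL_ELT(): one dispatch per element.
    virtual double elt(std::size_t i) const = 0;
    // Plays the role of REAL_GET_REGION(): copies at most n values starting
    // at index i into buf and returns how many were actually copied.
    virtual std::size_t get_region(std::size_t i, std::size_t n,
                                   double* buf) const = 0;
};

// Simplest possible backing store, used only to exercise the interface.
struct VectorSource : DoubleSource {
    std::vector<double> data;
    explicit VectorSource(std::vector<double> d) : data(std::move(d)) {}
    std::size_t size() const override { return data.size(); }
    double elt(std::size_t i) const override { return data[i]; }
    std::size_t get_region(std::size_t i, std::size_t n,
                           double* buf) const override {
        if (i >= data.size()) return 0;
        const std::size_t count = std::min(n, data.size() - i);
        std::copy(data.data() + i, data.data() + i + count, buf);
        return count;
    }
};

// Per-element pattern: every value is fetched through its own elt() call.
std::size_t count_nan_elementwise(const DoubleSource& x) {
    std::size_t total = 0;
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (std::isnan(x.elt(i))) ++total;
    }
    return total;
}

// Region pattern: one get_region() call per block, then a tight loop over
// a local buffer.
std::size_t count_nan_by_region(const DoubleSource& x,
                                std::size_t region_size) {
    assert(region_size > 0);
    std::vector<double> buf(region_size);
    std::size_t total = 0;
    const std::size_t n = x.size();
    for (std::size_t start = 0; start < n; start += region_size) {
        // The final block may be short, so only the returned count is read.
        const std::size_t got = x.get_region(start, region_size, buf.data());
        for (std::size_t j = 0; j < got; ++j) {
            if (std::isnan(buf[j])) ++total;
        }
    }
    return total;
}
```

Both functions return the same count. The region version makes roughly `n / region_size` calls into the source instead of `n`. It also iterates over the count returned by `get_region()` rather than indexing with `i % region_size`, so a short final block never reads stale buffer contents.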

It does raise the question of whether ALTREP by default is worth the trouble.