[R] altrep data type slows down evaluation

mikerspencer commented 11 months ago

Describe the bug, including details regarding any error messages, version, and platform.

When evaluating something like an NA check, a vector with the ALTREP data type returned by arrow is approximately four times slower than a plain R vector. Tested with arrow 10, 12, and 14 on Ubuntu.

library(arrow)
library(dplyr)

# generate data
x = runif(29500000) * 10
d = data.frame(cv = x)
write_dataset(d, "/tmp/data.arrow")
# then read back
df = open_dataset("/tmp/data.arrow/") %>% select(cv) %>% collect()
x = df$cv
y = x + 0

identical(x, y)
microbenchmark::microbenchmark(x={sum(is.na(x))}, y={sum(is.na(y))})

Results:

Unit: milliseconds
 expr   min    lq  mean median    uq   max neval
    x 291.8 302.2 348.8  310.2 348.8 754.8   100
    y  85.3  87.2 108.8   89.3 133.4 225.4   100

With thanks to Barry for the reprex https://mastodon.scot/@geospacedman@mastodon.social/111450704657241188

Component(s)

R

paleolimbot commented 11 months ago

Thanks for opening the issue, and thanks for the reprex!

It is true that ALTREP objects generally perform more slowly than non-ALTREP objects, although I wouldn't have expected this particular operation to be that much slower.

I will dig into this, but in the meantime, you can turn ALTREP off using options(arrow.use_altrep = FALSE):

library(arrow)
library(dplyr)
options(arrow.use_altrep = FALSE)

# generate data
x = runif(29500000) * 10
d = data.frame(cv = x)
write_dataset(d, "/tmp/data.arrow")
# then read back
df = open_dataset("/tmp/data.arrow/") %>% select(cv) %>% collect()
x = df$cv
y = x + 0

identical(x, y)
#> [1] TRUE
microbenchmark::microbenchmark(x={sum(is.na(x))}, y={sum(is.na(y))})
#> Warning in microbenchmark::microbenchmark(x = {: less accurate nanosecond times
#> to avoid potential integer overflows
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq       max neval
#>     x 41.99819 42.61737 46.73451 46.07767 46.77581 106.51480   100
#>     y 41.97875 42.59027 46.16944 46.06932 46.63008  67.24804   100

Created on 2023-11-30 with reprex v2.0.2
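
If changing the option for the whole session is undesirable, it can be scoped to a single call. The following is a minimal base R sketch, assuming arrow reads `arrow.use_altrep` at the time the data is converted to R vectors; `with_altrep_disabled()` is a hypothetical helper name, not an arrow function.

```r
# Hypothetical helper (not part of arrow): evaluate `expr` with
# arrow.use_altrep set to FALSE, then restore the previous setting.
with_altrep_disabled <- function(expr) {
  old <- options(arrow.use_altrep = FALSE)
  on.exit(options(old), add = TRUE)
  # `expr` is a lazy promise, so it is only evaluated here,
  # after the option has been changed.
  force(expr)
}

# Stand-in usage that does not need arrow: the option is FALSE inside the call.
with_altrep_disabled(getOption("arrow.use_altrep"))
```

With arrow loaded, the wrapped expression would be the `open_dataset(...) %>% select(cv) %>% collect()` pipeline from the reprex above.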

mikerspencer commented 11 months ago

That's great, thanks! I get a slightly quicker response now from the arrow var:

Unit: milliseconds
 expr      min        lq     mean   median       uq      max neval
    x 87.89557  96.76203 121.5635 117.6893 133.8556 252.2526   100
    y 88.43664 101.78672 129.5208 120.7926 151.2396 281.5310   100

Getting into hardware, I suspect you're on Apple silicon with those times. It's interesting you don't see a difference between the two methods, but on my AMD machine it's now quicker with the arrow var.

paleolimbot commented 10 months ago

Slightly more minimal reprex:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

x <- runif(29500000) * 10
x_altrep <- as.vector(as_chunked_array(x))

bench::mark(
  is.na(x),
  is.na(x_altrep)
)
#> # A tibble: 2 × 6
#>   expression           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 is.na(x)          23.7ms   24.1ms     41.3      113MB     15.9
#> 2 is.na(x_altrep)  244.4ms  244.5ms      4.09     113MB      0

Created on 2024-01-03 with reprex v2.0.2

paleolimbot commented 10 months ago

I checked to make sure that nothing unexpected is happening (e.g., we had an issue before where we were accidentally materializing the entire array on each call to Elt()), and nothing seems to be amiss: for ALTREP objects, the underlying implementation calls ISNAN(REAL_ELT(x, i)) (or similar, I didn't check) once per element. For us, that's very slow.

A better implementation might call REAL_GET_REGION(). If it did, the ALTREP implementation would be slower but not nearly as bad as extracting each element individually.

library(arrow, warn.conflicts = FALSE)

x <- runif(29500000) * 10
x_altrep <- as.vector(as_chunked_array(x))
.Internal(inspect(x_altrep))
#> @1064c4180 14 REALSXP g0c0 [REF(65535)] arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000

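# Probably a better implementation than base R's
cpp11::cpp_function("
cpp11::logicals is_na2(cpp11::doubles x) {
    // Number of elements fetched per REAL_GET_REGION() call
    int region_size = 1024;
    R_xlen_t n = x.size();
    cpp11::writable::logicals out(n);
    // Scratch buffer owned by an R vector so that it is protected from the GC
    cpp11::writable::doubles buf_shelter(region_size);
    double* buf = REAL(buf_shelter);
    for (R_xlen_t i = 0; i < n; i++) {
      // Refill the buffer at the start of each region instead of
      // asking the ALTREP object for one element at a time
      if ((i % region_size) == 0) {
        REAL_GET_REGION(x, i, region_size, buf);
      }
      out[i] = ISNAN(buf[i % region_size]);
    }
    return out;
}
")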

bench::mark(
  is.na(x),
  is.na(x_altrep),
  is_na2(x),
  is_na2(x_altrep)
)
#> # A tibble: 4 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 is.na(x)           23.8ms   24.1ms     40.6      113MB    14.5 
#> 2 is.na(x_altrep)   315.7ms  317.6ms      3.15     113MB     0   
#> 3 is_na2(x)          60.8ms   61.4ms     16.3      113MB     5.44
#> 4 is_na2(x_altrep)     62ms   62.3ms     16.0      113MB     5.35

# Make sure we didn't materialize
.Internal(inspect(x_altrep))
#> @1064c4180 14 REALSXP g1c0 [MARK,REF(65535)] arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000

Created on 2024-01-03 with reprex v2.0.2

It does raise the question of whether enabling ALTREP by default is worth the trouble.
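
For a single vector that is used repeatedly, a narrower workaround than the global option is the `x + 0` step from the original reprex, which returns an ordinary double vector at the cost of one extra copy. A minimal base R sketch; `materialize_dbl()` is a hypothetical helper name, not part of arrow, and the small vector below only stands in for an arrow-backed column such as `df$cv`.

```r
# Hypothetical helper (not an arrow function): copy a double vector into
# ordinary R memory using the `x + 0` trick from the original reprex.
materialize_dbl <- function(x) {
  stopifnot(is.double(x))
  x + 0
}

x <- c(1.5, NA, 3)   # stand-in for an arrow-backed column such as df$cv
y <- materialize_dbl(x)

identical(x, y)      # same values, as in the original reprex
sum(is.na(y))        # the NA check now runs on a plain vector
```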

apache / arrow

[R] altrep data type slows down evaluation #39004
