Open mikerspencer opened 11 months ago
Thanks for opening the issue, and thanks for the reprex!
It is true that ALTREP objects generally perform more slowly than non-ALTREP objects, although I wouldn't have expected this particular operation to be that much slower.
I will dig into this, but in the meantime, you can turn ALTREP off using options(arrow.use_altrep = FALSE):
library(arrow)
library(dplyr)
options(arrow.use_altrep = FALSE)
# generate data
x = runif(29500000) * 10
d = data.frame(cv = x)
write_dataset(d, "/tmp/data.arrow")
# then read back
df = open_dataset("/tmp/data.arrow/") %>% select(cv) %>% collect()
x = df$cv
y = x + 0
identical(x, y)
#> [1] TRUE
microbenchmark::microbenchmark(x={sum(is.na(x))}, y={sum(is.na(y))})
#> Warning in microbenchmark::microbenchmark(x = {: less accurate nanosecond times
#> to avoid potential integer overflows
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> x 41.99819 42.61737 46.73451 46.07767 46.77581 106.51480 100
#> y 41.97875 42.59027 46.16944 46.06932 46.63008 67.24804 100
Created on 2023-11-30 with reprex v2.0.2
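If turning ALTREP off for the whole session is too broad, the option can probably be scoped to a single call. A minimal sketch in base R: with_altrep_off() is a hypothetical helper, not part of arrow, and it assumes arrow reads the arrow.use_altrep option at the moment it converts data to R vectors (for example inside collect()).

```r
# Hypothetical helper (not part of arrow): evaluate `expr` with ALTREP
# conversion disabled, then restore the previous option value.
# Assumption: arrow checks `arrow.use_altrep` when it builds R vectors.
with_altrep_off <- function(expr) {
  old <- options(arrow.use_altrep = FALSE)
  on.exit(options(old), add = TRUE)
  # `expr` is a lazy promise, so it is only evaluated here,
  # after the option has been switched off.
  force(expr)
}
```

Usage would look like df <- with_altrep_off(collect(select(open_dataset("/tmp/data.arrow/"), cv))), which leaves the global setting untouched for the rest of the session.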
That's great, thanks! I get a slightly quicker response now from the arrow var:
Unit: milliseconds
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
x | 87.89557 | 96.76203 | 121.5635 | 117.6893 | 133.8556 | 252.2526 | 100 |
y | 88.43664 | 101.78672 | 129.5208 | 120.7926 | 151.2396 | 281.5310 | 100 |
Getting into hardware, I suspect you're on Apple silicon with those times. It's interesting you don't see a difference between the two methods, but on my AMD machine it's now quicker with the arrow var.
Slightly more minimal reprex:
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
x <- runif(29500000) * 10
x_altrep <- as.vector(as_chunked_array(x))
bench::mark(
is.na(x),
is.na(x_altrep)
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 is.na(x) 23.7ms 24.1ms 41.3 113MB 15.9
#> 2 is.na(x_altrep) 244.4ms 244.5ms 4.09 113MB 0
Created on 2024-01-03 with reprex v2.0.2
I checked to make sure that nothing unexpected is happening (e.g., we had an issue before where we were materializing the entire array by accident for each call to Elt()), and nothing seems to be amiss: the underlying implementation is calling ISNAN(REAL_ELT(x)) (or similar, I didn't check) a lot of times for ALTREP objects. For us, that's very slow.
A better implementation might call REAL_GET_REGION(). If it did, the ALTREP implementation would be slower but not nearly as bad as extracting each element individually.
library(arrow, warn.conflicts = FALSE)
x <- runif(29500000) * 10
x_altrep <- as.vector(as_chunked_array(x))
.Internal(inspect(x_altrep))
#> @1064c4180 14 REALSXP g0c0 [REF(65535)] arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000
# Probably a better implementation than base R's
cpp11::cpp_function("
cpp11::logicals is_na2(cpp11::doubles x) {
  int region_size = 1024;
  R_xlen_t n = x.size();
  cpp11::writable::logicals out(n);
  // Scratch buffer that holds one region of x at a time
  cpp11::writable::doubles buf_shelter(region_size);
  double* buf = REAL(buf_shelter);
  for (R_xlen_t i = 0; i < n; i++) {
    // Refill the buffer at the start of each region
    if ((i % region_size) == 0) {
      REAL_GET_REGION(x, i, region_size, buf);
    }
    out[i] = ISNAN(buf[i % region_size]);
  }
  return out;
}
")
bench::mark(
is.na(x),
is.na(x_altrep),
is_na2(x),
is_na2(x_altrep)
)
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 is.na(x) 23.8ms 24.1ms 40.6 113MB 14.5
#> 2 is.na(x_altrep) 315.7ms 317.6ms 3.15 113MB 0
#> 3 is_na2(x) 60.8ms 61.4ms 16.3 113MB 5.44
#> 4 is_na2(x_altrep) 62ms 62.3ms 16.0 113MB 5.35
# Make sure we didn't materialize
.Internal(inspect(x_altrep))
#> @1064c4180 14 REALSXP g1c0 [MARK,REF(65535)] arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000
Created on 2024-01-03 with reprex v2.0.2
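Since the original report was about sum(is.na(x)), the same region-based idea can be applied to the count directly, which also avoids allocating the intermediate logical vector. This is only a sketch under the same assumptions as is_na2() above (cpp11 and a compiler available); count_na() is a hypothetical name, and I haven't benchmarked it.

```r
# Hypothetical helper: count NA/NaN values by reading x one region at a time
cpp11::cpp_function("
double count_na(cpp11::doubles x) {
  const R_xlen_t region_size = 1024;
  R_xlen_t n = x.size();
  double buf[1024];
  double n_na = 0;
  for (R_xlen_t start = 0; start < n; start += region_size) {
    // REAL_GET_REGION() returns the number of elements actually copied,
    // which is smaller than region_size for the final region
    R_xlen_t n_copied = REAL_GET_REGION(x, start, region_size, buf);
    for (R_xlen_t j = 0; j < n_copied; j++) {
      if (ISNAN(buf[j])) n_na += 1;
    }
  }
  return n_na;
}
")

# Should agree with base R on plain and ALTREP vectors alike
v <- c(1, NA, NaN, 3)
stopifnot(count_na(v) == sum(is.na(v)))
```

The count is returned as a double rather than an int so that it cannot overflow on long vectors.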
It does raise the question of whether enabling ALTREP by default is worth the trouble.
Describe the bug, including details regarding any error messages, version, and platform.
When evaluating operations such as checking for NA, the ALTREP data type slows the calculation by a factor of approximately four. Tested in arrow 10, 12 and 14 on Ubuntu.
Results:
With thanks to Barry for the reprex https://mastodon.scot/@geospacedman@mastodon.social/111450704657241188
Component(s)
R