jphill01 closed this issue 5 years ago.
Not near a computer for a few days, but I suspect drop(x) is what takes time, not colQuantiles(). Try adding a solo drop(x) entry to the benchmark to see this, or drop that dimension before benchmarking.
Forgot to say, colQuantiles() is one of the functions that is still not implemented in native code.
Thanks. Any idea at what point colQuantiles() will be optimized?
Since drop(x) is most likely the culprit here, a native implementation is unlikely to make a big difference.
Note, matrixStats is designed for matrices (2d arrays) and some vectors (1d arrays). Your example uses a 3d array.
Alright, thanks. I guess I will have to find a faster version of drop(), or wait until an 'arrayStats' package is released...
Now at a computer. My guess above was incorrect; it's not drop() that is the culprit. I expected it to slow things down due to memory copying. Here are some benchmarks confirming this:
library(matrixStats)
x <- array(1:(10000*1000), dim = c(10000, 1000, 1))
xd <- drop(x)
y0 <- apply(x, MARGIN = 2L, FUN = quantile, probs = 0.975)
y1 <- colQuantiles(drop(x), probs = 0.975)
y2 <- colQuantiles(xd, probs = 0.975)
stopifnot(identical(y1, y0), identical(y2, y0))
stats <- microbenchmark::microbenchmark(
apply(x, MARGIN = 2L, FUN = quantile, probs = 0.975),
colQuantiles(drop(x), probs = 0.975),
colQuantiles(xd, probs = 0.975),
drop(x),
unit = "ms",
times = 10L
)
print(stats)
## Unit: milliseconds
##                                                  expr        min         lq        mean     median         uq        max neval cld
##  apply(x, MARGIN = 2L, FUN = quantile, probs = 0.975) 142.430746 143.457675 165.3461039 150.036045 170.998975 267.598677    10  b
##                  colQuantiles(drop(x), probs = 0.975) 286.959245 363.392250 399.3671685 384.558625 457.812438 481.906615    10  d
##                       colQuantiles(xd, probs = 0.975) 264.892077 303.891283 345.1908052 364.407124 384.364789 389.169873    10  c
##                                               drop(x)   0.000419   0.000782   0.0029067   0.002321   0.002681   0.011392    10  a
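The near-zero timing for drop(x) makes sense: removing an extent-1 dimension only adjusts the dimension metadata, not the data itself. The same principle can be seen in NumPy (Python used here purely as a language-agnostic illustration; np.squeeze is the analogue of R's drop()):

```python
import numpy as np

# Analogue of array(1:(10000*1000), dim = c(10000, 1000, 1)) in R.
x = np.arange(10000 * 1000).reshape(10000, 1000, 1)

# np.squeeze plays the role of R's drop(): it removes extent-1 dimensions
# by returning a view with adjusted shape metadata -- no data is copied.
xd = np.squeeze(x)
print(xd.shape)                 # (10000, 1000)
print(np.shares_memory(x, xd))  # True: same buffer, which is why it is cheap
```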
In other words, there's indeed room for improvement.
Thanks again! My original code has been released to CRAN as part of an R package.
It's sufficiently fast for now, but whenever you get around to optimizing colQuantiles(), I will use that instead of stats::quantile(), which can be improved based on code profiling.
Benchmarking with:
library(matrixStats)
options(width=120)
X <- matrix(1:(10000*1000), nrow=10000L, ncol=1000L)
y0 <- apply(X, MARGIN=2L, FUN=quantile, probs=0.975)
y1 <- colQuantiles(X, probs=0.975)
stopifnot(identical(y1, y0))
stats <- bench::mark(
apply(X, MARGIN=2L, FUN=quantile, probs=0.975),
colQuantiles(X, probs=0.975),
min_iterations=10L
)
It does not look like replacing the generic sort() with sort.int() (Issue #155) makes a difference, i.e. that's not the culprit:
## matrixStats 0.55.0
print(stats)
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
## <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
## 1 apply(X, MARGIN = 2L, FUN = quantile, probs = 0.975) 146ms 155ms 6.03 191MB 19.9 10 33 1.66s
## 2 colQuantiles(X, probs = 0.975) 286ms 310ms 3.18 496MB 20.1 10 63 3.14s
## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
## matrixStats 0.55.0-9000 (with sort.int() instead of generic sort())
print(stats)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 apply(X, MARGIN = 2L, FUN = quantile, probs = 0.975) 151ms 174ms 5.38 191MB 17.7 10 33 1.86s
2 colQuantiles(X, probs = 0.975) 292ms 333ms 2.89 496MB 23.4 10 81 3.46s
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Now, looking at the above benchmark results, it's clear that colQuantiles() makes ~2.6 times more memory allocations than the apply() baseline (496 MB vs 191 MB). That needs to be investigated.
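The ~2.6x figure can be read straight off the mem_alloc column of the bench::mark output above (Python used here merely as a calculator):

```python
# Ratio of the mem_alloc column from the bench::mark run above (MB):
colquantiles_mb = 496   # colQuantiles(X, probs = 0.975)
apply_mb = 191          # apply(X, MARGIN = 2L, FUN = quantile, probs = 0.975)
print(round(colquantiles_mb / apply_mb, 1))  # 2.6
```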
The dominant memory allocations are:
> p1 <- profmem::profmem(y1 <- colQuantiles(X, probs=0.975))
> subset(p1, bytes > 100e3)
Rprofmem memory profiling of:
y1 <- colQuantiles(X, probs = 0.975)
Memory allocations:
what bytes calls
4 alloc 40000048 colQuantiles() -> apply() -> aperm() -> aperm.default()
4042 alloc 40000048 colQuantiles() -> apply() -> unlist()
4043 alloc 40000048 colQuantiles() -> apply() -> array()
4044 alloc 40000048 colQuantiles() -> apply() -> aperm() -> aperm.default()
6051 alloc 40000048 colQuantiles() -> apply() -> aperm() -> aperm.default()
total 200000240
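Each of those 40,000,048-byte allocations corresponds to one full copy of the 10000x1000 integer matrix, and the apply() fallback makes five of them. A quick arithmetic check (the 48-byte figure is an approximation of R's vector object overhead, not an exact internal constant):

```python
# Each profmem entry above is one full copy of the 10000 x 1000 integer
# matrix: 4 bytes per 32-bit R integer, plus ~48 bytes of object header
# (header size is an approximation of R's internal overhead).
elements = 10000 * 1000
copy_bytes = elements * 4 + 48
print(copy_bytes)       # 40000048, matching each logged allocation
print(5 * copy_bytes)   # 200000240, matching the reported total
```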
Great news: In matrixStats 0.55.0-9000 (develop branch) we now have:
PERFORMANCE:
- colQuantiles() and rowQuantiles() with the default type=7L and when there are no missing values is now significantly faster and uses significantly fewer memory allocations.
For the example in this issue, the speedup is ~7 times (compared to https://github.com/HenrikBengtsson/matrixStats/issues/153#issuecomment-530059331), which means colQuantiles() is now significantly faster (here ~3 times) than apply(..., MARGIN=2L, FUN=stats::quantile).
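For context, the type=7L estimator covered by the fast path is plain linear interpolation between two adjacent order statistics, which is what makes it amenable to a tight native loop. A minimal sketch of the rule (in Python/NumPy for illustration; np.quantile's default interpolation implements the same type-7 rule):

```python
import math
import numpy as np

def quantile_type7(x, p):
    """Sketch of R's default type=7 quantile (assumes no missing values):
    linear interpolation between two adjacent order statistics."""
    x = np.sort(x)                  # full sort; a partial sort suffices for one p
    h = (len(x) - 1) * p            # fractional index into the sorted values
    lo = int(math.floor(h))
    hi = min(lo + 1, len(x) - 1)
    return x[lo] + (h - lo) * (x[hi] - x[lo])

x = np.arange(1, 10001)
# Both should give ~9750.025 for the example data used in this issue:
print(quantile_type7(x, 0.975), np.quantile(x, 0.975))
```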
> library(matrixStats)
> options(width=120)
>
> X <- matrix(1:(10000*1000), nrow=10000L, ncol=1000L)
> y0 <- apply(X, MARGIN=2L, FUN=quantile, probs=0.975)
> y1 <- colQuantiles(X, probs=0.975)
> stopifnot(identical(y1, y0))
>
> stats <- bench::mark(
+ apply(X, MARGIN=2L, FUN=quantile, probs=0.975),
+ colQuantiles(X, probs=0.975),
+ min_iterations=10L
+ )
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
> stats
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl>
1 apply(X, MARGIN = 2L, FUN = quantile, probs = 0.975) 209.6ms 218.6ms 4.56 191MB 20.5 10 45
2 colQuantiles(X, probs = 0.975) 68.5ms 70.4ms 13.7 115MB 28.8 10 21
# … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>, gc <list>
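Reading the medians off the bench::mark table above confirms the claimed ~3x advantage over apply() (again, Python used just as a calculator):

```python
# Median timings from the bench::mark output above (milliseconds):
apply_ms = 218.6          # apply(X, MARGIN = 2L, FUN = quantile, probs = 0.975)
colquantiles_ms = 70.4    # colQuantiles(X, probs = 0.975) after the fix
print(round(apply_ms / colquantiles_ms, 1))  # 3.1
```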
This was achieved by avoiding lots of large memory allocations. Compared to https://github.com/HenrikBengtsson/matrixStats/issues/153#issuecomment-530059331, those large allocations are no longer done:
> p1 <- profmem::profmem(y1 <- colQuantiles(X, probs=0.975))
> subset(p1, bytes > 50e3)
Rprofmem memory profiling of:
y1 <- colQuantiles(X, probs = 0.975)
Memory allocations:
what bytes calls
total 0
PS. There's still some room for improvement, but it's minor compared to what was achieved here.
Thanks! Do you know when you plan to submit an updated package to CRAN?
No guesstimate. matrixStats 0.55.0 was just released, and it was a multi-day effort to check it across platforms and run all of the 244 reverse-dependency checks (I have access to a computer cluster, which helped). So, releasing matrixStats is a bit of a time commitment.
FYI, matrixStats 0.56.0 with this speedup is now on CRAN.
Excellent! Thanks Henrik!
Computing quantiles via apply() across columns (MARGIN = 2) appears to be faster than using colQuantiles(). How can this be? Since colQuantiles is written in pure C, it should be much faster than the pure R apply(), but it's actually the other way around...
library(matrixStats)
library(microbenchmark)
x <- array(1:(10000 * 1000), dim = c(10000, 1000, 1)) # example array
microbenchmark(apply(x, 2, function(x) quantile(x, 0.975)), colQuantiles(drop(x), probs = 0.975)) # timing each function
Unit: milliseconds
                                        expr      min       lq     mean   median       uq      max neval
 apply(x, 2, function(x) quantile(x, 0.975)) 252.7643 263.4260 278.0485 268.7132 281.7958 356.3389   100
        colQuantiles(drop(x), probs = 0.975) 447.6436 514.2717 537.6342 526.2200 573.9136 615.9201   100
identical(apply(x, 2, function(x) quantile(x, 0.975)), colQuantiles(drop(x), probs = 0.975))
[1] TRUE
Can someone explain?