HenrikBengtsson / matrixStats

R package: Methods that Apply to Rows and Columns of Matrices (and to Vectors)
https://cran.r-project.org/package=matrixStats
203 stars 33 forks source link

In-order iteration for rowSums2() #261

Closed yaccos closed 3 months ago

yaccos commented 3 months ago

As we have discussed previously, rowSums2() has been less performant than colSums2() due to caching issues. Therefore, I have now rewritten the C source code for rowSums2() (and consequently for colSums2() as they use the same underlying C function). Iteration over the data matrix now happens in the order it is stored in memory, even for row sums.

This new approach has two major advantages:

The only downsides I can see in this new version are:

As for the benchmarks, I have been running with R version 4.3.2 under macOS Monterey 12.7.5 with Apple clang version 14.0.0 (clang-1400.0.29.202) on an Intel(R) Core(TM) i5-5257U. I have used the default compiler flags, in particular -O2. Under this setup, LDOUBLE resolves to double, not long double.

library(matrixStats)
library(bench)

benchmark_function <- function(nr, nc) {
  x <- matrix(rnorm(nr*nc), nrow = nr, ncol = nc)
  x_t <- t(x)
  br <- mark(rowSums = rowSums(x),
             rowSums2 = rowSums2(x),
             colSums = colSums(x_t),
             colSums2 = colSums2(x_t)
             )
  br[, 1:7]
}

for(n in c(10L,100L,1000L,10000L)){
  message(sprintf("n=%d",n))
  br <- benchmark_function(n,n)
  print(br)
}

# Current matrixStats devel
n=10
# A tibble: 4 × 7
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 rowSums      6.37µs   7.06µs   126036.        0B     12.6  9999
2 rowSums2     3.31µs   3.85µs   204957.   11.02KB     20.5  9999
3 colSums      6.32µs   6.94µs   128198.        0B     12.8  9999
4 colSums2     3.32µs    3.9µs   218422.    8.55KB      0   10000
n=100
# A tibble: 4 × 7
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 rowSums      27.4µs   28.3µs    32960.      848B     3.30  9999
2 rowSums2     15.7µs   16.3µs    58609.      848B     0    10000
3 colSums      13.1µs   13.9µs    68872.      848B     6.89  9999
4 colSums2     14.9µs   15.4µs    61054.      848B     6.11  9999
n=1000
# A tibble: 4 × 7
  expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 rowSums        2.1ms   2.12ms      468.    7.86KB        0   234
2 rowSums2      2.41ms   2.56ms      318.    7.86KB        0   159
3 colSums     999.57µs   1.01ms      919.    7.86KB        0   460
4 colSums2      1.05ms   1.07ms      863.    7.86KB        0   432
n=10000
# A tibble: 4 × 7
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 rowSums     225.6ms 226.64ms     4.40     78.2KB        0     3
2 rowSums2      1.54s    1.54s     0.649    78.2KB        0     1
3 colSums    101.23ms 103.51ms     9.66     78.2KB        0     5
4 colSums2   102.25ms 110.63ms     9.09     78.2KB        0     5

# This PR
n=10
# A tibble: 4 × 7
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 rowSums      6.61µs   9.19µs    88755.        0B     8.88  9999
2 rowSums2      3.5µs   3.89µs   242798.   11.02KB     0    10000
3 colSums      6.32µs   6.97µs   132089.        0B    13.2   9999
4 colSums2     3.49µs      4µs   220780.    8.55KB    22.1   9999
n=100
# A tibble: 4 × 7
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 rowSums      27.6µs   28.4µs    28319.      848B     2.83  9999
2 rowSums2     13.7µs   14.3µs    60658.    1.66KB     6.07  9999
3 colSums      13.3µs   14.1µs    47771.      848B     4.78  9999
4 colSums2     14.6µs   15.9µs    33561.      848B     0    10000
n=1000
# A tibble: 4 × 7
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 rowSums      2.09ms   2.12ms      458.    7.86KB        0   229
2 rowSums2   966.35µs 988.48µs      912.   15.73KB        0   456
3 colSums         1ms   1.02ms      881.    7.86KB        0   441
4 colSums2     1.04ms   1.06ms      920.    7.86KB        0   461
n=10000
# A tibble: 4 × 7
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
1 rowSums     213.7ms  220.9ms      4.50    78.2KB        0     3
2 rowSums2     93.4ms   95.3ms     10.4    156.4KB        0     6
3 colSums      99.9ms  100.4ms      9.92    78.2KB        0     5
4 colSums2    101.2ms  101.7ms      9.52    78.2KB        0     5
HenrikBengtsson commented 3 months ago

Thanks a bunch. Looks good to me.