As we have discussed previously, rowSums2() has been slower than colSums2() due to caching issues. Therefore, I have now rewritten the C source code for rowSums2() (and consequently for colSums2(), as they use the same underlying C function). Iteration over the data matrix now happens in the order it is stored in memory (column by column), even for row sums.
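To illustrate the idea, here is a minimal sketch of row sums computed while walking a column-major matrix in memory order. It is not the code in this PR: the function name `row_sums_colmajor`, the `acc_t` typedef (standing in for LDOUBLE), and the error convention are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical accumulator type; stands in for LDOUBLE in this sketch. */
typedef double acc_t;

/* Hypothetical helper, not the package's actual C API.
 * x   : nrow-by-ncol matrix in column-major order (as R stores matrices)
 * ans : output buffer with room for nrow doubles
 * Returns 0 on success, -1 if the accumulator buffer cannot be allocated. */
int row_sums_colmajor(const double *x, size_t nrow, size_t ncol, double *ans) {
    /* One accumulator per row: this is the extra memory the approach needs. */
    acc_t *acc = calloc(nrow > 0 ? nrow : 1, sizeof(acc_t));
    if (acc == NULL) return -1;

    /* Outer loop over columns, inner loop over rows, so that x is read
     * strictly sequentially from its first element to its last. */
    for (size_t j = 0; j < ncol; j++) {
        const double *col = x + j * nrow;
        for (size_t i = 0; i < nrow; i++) {
            acc[i] += col[i];
        }
    }

    for (size_t i = 0; i < nrow; i++) ans[i] = (double)acc[i];
    free(acc);
    return 0;
}
```

The inner loop touches consecutive memory addresses, whereas a row-by-row traversal would jump `nrow` elements between reads. The price is the per-row accumulator buffer.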
This new approach has two major advantages:
- Benchmarks (see below) have shown that rowSums2() is now as fast as, if not faster than, colSums2().
- The internal C API can be simplified. The old API switched the meaning of columns and rows for colSums2() in order to reuse the rowSums2() code, which made the code far harder to reason about. In the suggested version, rows and columns really mean rows and columns in the internal C code.
The only downsides I can see in this new version are:
- Higher memory usage for rowSums2() (but not for colSums2()). However, I don't think the memory overhead is an issue here because it is far smaller than the size of the data matrix unless the matrix is very small or very tall.
- The internal C API for rowSums2() is structured differently from the other functions, which may lead to confusion.
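As a rough back-of-the-envelope check of the memory claim, the sketch below assumes that the extra memory is exactly one accumulator per row. That is an assumption for illustration, not something verified against the code in this PR, and the function name `accumulator_overhead_ratio` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: ratio of accumulator memory to matrix memory,
 * ASSUMING one accumulator of acc_size bytes per row and a matrix of
 * nrow * ncol elements of elem_size bytes each. nrow cancels out:
 *   (nrow * acc_size) / (nrow * ncol * elem_size)
 *     = acc_size / (ncol * elem_size)                                  */
double accumulator_overhead_ratio(size_t ncol, size_t acc_size, size_t elem_size) {
    return (double)acc_size / ((double)ncol * (double)elem_size);
}
```

Under that assumption, with 8-byte accumulators and 8-byte elements the relative overhead is 1/ncol, so it only becomes noticeable when the matrix has very few columns, which is the "very tall" case.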
As for the benchmarks, I ran them with R version 4.3.2 under macOS Monterey 12.7.5 with Apple clang version 14.0.0 (clang-1400.0.29.202) on an Intel(R) Core(TM) i5-5257U. I used the default compiler flags, in particular -O2. Under this setup, LDOUBLE resolves to double, not long double.