Open karoliskoncevicius opened 1 year ago
Thanks for the report. If you search this issue tracker, I think you'll find other examples. If you search the R-help and R-devel mailing lists, you'll find similar precision issues with base R functions.
We might be able to improve this, but there will always be precision issues because of floating-point arithmetic. The only safe way to compare to 1.0 is to compare with a tolerance. For example, all.equal() for numeric values has a tolerance = sqrt(.Machine$double.eps) argument. So, something like:
> tolerance <- sqrt(.Machine$double.eps)
> abs(rowCumsums(t(1/rep(11, 11))) - 1) < tolerance
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
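The same idea can be wrapped in a small helper; the name near_one() below is made up for illustration and is not part of matrixStats:

## Compare against 1.0 with a tolerance instead of exact equality
near_one <- function(x, tolerance = sqrt(.Machine$double.eps)) {
  abs(x - 1) < tolerance
}
near_one(matrixStats::rowCumsums(t(1/rep(11, 11))))  # TRUE only in the last column
isTRUE(all.equal(1 + .Machine$double.eps, 1.0))      # TRUE; same tolerance idea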
> I tried using rowCumsums() but psmirnov() then frequently returned NA values.

Can you please give a reproducible example with stats::psmirnov() here? It could be that stats::psmirnov() is problematic and should be fixed.
For any developer/contributor who wishes to look into this, the C-level source code is in https://github.com/HenrikBengtsson/matrixStats/blob/develop/src/rowCumsums_lowlevel_template.h. I think the culprit is that although we're adding the current cumulative sum to the next value using long double (LDOUBLE), we're rounding to double in each iteration:
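Presumably the line in question is the one the patch below touches, i.e. where the long-double sum is cast back to double before being stored:

ans[kk] = (ANS_C_TYPE) ((LDOUBLE) ans[kk_prev] + (LDOUBLE) xvalue);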
This can probably be improved.
FWIW, the following patch makes no difference for this use case. It still gives:
> matrixStats::rowCumsums(t(1/rep(11, 11))) - 1.0
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.9090909 -0.8181818 -0.7272727 -0.6363636 -0.5454545 -0.4545455
[,7] [,8] [,9] [,10] [,11]
[1,] -0.3636364 -0.2727273 -0.1818182 -0.09090909 2.220446e-16
$ git diff -w src/rowCumsums_lowlevel_template.h
diff --git a/src/rowCumsums_lowlevel_template.h b/src/rowCumsums_lowlevel_template.h
index 0260f92..29cb2bd 100644
--- a/src/rowCumsums_lowlevel_template.h
+++ b/src/rowCumsums_lowlevel_template.h
@@ -33,7 +33,7 @@ void CONCAT_MACROS(rowCumsums, X_C_SIGNATURE)(X_C_TYPE *x, R_xlen_t nrow, R_xlen
R_xlen_t ii, jj, kk, kk_prev, idx;
R_xlen_t colBegin;
X_C_TYPE xvalue;
- LDOUBLE value;
+ LDOUBLE value, csum;
#if ANS_TYPE == 'i'
double R_INT_MIN_d = (double)R_INT_MIN,
@@ -86,7 +86,9 @@ void CONCAT_MACROS(rowCumsums, X_C_SIGNATURE)(X_C_TYPE *x, R_xlen_t nrow, R_xlen
ans[kk] = ANS_NA;
}
#else
- ans[kk] = (ANS_C_TYPE) ((LDOUBLE) ans[kk_prev] + (LDOUBLE) xvalue);
+ if (kk_prev == 0) csum = (LDOUBLE) ans[0];
+ csum = csum + (LDOUBLE) xvalue;
+ ans[kk] = (ANS_C_TYPE) csum;
#endif
kk++;
@@ -127,7 +129,7 @@ void CONCAT_MACROS(rowCumsums, X_C_SIGNATURE)(X_C_TYPE *x, R_xlen_t nrow, R_xlen
ans[kk] = ANS_NA;
}
#else
- value += xvalue;
+ value += (LDOUBLE) xvalue;
ans[kk] = (ANS_C_TYPE) value;
#endif
Thank you for such a quick reply. Here is a somewhat artificial demonstration with psmirnov():
a <- cumsum(t(rep(1/21, 21)))
b <- rowCumsums(t(rep(1/21, 21)))
all.equal(as.vector(a), as.vector(b))
[1] TRUE
psmirnov(a[21], c(10,11))
[1] 0.9999943
psmirnov(b[21], c(10,11))
[1] NA
psmirnov() is quite a new function in R, so it might have problems. But somehow it never gives issues with base::cumsum:
aps <- bps <- numeric(100)
for(i in 5:100) {
a <- cumsum(t(rep(1/i, i)))
b <- rowCumsums(t(rep(1/i, i)))
aps[i] <- psmirnov(a[i], c(floor(i/2),ceiling(i/2)))
bps[i] <- psmirnov(b[i], c(floor(i/2),ceiling(i/2)))
}
sum(is.na(aps)) # how many NAs from psmirnov() with base::cumsum?
[1] 0
sum(is.na(bps)) # how many NAs from psmirnov() with rowCumsums?
[1] 39
The 8th line in psmirnov() probably makes the difference:
head(psmirnov, 10)
1 function (q, sizes, z = NULL, two.sided = TRUE, exact = TRUE,
2 simulate = FALSE, B = 2000, lower.tail = TRUE, log.p = FALSE)
3 {
4 if (is.numeric(q))
5 q <- as.double(q)
6 else stop("argument 'q' must be numeric")
7 ret <- rep.int(0, length(q))
8 ret[is.na(q) | q < -1 | q > 1] <- NA
9 IND <- which(!is.na(ret))
10 if (!length(IND))
Maybe it should be fixed on the psmirnov() side. But still, it seems like, aside from the general issues of numerical precision in computing, base::cumsum and rowCumsums do not deviate randomly from one another; rather, base::cumsum seems to be more accurate.
More notes for clarification. The same precision issue happens if we calculate the cumulative sum manually using vanilla addition in R:
my_cumsum <- function(x) {
  y <- double(length(x))
  y[1] <- x[1]
  for (kk in seq(from = 2, to = length(x))) y[kk] <- y[kk - 1] + x[kk]
  y
}
For example:
x <- rep(1/11, times = 11L)
y1 <- cumsum(x)
y1 - 1
## [1] -0.90909091 -0.81818182 -0.72727273 -0.63636364 -0.54545455
## [6] -0.45454545 -0.36363636 -0.27272727 -0.18181818 -0.09090909
## [11] 0.00000000
y2 <- my_cumsum(x)
y2 - 1
## [1] -9.090909e-01 -8.181818e-01 -7.272727e-01 -6.363636e-01 -5.454545e-01
## [6] -4.545455e-01 -3.636364e-01 -2.727273e-01 -1.818182e-01 -9.090909e-02
## [11] 2.220446e-16
y2-y1
## [1] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [6] 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16
## [11] 2.220446e-16
y3 <- as.vector(matrixStats::rowCumsums(x, dim. = c(1, length(x))))
y3 - 1
## [1] -9.090909e-01 -8.181818e-01 -7.272727e-01 -6.363636e-01 -5.454545e-01
## [6] -4.545455e-01 -3.636364e-01 -2.727273e-01 -1.818182e-01 -9.090909e-02
## [11] 2.220446e-16
y3-y1
## [1] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [6] 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16
## [11] 2.220446e-16
y3-y2
## [1] 0 0 0 0 0 0 0 0 0 0 0
Since y1 != y2 and both can be considered correct ways of calculating the cumulative sum, I think one could argue that psmirnov() is too sensitive to "numeric noise".
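To make the sensitivity concrete, here is a minimal sketch of what the q > 1 guard on line 8 does with a value that overshoots 1.0 by a single ulp, as y3[11] above does:

q <- 1 + .Machine$double.eps  # one ulp above 1, like y3[11] above
q > 1
## [1] TRUE
psmirnov(q, c(10, 11))  # line 8 maps any q > 1 to NA
## [1] NA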
Here's an alternative version that increases the precision by using sum() instead of +:
my_cumsum2 <- function(x) {
  y <- double(length(x))
  y[1] <- x[1]
  for (kk in seq(from = 2, to = length(x))) {
    tmp <- x[seq_len(kk)]
    y[kk] <- sum(tmp)
  }
  y
}
> y4 <- my_cumsum2(x)
> y4-1
[1] -0.90909091 -0.81818182 -0.72727273 -0.63636364 -0.54545455 -0.45454545
[7] -0.36363636 -0.27272727 -0.18181818 -0.09090909 0.00000000
> y4-y1
[1] 0 0 0 0 0 0 0 0 0 0 0
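As a side note, the reason y4 matches y1 exactly is, I believe, that sum() accumulates in extended precision internally and rounds to double only once per call, which is also what base::cumsum() effectively does; the price is that my_cumsum2() is quadratic in length(x).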
Compared with base::cumsum(), rowCumsums() seems to be less precise with floating-point operations. As an example, take calculating cumulative sums of repeated fractions that should all sum to one in the end. Every time, cumsum() from base R returns 1 as the final result, while most of the time the result of rowCumsums() is either slightly below or slightly above 1.
This can have practical consequences: recently I was trying to speed up Kolmogorov-Smirnov test calculations, and the statistic requires calculating cumulative sums. I tried using rowCumsums(), but psmirnov() then frequently returned NA values. The trouble is that the statistic should be bounded between 0 and 1, and rowCumsums() yields numbers that are a tiny bit below zero or above 1.
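Until this is resolved, a minimal workaround sketch, assuming the deviations are only floating-point noise of a few ulps, is to clamp the statistic into its theoretical range before passing it to psmirnov():

d <- matrixStats::rowCumsums(t(rep(1/21, 21)))
d <- pmin(pmax(d, 0), 1)           # clamp into [0, 1]; assumes tiny numeric noise only
stats::psmirnov(d[21], c(10, 11))  # no longer NA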