Open karoliskoncevicius opened 1 year ago
Thanks for the report. If you search this issue tracker, I think you'll find other examples. If you search the R-help and R-devel mailing lists, you'll find similar precision issues with base R functions.
We might be able to improve this, but there will always be precision issues because of floating-point arithmetic. The only safe way to compare to 1.0 is to compare with a tolerance. For example, all.equal() for numeric values has a tolerance = sqrt(.Machine$double.eps) argument. So, something like:
> tolerance <- sqrt(.Machine$double.eps)
> abs(rowCumsums(t(1/rep(11, 11))) - 1) < tolerance
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
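The same idea can be wrapped in a small helper; the name near_one() below is made up for illustration and is not part of matrixStats:

## Compare against 1.0 with a tolerance instead of exact equality
near_one <- function(x, tolerance = sqrt(.Machine$double.eps)) {
  abs(x - 1) < tolerance
}
near_one(matrixStats::rowCumsums(t(1/rep(11, 11))))  # TRUE only in the last column
isTRUE(all.equal(1 + .Machine$double.eps, 1.0))      # TRUE; same tolerance idea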
> I tried using rowCumsums() but psmirnov() then frequently returned NA values.

Can you please give a reproducible example with stats::psmirnov() here? It could be that stats::psmirnov() is problematic and should be fixed.
For any developer/contributor who wishes to look into this, the C-level source code is in https://github.com/HenrikBengtsson/matrixStats/blob/develop/src/rowCumsums_lowlevel_template.h. I think the culprit is that although we're adding the current cumulative sum to the next value using long double (LDOUBLE), we're rounding to double in each iteration:
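Presumably the line in question is the one the patch below touches, i.e. where the long-double sum is cast back to double before being stored:

ans[kk] = (ANS_C_TYPE) ((LDOUBLE) ans[kk_prev] + (LDOUBLE) xvalue);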
This can probably be improved.
FWIW, the following patch makes no difference for this use case. It still gives:
> matrixStats::rowCumsums(t(1/rep(11, 11))) - 1.0
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.9090909 -0.8181818 -0.7272727 -0.6363636 -0.5454545 -0.4545455
[,7] [,8] [,9] [,10] [,11]
[1,] -0.3636364 -0.2727273 -0.1818182 -0.09090909 2.220446e-16
$ git diff -w src/rowCumsums_lowlevel_template.h
diff --git a/src/rowCumsums_lowlevel_template.h b/src/rowCumsums_lowlevel_template.h
index 0260f92..29cb2bd 100644
--- a/src/rowCumsums_lowlevel_template.h
+++ b/src/rowCumsums_lowlevel_template.h
@@ -33,7 +33,7 @@ void CONCAT_MACROS(rowCumsums, X_C_SIGNATURE)(X_C_TYPE *x, R_xlen_t nrow, R_xlen
R_xlen_t ii, jj, kk, kk_prev, idx;
R_xlen_t colBegin;
X_C_TYPE xvalue;
- LDOUBLE value;
+ LDOUBLE value, csum;
#if ANS_TYPE == 'i'
double R_INT_MIN_d = (double)R_INT_MIN,
@@ -86,7 +86,9 @@ void CONCAT_MACROS(rowCumsums, X_C_SIGNATURE)(X_C_TYPE *x, R_xlen_t nrow, R_xlen
ans[kk] = ANS_NA;
}
#else
- ans[kk] = (ANS_C_TYPE) ((LDOUBLE) ans[kk_prev] + (LDOUBLE) xvalue);
+ if (kk_prev == 0) csum = (LDOUBLE) ans[0];
+ csum = csum + (LDOUBLE) xvalue;
+ ans[kk] = (ANS_C_TYPE) csum;
#endif
kk++;
@@ -127,7 +129,7 @@ void CONCAT_MACROS(rowCumsums, X_C_SIGNATURE)(X_C_TYPE *x, R_xlen_t nrow, R_xlen
ans[kk] = ANS_NA;
}
#else
- value += xvalue;
+ value += (LDOUBLE) xvalue;
ans[kk] = (ANS_C_TYPE) value;
#endif
Thank you for such a quick reply. Here is a somewhat artificial demonstration with psmirnov():
a <- cumsum(t(rep(1/21, 21)))
b <- rowCumsums(t(rep(1/21, 21)))
all.equal(as.vector(a), as.vector(b))
[1] TRUE
psmirnov(a[21], c(10,11))
[1] 0.9999943
psmirnov(b[21], c(10,11))
[1] NA
psmirnov() is quite a new function in R, so it might have problems. But somehow it never gives issues with base::cumsum:
aps <- bps <- numeric(100)
for(i in 5:100) {
a <- cumsum(t(rep(1/i, i)))
b <- rowCumsums(t(rep(1/i, i)))
aps[i] <- psmirnov(a[i], c(floor(i/2),ceiling(i/2)))
bps[i] <- psmirnov(b[i], c(floor(i/2),ceiling(i/2)))
}
sum(is.na(aps)) # how many NAs from psmirnov() with base::cumsum?
[1] 0
sum(is.na(bps)) # how many NAs from psmirnov() with rowCumsums?
[1] 39
The 8th line in psmirnov() probably makes the difference:
head(psmirnov, 10)
1 function (q, sizes, z = NULL, two.sided = TRUE, exact = TRUE,
2 simulate = FALSE, B = 2000, lower.tail = TRUE, log.p = FALSE)
3 {
4 if (is.numeric(q))
5 q <- as.double(q)
6 else stop("argument 'q' must be numeric")
7 ret <- rep.int(0, length(q))
8 ret[is.na(q) | q < -1 | q > 1] <- NA
9 IND <- which(!is.na(ret))
10 if (!length(IND))
Maybe it should be fixed on the psmirnov() side. But still, it seems like, aside from the general issues of numerical precision in computing, base::cumsum and rowCumsums do not deviate randomly from one another; rather, base::cumsum seems to be more accurate.
More notes for clarification. The same precision issue happens if we calculate the cumulative sum manually using vanilla addition in R:
my_cumsum <- function(x) {
  y <- double(length(x))
  y[1] <- x[1]
  for (kk in seq(from = 2, to = length(x))) y[kk] <- y[kk - 1] + x[kk]
  y
}
For example:
x <- rep(1/11, times = 11L)
y1 <- cumsum(x)
y1 - 1
## [1] -0.90909091 -0.81818182 -0.72727273 -0.63636364 -0.54545455
## [6] -0.45454545 -0.36363636 -0.27272727 -0.18181818 -0.09090909
## [11] 0.00000000
y2 <- my_cumsum(x)
y2 - 1
## [1] -9.090909e-01 -8.181818e-01 -7.272727e-01 -6.363636e-01 -5.454545e-01
## [6] -4.545455e-01 -3.636364e-01 -2.727273e-01 -1.818182e-01 -9.090909e-02
## [11] 2.220446e-16
y2-y1
## [1] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [6] 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16
## [11] 2.220446e-16
y3 <- as.vector(matrixStats::rowCumsums(x, dim. = c(1, length(x))))
y3 - 1
## [1] -9.090909e-01 -8.181818e-01 -7.272727e-01 -6.363636e-01 -5.454545e-01
## [6] -4.545455e-01 -3.636364e-01 -2.727273e-01 -1.818182e-01 -9.090909e-02
## [11] 2.220446e-16
y3-y1
## [1] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [6] 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16 1.110223e-16
## [11] 2.220446e-16
y3-y2
## [1] 0 0 0 0 0 0 0 0 0 0 0
Since y1 != y2 and both can be considered correct ways of calculating the cumulative sum, I think one could argue that psmirnov() is too sensitive to "numeric noise".
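To make the sensitivity concrete, here is a minimal sketch of what the q > 1 guard on line 8 does with a value that overshoots 1.0 by a single ulp, as y3[11] above does:

q <- 1 + .Machine$double.eps  # one ulp above 1, like y3[11] above
q > 1
## [1] TRUE
psmirnov(q, c(10, 11))  # line 8 maps any q > 1 to NA
## [1] NA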
Here's an alternative version that increases the precision by using sum() instead of +:
my_cumsum2 <- function(x) {
  y <- double(length(x))
  y[1] <- x[1]
  for (kk in seq(from = 2, to = length(x))) {
    tmp <- x[seq_len(kk)]
    y[kk] <- sum(tmp)
  }
  y
}
> y4 <- my_cumsum2(x)
> y4-1
[1] -0.90909091 -0.81818182 -0.72727273 -0.63636364 -0.54545455 -0.45454545
[7] -0.36363636 -0.27272727 -0.18181818 -0.09090909 0.00000000
> y4-y1
[1] 0 0 0 0 0 0 0 0 0 0 0
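As a side note, the reason y4 matches y1 exactly is, I believe, that sum() accumulates in extended precision internally and rounds to double only once per call, which is also what base::cumsum() effectively does; the price is that my_cumsum2() is quadratic in length(x).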
Compared with base::cumsum(), rowCumsums() seems to be less precise with floating-point operations. As an example, take calculating cumulative sums of repeated fractions that should all sum to one in the end. Every time, cumsum() from base R returns 1 as the final result, while most of the time the result of rowCumsums() is either slightly below or slightly above 1.
This can have practical consequences: recently I was trying to speed up Kolmogorov-Smirnov test calculations, and the statistic requires calculating cumulative sums. I tried using rowCumsums(), but psmirnov() then frequently returned NA values. The trouble is that the statistic should be bounded between 0 and 1, and rowCumsums() yields numbers that are a tiny bit below zero or above 1.
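Until this is resolved, a minimal workaround sketch, assuming the deviations are only floating-point noise of a few ulps, is to clamp the statistic into its theoretical range before passing it to psmirnov():

d <- matrixStats::rowCumsums(t(rep(1/21, 21)))
d <- pmin(pmax(d, 0), 1)           # clamp into [0, 1]; assumes tiny numeric noise only
stats::psmirnov(d[21], c(10, 11))  # no longer NA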