HenrikBengtsson / matrixStats

R package: Methods that Apply to Rows and Columns of Matrices (and to Vectors)
https://cran.r-project.org/package=matrixStats
203 stars 33 forks source link

rowQuantiles and na.rm not working as expected with zoo objects #92

Closed mikebesso closed 7 years ago

mikebesso commented 7 years ago

Please note that I am still a novice when it comes to R. Any help you could provide would be very much appreciated.

I am creating some technical indicators against historical stock data. Here is a slice of my zoo object:

# Recreate zoo object with first row containing one NA for column UNG
x <-
structure(
  c(
    88.5871426057088, 81.5169354118093, 71.9346977827734, 
    NA, 341.197285767058, 76.434226191296, 
    49.2172139505476, 57.2595632037802, 58.9934290713818
  ), 
  .Dim = c(3L, 3L), 
  .Dimnames = list(
    NULL, 
    c("SPY", "UNG", "XLE")
  ), 
  index = structure(c(13621, 13622, 13623), class = "Date"), 
  class = "zoo"
)

This slice looks like:

                SPY       UNG      XLE
2007-04-18 88.58714        NA 49.21721
2007-04-19 81.51694 341.19729 57.25956
2007-04-20 71.93470  76.43423 58.99343

The following

rowQuantiles(x, probs = c(0.25, 0.75), na.rm = FALSE);

as expected returns

          25%       75%
[1,]       NA        NA
[2,] 69.38825 211.35711
[3,] 65.46406  74.18446

but,

rowQuantiles(x, probs = c(0.25, 0.75), na.rm = TRUE);

generates an error:

Error in quantile.default(coredata(x), ...) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE 

When I took a peak at the code in rowQuantiles.R, I was surprised to see, in line 114) that na.rm is always passed into quantile as FALSE.

q[kk,] <- quantile(xkk, probs=probs, na.rm=FALSE, type=type, ...)

That said, I see that there is quite a bit of code above this to detect/handle NAs. So, I am not sure if I fully understand what is going on just yet.

Any help you can provide would be greatly appreciated.

Thanks

HenrikBengtsson commented 7 years ago

Thanks. I can reproduce this. This seems to be a conflict with the zoo package. If you do the same without loading zoo, it works:

> ("zoo" %in% loadedNamespaces())
[1] FALSE
> rowQuantiles(x, probs = c(0.25, 0.75), na.rm = TRUE)
          25%       75%
[1,] 59.05970  78.74466
[2,] 69.38825 211.35711
[3,] 65.46406  74.18446

but with zoo, it fails:

> loadNamespace("zoo")
> ("zoo" %in% loadedNamespaces())
[1] TRUE
> rowQuantiles(x, probs = c(0.25, 0.75), na.rm = TRUE)
Error in quantile.default(coredata(x), ...) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE

Workaround

The workaround is to force a coercion to matrix using as.matrix();

> ("zoo" %in% loadedNamespaces())
[1] TRUE
> rowQuantiles(as.matrix(x), probs = c(0.25, 0.75), na.rm = TRUE)
          25%       75%
[1,] 59.05970  78.74466
[2,] 69.38825 211.35711
[3,] 65.46406  74.18446

Troubleshooting

The reason for the error is that zoo implements S3 method quantile() for zoo objects;

> rowQuantiles(x, probs = c(0.25, 0.75), na.rm = TRUE)
Error in quantile.default(coredata(x), ...) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE
> traceback()
6: stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
5: quantile.default(coredata(x), ...)
4: quantile(coredata(x), ...)
3: quantile.zoo(xkk, probs = probs, na.rm = FALSE, type = type, 
       ...)
2: quantile(xkk, probs = probs, na.rm = FALSE, type = type, ...)
1: rowQuantiles(x, probs = c(0.25, 0.75), na.rm = TRUE)

When zoo is not loaded, then zoo::quantile.zoo() is not called, but stats:::quantile.default().

I need to think about this more, but it looks like it is a bug of zoo or at least lack of support for dropping missing values. Internally, this is what happens:

> xkk <- x[1,]
> xkk
                SPY UNG      XLE
2007-04-18 88.58714  NA 49.21721
> !is.na(xkk)
      SPY   UNG  XLE
[1,] TRUE FALSE TRUE

> xkk <- xkk[!is.na(xkk)]
> xkk
                SPY UNG      XLE
2007-04-18 88.58714  NA 49.21721

Note how the subsetting of xkk is ignored; all values remain afterward, including the NAs we wanna drop. I notice that xkk still has two dimensions and it seem impossible to extract the data as a vector, e.g.

> xkk <- x[1,]
> dim(xkk)
[1] 1 3
> xkk <- x[1, , drop = TRUE]
> dim(xkk)
[1] 1 3
mikebesso commented 7 years ago

Thank you for the quick turn around and in depth analysis. Your proposed work around works for me. And, I understand things a bit better now. But only a bit. So please accept my apologies if I am off base. But you did make me want to dig further.

I agree with your analysis that the "issue" likely resides inside of zoo. But, the following appears to address the issue:

coredata(xkk)[!is.na(coredata(xkk)), drop = FALSE]

Of course, it might create other issues.

Again, thank you for taking the time to not only look at the issue, but to also educate me a bit.

HenrikBengtsson commented 7 years ago

Yes, it's correct that calling zoo::coredata() would work. However, that would add a dependency on zoo for package matrixStats and we don't want to do that. Instead, we want a generic solution. I reached out to the author of zoo to discuss this. As an alternative, I think there is a generic workaround that can be done in matrixStats, but that would make it slightly less efficient.

So, for now, use:

as.matrix(x)

and pass that to matrixStats.

mikebesso commented 7 years ago

Again, thank you for your time and patience with me. Some I was thinking coredata was a general purpose function.

Your solution is working for me. All is good.

THANKS