PeteHaitch / DelayedMatrixStats

A port of the matrixStats API to work with DelayedMatrix objects from the DelayedArray package
Other
15 stars 7 forks source link

More issues with how the 'center' argument is handled #68

Closed hpages closed 4 years ago

hpages commented 4 years ago

This is a continuation of issue #64. Looks like even after PR #65, a few issues remain with how the center argument is handled.

1st issue:

When rows is supplied, center is not subsetted accordingly.

Works as expected with an ordinary matrix:

library(matrixStats)
m <- matrix(1:30, nrow=6) * 10^(0:5)

rv1 <- rowVars(m)
rv1
# [1] 9e+01 9e+03 9e+05 9e+07 9e+09 9e+11

rv1b <- rowVars(m, center=rowMeans(m))
rv1b
# [1] 9e+01 9e+03 9e+05 9e+07 9e+09 9e+11

all.equal(rv1, rv1b)
# [1] TRUE

rv1c <- rowVars(m, rows=5:6)
rv1c
# [1] 9e+09 9e+11

all.equal(rv1[5:6], rv1c)
# [1] TRUE

rv1d <- rowVars(m, rows=5:6, center=rowMeans(m))
rv1d
# [1] 9e+09 9e+11

all.equal(rv1c, rv1d)
# [1] TRUE

but not with a DelayedMatrix derivative:

library(HDF5Array)
M <- as(m, "HDF5Matrix")

library(DelayedMatrixStats)
rv2 <- rowVars(M)
all.equal(rv1, rv2)
# [1] TRUE

rv2b <- rowVars(M, center=rowMeans(M))
all.equal(rv1, rv2b)
# [1] TRUE

rv2c <- rowVars(M, rows=5:6)
all.equal(rv1c, rv2c)
# [1] TRUE

rv2d <- rowVars(M, rows=5:6, center=rowMeans(M))
rv2d
# [1] 4.512500e+10 4.950000e+12 4.512219e+10 4.949680e+12 9.000000e+09
# [6] 9.000000e+11

2nd issue:

Internal helper .DelayedMatrix_block_rowVars() breaks if the grid is non-trivial (i.e. has more than 1 block) and if the blocks are made of more than 1 row.

Works fine with a trivial grid (single block):

total_data_size <- length(M) * get_type_size(type(M))
setAutoBlockSize(total_data_size)
rowAutoGrid(M)
# 1 x 1 RegularArrayGrid object on a 6 x 5 array:
#      [,1]
# [1,] [ ,]

rowVars(M, center=rowMeans(M))
# [1] 9e+01 9e+03 9e+05 9e+07 9e+09 9e+11

but not with a grid made of 2 blocks (4 rows and 2 rows, respectively):

setAutoBlockSize(total_data_size * 2/3)
rowAutoGrid(M)
# 2 x 1 RegularArrayGrid object on a 6 x 5 array:
#      [,1]  
# [1,] [1-4,]
# [2,] [5-6,]

rowVars(M, center=rowMeans(M))
# Error in center[subset] : invalid subscript type 'S4'

Note that this 2nd issue breaks the scale() method for DelayedMatrix objects (see https://github.com/Bioconductor/DelayedArray/issues/79).

The issues described above affect all the methods with a center argument i.e. row/colVars(), row/colMads(), and row/colWeightedMads().

I'm working on a patch.

H.

sessionInfo():

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.3/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.3/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DelayedMatrixStats_1.12.0 HDF5Array_1.18.0         
 [3] rhdf5_2.34.0              DelayedArray_0.16.0      
 [5] IRanges_2.24.0            S4Vectors_0.28.0         
 [7] MatrixGenerics_1.2.0      BiocGenerics_0.36.0      
 [9] Matrix_1.2-18             matrixStats_0.57.0       

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5              lattice_0.20-41         rhdf5filters_1.2.0     
[4] grid_4.0.3              sparseMatrixStats_1.2.0 Rhdf5lib_1.12.0        
[7] tools_4.0.3             compiler_4.0.3         
LTLA commented 4 years ago

1. Oops. Guess the test suite wasn't as exhaustive as I thought.

2. Double-oops. Though I did it right (with expand.NSBS=TRUE) for my own functions, so dunno what I was thinking here.

hpages commented 4 years ago

No worries. I'll try to expand the test suite to cover this.

zznx commented 4 years ago

rowVars=n/(n-1)(1/n ∑▒〖x^2-¯x^2)〗,center=¯x^2 ,Does this formula hold if you specify another center @hpages ,thank you very much

hpages commented 4 years ago

I wonder if we shouldn't try to come up with a test suite that can be shared by all the packages that define matrixStats methods (currently sparseMatrixStats, DelayedArray, and DelayedMatrixStats). After all what we do when we test the correctness of the rowVars() method for DelayedMatrix objects is very similar to what we do when we test the correctness of the rowVars() method for xgCMatrix objects, only the input is different. So it seems that there is a lot of room for sharing here. Getting a test suite that covers all possible uses of something like rowVars() is actually a significant amount of work and it's easy to forget to test some weird combinations of the arguments so it would be really nice to do this only once.

Anyway, probably for another day.

zznx commented 4 years ago

Hi, I don't think this code makes sense. What do you think? Looking forward to your reply

.scaled_colranks_safe <- function(x) {
    out <- colRanks(DelayedArray(x), ties.method="average")
    center <- (nrow(x) + 1)/2
    sum.sq <- rowVars(out, center=center) * (nrow(x)-1)
    sum.sq <- pmax(1e-8, sum.sq)
    (out - center)/(sqrt(sum.sq) * 2)
}
LTLA commented 4 years ago

@zznx please restrict your comments to LTLA/SingleR#160. If I wasn't going to answer your question there, I'm not going to answer your question here.

zznx commented 4 years ago

I'm sorry to bother you

LTLA commented 4 years ago

I wonder if we shouldn't try to come up with a test suite that can be shared by all the packages that define matrixStats methods

Yes, you could imagine stuffing some setup scripts in MatrixGenerics's inst/ that define some sensible test functions, which we can retrieve and run in our unit testing framework of choice. It would need some very well-considered debugging capability when tests fail, though; this was a major PITA with the current test suite in DMS, as I had to go in and debug() the test function, and then also figure out the exact iteration of Map that failed.

A failure message along the lines of:

would be very nice.

(As an aside, the test suite for DMS effectively tests SMS as well, because it needs to check for correct sparse dispatch.)

hpages commented 4 years ago

@zznx,

Hard to read your formula, it didn't render very clearly. An extended variance defined for x (a numeric vector, not a matrix, for the sake of clarity) as:

sum((x - center)^2) / (length(x) - 1)

is what it is. It has the benefit to be equal to the traditional variance when center is set to mean(x) and the fact that the user can choose a different center adds some nice flexibility. So I don't really understand all the dance and hesitation around this. See https://github.com/const-ae/sparseMatrixStats/issues/13. But as pointed by @LTLA your question is off-topic here.

hpages commented 4 years ago

I'll try to expand the test suite to cover this.

I gave up on this, sorry. I'd rather spend my time on working on an extensive and reusable testing framework for all the MatrixGenerics's generics. I'll actually need this very soon.

PeteHaitch commented 4 years ago

A common test suite would be brilliant. I think I tried to copy the matrixstats test suite when I initially implemented DMS, but it may have fallen behind. Making the tests more debugable is definitely important

hpages commented 4 years ago

Fixed in DelayedMatrixStats 1.13.1.

hpages commented 4 years ago

BTW I just pushed the changes to the RELEASE_3_12 branch (DelayedMatrixStats 1.12.1). Hope it's ok.

PeteHaitch commented 4 years ago

BTW I just pushed the changes to the RELEASE_3_12 branch (DelayedMatrixStats 1.12.1). Hope it's ok.

Thanks, hadn't thought of that.