Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray
24 stars 9 forks source link

scale() not working on TENxBrainData #79

Closed pablo-rodr-bio2 closed 3 years ago

pablo-rodr-bio2 commented 3 years ago

I've tried using the new function DelayedArray::scale() on the assay of TENxBrainData20k() and it's giving back an error

This is on a bioconductor_docker container after installing all the proper packages:

First I do a test with a simple hdf5 matrix and everything works:

library(HDF5Array)

set.seed(123)
m <- matrix(sample.int(10, 25, T), 15, 15)
m <- as(m, "HDF5Array")
DelayedArray::scale(m)

# <15 x 15> matrix of class DelayedMatrix and type "double":
#   [,1]        [,2]        [,3] ...      [,14]      [,15]
# [1,] -1.10978156 -0.85544059 -0.82447161   .  1.1885167 -0.4855605
# [2,] -1.10978156  0.70938976 -1.21094268   . -0.8297192 -1.1476884
# [3,]  1.31786561  1.33532190 -0.43800054   . -0.4933465  0.8386953
# [4,] -1.45658830  0.39642369  0.72141266   . -1.5024645  0.8386953
# [5,] -0.06936135  1.33532190  1.10788373   .  0.5157714  0.8386953
# ...           .           .           .   .          .          .
# [11,] -0.41616809 -0.85544059 -1.59741375   . -0.1569739  0.8386953
# [12,] -1.10978156 -0.85544059  0.33494159   . -0.4933465 -1.1476884
# [13,]  0.97105887  1.33532190  1.10788373   .  0.1793987 -0.8166244
# [14,]  0.97105887 -1.16840666 -0.05152948   .  1.1885167 -1.8098163
# [15,]  0.97105887  0.08345762  1.10788373   .  1.5248893  0.1765674

But then, when using the TENx experiment, there's an error:

library(TENxBrainData)

expr <- assay(TENxBrainData20k(), "counts")
DelayedArray::scale(expr)
# Error in center[subset] : invalid subscript type 'S4'
sessionInfo()
R Under development (unstable) (2020-11-18 r79449)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TENxBrainData_1.11.0        SingleCellExperiment_1.13.1 SummarizedExperiment_1.21.0 Biobase_2.51.0             
 [5] GenomicRanges_1.43.0        GenomeInfoDb_1.27.1         HDF5Array_1.19.0            rhdf5_2.35.0               
 [9] DelayedArray_0.17.2         IRanges_2.25.2              S4Vectors_0.29.3            MatrixGenerics_1.3.0       
[13] matrixStats_0.57.0          BiocGenerics_0.37.0         Matrix_1.2-18               BiocManager_1.30.10        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5                    lattice_0.20-41               assertthat_0.2.1              digest_0.6.27                
 [5] mime_0.9                      BiocFileCache_1.15.1          R6_2.5.0                      RSQLite_2.2.1                
 [9] httr_1.4.2                    pillar_1.4.7                  sparseMatrixStats_1.3.0       zlibbioc_1.37.0              
[13] rlang_0.4.8                   curl_4.3                      rstudioapi_0.13               blob_1.2.1                   
[17] AnnotationHub_2.23.0          RCurl_1.98-1.2                bit_4.0.4                     shiny_1.5.0                  
[21] compiler_4.1.0                httpuv_1.5.4                  pkgconfig_2.0.3               htmltools_0.5.0              
[25] tidyselect_1.1.0              tibble_3.0.4                  GenomeInfoDbData_1.2.4        interactiveDisplayBase_1.29.0
[29] withr_2.3.0                   crayon_1.3.4                  dplyr_1.0.2                   dbplyr_2.0.0                 
[33] later_1.1.0.1                 bitops_1.0-6                  rhdf5filters_1.3.0            rappdirs_0.3.1               
[37] grid_4.1.0                    xtable_1.8-4                  lifecycle_0.2.0               DBI_1.1.0                    
[41] magrittr_2.0.1                XVector_0.31.0                promises_1.1.1                DelayedMatrixStats_1.13.0    
[45] ellipsis_0.3.1                filelock_1.0.2                generics_0.1.0                vctrs_0.3.5                  
[49] Rhdf5lib_1.13.0               tools_4.1.0                   bit64_4.0.5                   glue_1.4.2                   
[53] purrr_0.3.4                   BiocVersion_3.13.1            fastmap_1.0.1                 yaml_2.2.1                   
[57] AnnotationDbi_1.53.0          ExperimentHub_1.17.0          memoise_1.1.0
pablo-rodr-bio2 commented 3 years ago

Strangely, if I try with another big HDF5Array, scale() works perfectly too, so maybe the problem is in the TENxBrainData end

library(scRNAseq)
sce <- ReprocessedAllenData("tophat_counts")
sce <- as(assay(sce, "tophat_counts"), "HDF5Array")
DelayedArray::scale(sce)

# <20816 x 379> matrix of class DelayedMatrix and type "double":
#   SRR2140028   SRR2140022   SRR2140055 ...  SRR2139341  SRR2139336
# 0610007P14Rik -0.003141943  0.118237602  1.154833503   .  0.24846387  0.05805555
# 0610009B22Rik -0.056619890 -0.191826493  0.077402182   .  0.15019342  0.38569442
# 0610009L18Rik -0.198670686 -0.191826493 -0.206755529   . -0.18358724 -0.17519212
# 0610009O20Rik -0.198670686  0.812372858 -0.206755529   .  0.03865082  0.15946758
# 0610010F05Rik -0.198670686 -0.191826493 -0.141636054   .  0.13409740  0.02295138
# ...            .            .            .   .           .           .
# Zyg11a   -0.1986707   -0.1918265   -0.2067555   . -0.18358724 -0.17519212
# Zyg11b    0.1163478    0.2643378    0.3773464   .  0.05079343 -0.09250231
# Zyx   -0.1986707   -0.1848086   -0.2067555   .  0.04316900 -0.17519212
# Zzef1   -0.1677537    0.2534919   -0.2047822   . -0.18358724  0.35371062
# Zzz3   -0.1978351   -0.1375972   -0.2067555   . -0.18245769 -0.17519212
hpages commented 3 years ago

Thanks @pablo-rodr-bio2 for the report. The scale() method for DelayedMatrix objects uses rowVars() internally, which unfortunately seems to be broken on DelayedMatrix objects for certain block sizes:

library(HDF5Array)
M <- as(matrix(runif(300), ncol=15), "HDF5Matrix")
setAutoBlockSize(500)

library(DelayedMatrixStats)
rowVars(M, center=0)
# Error in center[subset] : invalid subscript type 'S4'

The rowVars() method for DelayedMatrix objects is defined in DelayedMatrixStats.

I started to work on a PR that addresses this issue plus other issues related to how various DelayedMatrixStats methods handle the center argument.

hpages commented 3 years ago

Hi @pablo-rodr-bio2 ,

This should be addressed in DelayedMatrixStats 1.13.1 which will become available via BiocManager::install() in the next 15h or so. Please let me know if that addresses the issue for you so we can close this.

Thanks, H.

pablo-rodr-bio2 commented 3 years ago

Yes! Now it works correctly, thank you so much!