Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array

taking a single row from a TENxMatrix object is extremely slow #26

Closed ccshao closed 4 years ago

ccshao commented 4 years ago

I ran into a quite strange problem using HDF5Array: taking a single row is extremely slow compared to taking 2 or more rows/columns. However, taking a single column is fast. Surprisingly, something like c(100, 100), but not c(100), is also fast.

It is about 3000 times slower than taking a single column, and 6000 times slower than taking at least 2 columns/rows.

Some tests on the TENxMatrix, with the latest HDF5Array from Bioconductor (1.14.3):


library(microbenchmark)

P_logcounts <- "./logcounts.h5"
H_logcounts <- HDF5Array::writeTENxMatrix(D_s[["RNA"]]@data, P_logcounts, group = "grlogc")
t1 <- sample(100, 10)
microbenchmark(H_logcounts[100, ], H_logcounts[, 100],
               H_logcounts[t1, ], H_logcounts[, t1],
               H_logcounts[c(100, 100), ], H_logcounts[c(2, 100), ],
               H_logcounts[, c(2, 100)],
               times = 5)
Unit: milliseconds
                       expr          min           lq         mean       median           uq          max neval cld
        H_logcounts[100, ] 12930.676885 13006.680246 13334.659438 13175.024264 13486.484464 14074.431333     5   b
        H_logcounts[, 100]     4.282595     4.453792     4.526677     4.475585     4.650540     4.770873     5  a
         H_logcounts[t1, ]     2.095773     2.105008     2.186020     2.202214     2.251088     2.276018     5  a
         H_logcounts[, t1]     2.155592     2.220851     2.320822     2.347605     2.402903     2.477158     5  a
 H_logcounts[c(100, 100), ]    2.092672     2.101198     2.179015     2.121373     2.272450     2.307381     5  a
   H_logcounts[c(2, 100), ]    2.126450     2.182361     2.256724     2.211836     2.355742     2.407232     5  a
   H_logcounts[, c(2, 100)]    2.292963     2.294334     2.394305     2.315868     2.402078     2.666280     5  a

sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 19.2

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] microbenchmark_1.4-7        SingleCellExperiment_1.8.0 
 [3] SummarizedExperiment_1.16.1 DelayedArray_0.12.2        
 [5] BiocParallel_1.20.1         matrixStats_0.55.0         
 [7] Biobase_2.46.0              GenomicRanges_1.38.0       
 [9] GenomeInfoDb_1.22.0         IRanges_2.20.2             
[11] S4Vectors_0.24.3            BiocGenerics_0.32.0        
[13] Matrix_1.2-18               data.table_1.12.8          
[15] cowplot_1.0.0               ggplot2_3.3.0              
[17] magrittr_1.5                ll_0.1.0                   
[19] colorout_1.2-1             

loaded via a namespace (and not attached):
  [1] useful_1.2.6             Seurat_3.1.1             ggbeeswarm_0.6.0        
  [4] TH.data_1.0-10           Rtsne_0.15               colorspace_1.4-1        
  [7] ggridges_0.5.2           XVector_0.26.0           BiocNeighbors_1.4.2     
 [10] farver_2.0.3             leiden_0.3.3             listenv_0.8.0           
 [13] npsurv_0.4-0             ggrepel_0.8.2            mvtnorm_1.1-0           
 [16] codetools_0.2-16         splines_3.6.3            R.methodsS3_1.8.0       
 [19] mnormt_1.5-6             lsei_1.2-0               scater_1.14.6           
 [22] TFisher_0.2.0            jsonlite_1.6.1           ica_1.0-2               
 [25] cluster_2.1.0            png_0.1-7                R.oo_1.23.0             
 [28] uwot_0.1.5               HDF5Array_1.14.3         sctransform_0.2.1       
 [31] compiler_3.6.3           httr_1.4.1               assertthat_0.2.1        
 [34] lazyeval_0.2.2           BiocSingular_1.2.2       htmltools_0.4.0         
 [37] tools_3.6.3              rsvd_1.0.3               igraph_1.2.4.2          
 [40] gtable_0.3.0             glue_1.3.2               GenomeInfoDbData_1.2.2  
 [43] reshape2_1.4.3           RANN_2.6.1               dplyr_0.8.5             
 [46] rappdirs_0.3.1           Rcpp_1.0.3               vctrs_0.2.4             
 [49] multtest_2.42.0          gdata_2.18.0             ape_5.3                 
 [52] nlme_3.1-144             DelayedMatrixStats_1.8.0 gbRd_0.4-11             
 [55] lmtest_0.9-37            stringr_1.4.0            globals_0.12.5          
 [58] lifecycle_0.2.0          irlba_2.3.3              gtools_3.8.1            
 [61] future_1.16.0            zlibbioc_1.32.0          MASS_7.3-51.5           
 [64] zoo_1.8-7                scales_1.1.0             sandwich_2.5-1          
 [67] rhdf5_2.30.1             RColorBrewer_1.1-2       qs_0.21.2               
 [70] gridExtra_2.3            reticulate_1.14          pbapply_1.4-2           
 [73] stringi_1.4.6            mutoss_0.1-12            plotrix_3.7-7           
 [76] caTools_1.18.0           bibtex_0.4.2.2           Rdpack_0.11-1           
 [79] SDMTools_1.1-221.2       rlang_0.4.5              pkgconfig_2.0.3         
 [82] bitops_1.0-6             lattice_0.20-40          Rhdf5lib_1.8.0          
 [85] ROCR_1.0-7               purrr_0.3.3              labeling_0.3            
 [88] htmlwidgets_1.5.1        tidyselect_1.0.0         RcppAnnoy_0.0.16        
 [91] plyr_1.8.6               R6_2.4.1                 gplots_3.0.3            
 [94] multcomp_1.4-12          pillar_1.4.3             withr_2.1.2             
 [97] sn_1.5-5                 fitdistrplus_1.0-14      survival_3.1-8          
[100] RCurl_1.98-1.1           tsne_0.1-3               tibble_2.1.3            
[103] future.apply_1.4.0       crayon_1.3.4             KernSmooth_2.23-16      
[106] RApiSerialize_0.1.0      plotly_4.9.2             viridis_0.5.1           
[109] grid_3.6.3               metap_1.3                digest_0.6.25           
[112] tidyr_1.0.2              numDeriv_2016.8-1.1      R.utils_2.9.2           
[115] RcppParallel_5.0.0       munsell_0.5.0            beeswarm_0.2.3          
[118] viridisLite_0.3.0        vipor_0.4.5             
hpages commented 4 years ago

2 things:

1. Taking a single row (or column) vs taking more than 1 row (or column) are TWO VERY DIFFERENT OPERATIONS

First let's see what happens with an ordinary matrix:

m <- matrix(1:30, nrow=5)

m[1, ]    # a vector
# [1]  1  6 11 16 21 26

m[1:2, ]  # a 2x6 matrix
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    6   11   16   21   26
# [2,]    2    7   12   17   22   27

So by default, when taking a single row (or column) from a matrix, one gets a vector instead of a matrix. Note that this can be changed by specifying drop=FALSE:

m[1, , drop=FALSE]  # a 1x6 matrix
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    6   11   16   21   26

DelayedMatrix objects mimic this: when taking a single row (or column) from a DelayedMatrix object with the default drop=TRUE, an ordinary vector is returned. Otherwise (i.e. with drop=FALSE, or when taking 2 or more rows/columns), another DelayedMatrix object is returned. For example, with an HDF5Matrix object (HDF5Matrix and TENxMatrix objects are particular types of DelayedMatrix objects):

M <- as(m, "HDF5Matrix")
M
# <5 x 6> matrix of class HDF5Matrix and type "integer":
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    6   11   16   21   26
# [2,]    2    7   12   17   22   27
# [3,]    3    8   13   18   23   28
# [4,]    4    9   14   19   24   29
# [5,]    5   10   15   20   25   30

M[1, ]
# [1]  1  6 11 16 21 26

M[1, , drop=FALSE]
# <1 x 6> matrix of class DelayedMatrix and type "integer":
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    6   11   16   21   26

M[1:2, ]
# <2 x 6> matrix of class DelayedMatrix and type "integer":
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    6   11   16   21   26
# [2,]    2    7   12   17   22   27

Now an even more important difference between M[1, ] and M[1:2, ] is this: M[1, ] is a non-delayed operation that reads the requested row from the file right away, whereas M[1:2, ] is a delayed operation that does not touch the file at all (it only records the subsetting in the DelayedMatrix object it returns).

This fundamental difference between delayed and non-delayed subsetting is documented in ?DelayedArray (Subsetting section).
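
To see the difference in a session, here is a quick sketch (showtree() is exported by DelayedArray, and M is the HDF5Matrix defined above):

class(M[1, ])       # "integer": the row was actually read from the HDF5 file
showtree(M[1:2, ])  # a DelayedMatrix whose tree records a delayed subsetting
                    # operation on top of the HDF5 seed; nothing was read yet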

Conclusion: comparing the speed of M[1, ] vs M[1:2, ] (or M[c(1, 1), ]) is meaningless. The former has to access the file and read the entire row. The latter does not access the file at all: all it does is keep track of the delayed operation by recording it in the returned object.

A more meaningful comparison is between the speed of M[1, ] and that of as.matrix(M[1:2, ]). In that case, both operations need to read the extracted rows from the file. If you do this on your object H_logcounts you will see that they take about the same time. Both will be slow (unfortunately).
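
For instance, a sketch of that comparison on the original object (names follow the first post; exact timings are machine-dependent):

microbenchmark::microbenchmark(
    H_logcounts[100, ],                   # non-delayed: reads the full row from the file
    as.matrix(H_logcounts[c(2, 100), ]),  # delayed subset, then realized: also reads from the file
    times = 5
)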

2. Data in the 10x format is not intended to be accessed by row

The 10x format uses a compressed sparse column representation of the data. With this representation, the full indices component (the HDF5 dataset where the row indices are stored) needs to be loaded into memory before a full row can be read. This explains why H_logcounts[100, ] is so slow; it is in fact the worst-case scenario! In other words, data in the 10x format is intended to be accessed by column (or groups of columns), not by row. If you need to access the data by row, save it as a conventional (i.e. dense) HDF5 dataset first (with writeHDF5Array()). Pay particular attention to the chunk geometry that you choose (see ?writeHDF5Array).
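
For example, a minimal sketch of that workflow (the file path, dataset name, and chunk geometry below are illustrative choices, not prescriptions):

# Re-save the 10x data as a dense HDF5 dataset with row-friendly chunks.
H_dense <- HDF5Array::writeHDF5Array(H_logcounts, "./logcounts_dense.h5",
                                     name = "logcounts",
                                     chunkdim = c(1, ncol(H_logcounts)))
H_dense[100, ]  # a single-row read now only needs to touch one chunk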

Hope this helps, H.

ccshao commented 4 years ago

@hpages Thank you very much for the quick and in-depth explanations, much more than I expected.

I tested your suggestions with writeHDF5Array on a smaller data set, also modifying the chunkdim parameter. Indeed, taking a single row is much faster with writeHDF5Array.

Surprisingly to me, the smaller chunkdim of c(100, 100) gave faster access? In general, what are the recommended values for chunkdim?

dim(D_s[["RNA"]]@data)
[1] 13833   723
P_logcounts <- "./logcounts.h5"
P_logcounts2 <- "./logcounts2.h5"
H_logcounts <- HDF5Array::writeTENxMatrix(D_s[["RNA"]]@data, P_logcounts, group="grlogc")
H_logcounts2 <- HDF5Array::writeHDF5Array(D_s[["RNA"]]@data, P_logcounts2, name="grlogc")
H_logcounts3 <- HDF5Array::writeHDF5Array(D_s[["RNA"]]@data, "./logcounts3.h5", name="grlogc", chunkdim = c(100, 100))
microbenchmark::microbenchmark(H_logcounts[100, ], H_logcounts[, 100],
                                H_logcounts2[100, ], H_logcounts2[, 100],
                                H_logcounts3[100, ], H_logcounts3[, 100],
                                times = 10)
Unit: milliseconds
                expr        min         lq       mean     median         uq        max neval cld
 H_logcounts[100, ] 249.911267 262.380767 354.653501 272.407510 290.421077 701.073605    10   b
 H_logcounts[, 100]   4.546958   4.821871   5.146252   4.955936   5.250725   6.835683    10  a
H_logcounts2[100, ]  58.533349  65.112595  70.391925  70.280364  75.118305  82.344922    10  a
H_logcounts2[, 100]  61.983717  65.694318  69.155329  70.193500  72.444110  74.350799    10  a
H_logcounts3[100, ]   3.846225   4.083485   5.493897   4.301052   5.856033  10.041147    10  a
H_logcounts3[, 100]  22.581054  23.027121  26.970116  23.621959  30.005232  40.354176    10  a

Changing the DelayedArray block size has no impact.


options()$DelayedArray.block.size
NULL
options(DelayedArray.block.size=2e9)
options()$DelayedArray.block.size
[1] 2e+09
microbenchmark::microbenchmark(H_logcounts[100, ], H_logcounts[, 100],
                               H_logcounts2[100, ], H_logcounts2[, 100],
                               H_logcounts3[100, ], H_logcounts3[, 100],
                               times = 10)
Unit: milliseconds
                expr        min         lq       mean     median         uq        max neval cld
 H_logcounts[100, ] 252.843005 283.992337 337.029014 297.472046 318.542973 730.616213    10   b
 H_logcounts[, 100]   4.628639   5.077837   5.257612   5.226790   5.358480   5.953905    10  a
H_logcounts2[100, ]  62.303898  66.919217  70.779986  68.728016  72.045088  85.572762    10  a
H_logcounts2[, 100]  65.781188  68.181873  72.359173  72.746999  73.800211  80.829545    10  a
H_logcounts3[100, ]   3.903571   4.041833   4.860554   4.302677   5.655599   7.094391    10  a
H_logcounts3[, 100]  22.556024  22.919530  25.001734  23.494084  24.112284  39.091279    10  a
LTLA commented 4 years ago

In general, what are the recommended values for chunkdim?

The "optimum" chunking scheme depends on your access pattern. For example, if you know that you will read a random row, then the best scheme is to make each row a chunk, i.e., chunkdim=c(1, ncol(mat)). This ensures that you never read (from disk) more data than is strictly necessary. Conversely, if you know that you will need to read a random column, you make each column a chunk, i.e., chunkdim=c(nrow(mat), 1). If you know that you will need to read consecutive rows, then you might make a multi-row chunk to reduce the number of read requests to the OS, and so on.

If you don't know what your access pattern will be, then the next best approach is to hedge your bets and make something that is not-too-bad for any given access pattern. In most cases, this involves making rectangular chunks, which is suboptimal but not abominable for both row and column access. Conversely, if we had, say, row chunks and we wanted to access a single column, we would have to load in the ENTIRE dataset chunk-by-chunk and extract the desired column value from each chunk!
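
To make that concrete, here is a sketch of the three geometries described above (mat stands in for any in-memory matrix; file and dataset names are illustrative):

by_row <- HDF5Array::writeHDF5Array(mat, "by_row.h5", name="x",
                                    chunkdim=c(1, ncol(mat)))   # fast row reads, slow column reads
by_col <- HDF5Array::writeHDF5Array(mat, "by_col.h5", name="x",
                                    chunkdim=c(nrow(mat), 1))   # fast column reads, slow row reads
hedged <- HDF5Array::writeHDF5Array(mat, "hedged.h5", name="x",
                                    chunkdim=c(100, 100))       # a compromise for both access patterns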

IIRC, HDF5Array's default chunking scheme uses chunk dimensions proportional to the dimensions of the dataset, which is a good approach for consecutive row and column access under memory constraints. In your case, though, a request for a single row or column will load much more data than is strictly necessary. Reducing the chunk size mitigates this effect, resulting in improved performance; however, this only works to a certain extent, as many small chunks require increasing overhead to manage and extract data from (and also increase the size of the file).
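
If you want to see what the automatic geometry would be for a given dataset, HDF5Array exposes it via getHDF5DumpChunkDim(); for example (dimensions taken from the dim() output earlier in this thread):

HDF5Array::getHDF5DumpChunkDim(c(13833L, 723L))  # chunk dims writeHDF5Array() picks when chunkdim is not supplied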

tl;dr there is no single optimum chunking scheme for all access patterns. If you need fast row and column access, consider making two copies of the same file with different chunking schemes.

ccshao commented 4 years ago

@LTLA Thanks very much for the updates; indeed, a square chunkdim shows balanced performance on the tested data.


P_logcounts <- "./logcounts.h5"
P_logcounts2 <- "./logcounts2.h5"
H_logcounts <- HDF5Array::writeTENxMatrix(D_s[["RNA"]]@data, P_logcounts, group="grlogc")
H_logcounts2 <- HDF5Array::writeHDF5Array(D_s[["RNA"]]@data, P_logcounts2, name="grlogc")
H_logcounts3 <- HDF5Array::writeHDF5Array(D_s[["RNA"]]@data, "./logcounts3.h5", name="grlogc", chunkdim = c(100, 100))
H_logcounts4 <- HDF5Array::writeHDF5Array(D_s[["RNA"]]@data, "./logcounts4.h5", name="grlogc", chunkdim = c(1, ncol(D_s[["RNA"]]@data)))
t1 <- sample(100, 10)
microbenchmark::microbenchmark(H_logcounts[100, ], H_logcounts[, 100],
                               H_logcounts2[100, ], H_logcounts2[, 100],
                               H_logcounts3[100, ], H_logcounts3[, 100],
                               H_logcounts4[100, ], H_logcounts4[, 100],
                               H_logcounts2[t1, ], H_logcounts2[, t1],
                               H_logcounts3[t1, ], H_logcounts3[, t1],
                               H_logcounts4[t1, ], H_logcounts4[, t1],
                               times = 10)
Unit: milliseconds
                expr        min         lq       mean     median         uq        max neval cld
 H_logcounts[100, ] 299.233826 301.272547 399.325006 308.756491 324.519017 771.595118    10   b
 H_logcounts[, 100]   4.549893   5.017393   6.428904   5.316441   5.578046  17.626441    10  a
H_logcounts2[100, ]  66.295497  67.290718  71.745083  71.046112  74.346571  80.291808    10  a
H_logcounts2[, 100]  65.660536  67.951340  71.904682  70.525152  74.636732  79.889627    10  a
H_logcounts3[100, ]   3.698368   4.504492   4.877121   4.550175   4.640842   8.636092    10  a
H_logcounts3[, 100]  23.120689  24.264654  27.139369  27.248005  28.179468  33.787042    10  a
H_logcounts4[100, ]   2.711236   2.976416   3.458345   3.283627   3.560682   5.325188    10  a
H_logcounts4[, 100] 390.424229 401.477971 409.067359 406.152292 414.989851 431.797458    10   b
 H_logcounts2[t1, ]   2.350678   2.541865   3.806603   2.699610   2.910438  13.741159    10  a
 H_logcounts2[, t1]   2.364793   2.526137   2.730014   2.632939   3.008850   3.232565    10  a
 H_logcounts3[t1, ]   2.252623   2.619572   2.853207   2.946617   3.122217   3.268262    10  a
 H_logcounts3[, t1]   2.432676   2.482984   2.689345   2.664224   2.850759   3.051320    10  a
 H_logcounts4[t1, ]   2.138416   2.243979   2.581604   2.723396   2.795297   2.907943    10  a
 H_logcounts4[, t1]   2.186947   2.552777   3.000285   2.740208   2.831533   5.723909    10  a