[Closed] ccshao closed this issue 4 years ago
2 things:
1. Taking a single row (or column) vs taking more than 1 row (or column) are TWO VERY DIFFERENT OPERATIONS
First let's see what happens with an ordinary matrix:
m <- matrix(1:30, nrow=5)
m[1, ] # a vector
# [1] 1 6 11 16 21 26
m[1:2, ] # a 2x6 matrix
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 6 11 16 21 26
# [2,] 2 7 12 17 22 27
So by default, when taking a single row (or column) from a matrix, one gets a vector instead of a matrix. Note that this can be changed by specifying drop=FALSE:
m[1, , drop=FALSE] # a 1x6 matrix
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 6 11 16 21 26
DelayedMatrix objects mimic this: when taking a single row (or column) from a DelayedMatrix object, by default an ordinary vector is returned. Otherwise, another DelayedMatrix object is returned. For example, with an HDF5Matrix object (HDF5Matrix and TENxMatrix objects are particular types of DelayedMatrix objects):
M <- as(m, "HDF5Matrix")
M
# <5 x 6> matrix of class HDF5Matrix and type "integer":
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 6 11 16 21 26
# [2,] 2 7 12 17 22 27
# [3,] 3 8 13 18 23 28
# [4,] 4 9 14 19 24 29
# [5,] 5 10 15 20 25 30
M[1, ]
# [1] 1 6 11 16 21 26
M[1, , drop=FALSE]
# <1 x 6> matrix of class DelayedMatrix and type "integer":
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 6 11 16 21 26
M[1:2, ]
# <2 x 6> matrix of class DelayedMatrix and type "integer":
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 6 11 16 21 26
# [2,] 2 7 12 17 22 27
Now an even more important difference between M[1, ] and M[1:2, ] is this:
- M[1, ] is NOT delayed (we are in the single-row case): all the data that ends up in the vector gets read from the file and loaded in memory;
- M[1:2, ] IS delayed (we are in the multi-row situation): the subsetting operation is delayed, so nothing actually gets read from the file, except for display purposes. E.g. in this case only the data needed to display M[1:2, ] gets read from the file. But if you were doing MM <- M[1:2, ], nothing would be read from the file (the file wouldn't be accessed at all).

This fundamental difference between delayed vs non-delayed subsetting is documented in ?DelayedArray (Subsetting section).
Conclusion: comparing the speed of M[1, ] vs M[1:2, ] (or M[c(1, 1), ]) is meaningless. The former has to access the file and read the entire row. The latter does not access the file at all: all it does is keep track of the delayed operation by recording it in the returned object.
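One way to convince yourself that M[1:2, ] only records a delayed operation is to inspect the returned object with showtree() from the DelayedArray package. A minimal sketch, assuming M is the HDF5Matrix object from above:

```r
library(DelayedArray)

MM <- M[1:2, ]   # delayed: the file is not touched here
showtree(MM)     # displays the tree of delayed operations recorded in MM
```

The output shows the subsetting step sitting on top of the original HDF5 seed; the data itself is still on disk.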
A more meaningful comparison would be to compare the speed of M[1, ] vs as.matrix(M[1:2, ]). In that case, both operations need to read the extracted rows from the file. If you do this on your object H_logcounts you will see that they take about the same time. Both will be slow (unfortunately).
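A sketch of this fairer comparison, again assuming M is the small HDF5Matrix from above (timings will of course differ on a real dataset):

```r
library(microbenchmark)

## Both expressions below actually read data from the file:
microbenchmark(
  M[1, ],               # non-delayed: realizes one row as a vector
  as.matrix(M[1:2, ]),  # delayed subsetting + explicit realization
  times = 10
)
```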
2. Data in the 10x format is not intended to be accessed by row
The 10x format uses a compressed sparse column representation of the data. With this representation, the full indices component (this is the HDF5 dataset where the row indices are stored) needs to be loaded in memory before a full row can be read. This explains why H_logcounts[100, ] is so slow. This is in fact the worst-case scenario! In other words, data in the 10x format is intended to be accessed by column (or groups of columns), not by row. If you need to access the data by row, save it as a conventional (i.e. dense) HDF5 dataset first (with writeHDF5Array()). Pay particular attention to the chunk geometry that you choose (see ?writeHDF5Array).
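A sketch of such a conversion, assuming H_logcounts is the TENxMatrix object (the filename, dataset name, and chunk geometry below are illustrative, not prescriptive):

```r
library(HDF5Array)

## Write a dense copy of the 10x data, chunked by full rows so that
## reading a single row only touches one chunk on disk.
H_dense <- writeHDF5Array(H_logcounts, "./logcounts_dense.h5",
                          name = "logcounts",
                          chunkdim = c(1, ncol(H_logcounts)))
```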
Hope this helps, H.
@hpages Thank you very much for the quick and in-depth explanations, much more than I would expect.
I tested your suggestions with writeHDF5Array on a smaller data set, modifying the chunkdim parameter. Indeed, taking a single row is much faster with writeHDF5Array.
Surprisingly to me, a smaller chunkdim of c(100, 100) gave faster access?
In general, what are the recommended values for chunkdim?
dim(D_s[["RNA"]]@data)
[1] 13833 723
P_logcounts <- "./logcounts.h5"
P_logcounts2 <- "./logcounts2.h5"
H_logcounts <- HDF5Array::writeTENxMatrix(D_s[["RNA"]]@data, P_logcounts, group="grlogc")
H_logcounts2 <- HDF5Array::writeHDF5Array(D_s[["RNA"]]@data, P_logcounts2, name="grlogc")
H_logcounts3 <- HDF5Array::writeHDF5Array(D_s[["RNA"]]@data, "./logcounts3.h5", name="grlogc", chunkdim = c(100, 100))
microbenchmark::microbenchmark(H_logcounts[100, ], H_logcounts[, 100],
H_logcounts2[100, ], H_logcounts2[, 100],
H_logcounts3[100, ], H_logcounts3[, 100],
times = 10)
Unit: milliseconds
                expr        min         lq       mean     median         uq        max neval cld
  H_logcounts[100, ] 249.911267 262.380767 354.653501 272.407510 290.421077 701.073605    10   b
  H_logcounts[, 100]   4.546958   4.821871   5.146252   4.955936   5.250725   6.835683    10  a
 H_logcounts2[100, ]  58.533349  65.112595  70.391925  70.280364  75.118305  82.344922    10  a
 H_logcounts2[, 100]  61.983717  65.694318  69.155329  70.193500  72.444110  74.350799    10  a
 H_logcounts3[100, ]   3.846225   4.083485   5.493897   4.301052   5.856033  10.041147    10  a
 H_logcounts3[, 100]  22.581054  23.027121  26.970116  23.621959  30.005232  40.354176    10  a
Changing the block size of DelayedArray has no impact.
options()$DelayedArray.block.size
NULL
options(DelayedArray.block.size=2e9)
options()$DelayedArray.block.size
[1] 2e+09
microbenchmark::microbenchmark(H_logcounts[100, ], H_logcounts[, 100],
                               H_logcounts2[100, ], H_logcounts2[, 100],
                               H_logcounts3[100, ], H_logcounts3[, 100],
                               times = 10)
Unit: milliseconds
                expr        min         lq       mean     median         uq        max neval cld
  H_logcounts[100, ] 252.843005 283.992337 337.029014 297.472046 318.542973 730.616213    10   b
  H_logcounts[, 100]   4.628639   5.077837   5.257612   5.226790   5.358480   5.953905    10  a
 H_logcounts2[100, ]  62.303898  66.919217  70.779986  68.728016  72.045088  85.572762    10  a
 H_logcounts2[, 100]  65.781188  68.181873  72.359173  72.746999  73.800211  80.829545    10  a
 H_logcounts3[100, ]   3.903571   4.041833   4.860554   4.302677   5.655599   7.094391    10  a
 H_logcounts3[, 100]  22.556024  22.919530  25.001734  23.494084  24.112284  39.091279    10  a
> In general, what are the recommended values for chunkdim?
The "optimum" chunking scheme depends on your access pattern. For example, if you know that you will read a random row, then the best scheme is to make each row a chunk, i.e., chunkdim=c(1, ncol(mat)). This ensures that you never read (from disk) more data than is strictly necessary. Conversely, if you know that you will need to read a random column, you make each column a chunk, i.e., chunkdim=c(nrow(mat), 1). If you know that you will need to read consecutive rows, then you might make a multi-row chunk to reduce the number of read requests to the OS, and so on.
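A minimal sketch of the two extremes described above (the filenames and matrix are hypothetical placeholders):

```r
library(HDF5Array)

mat <- matrix(runif(100 * 20), nrow = 100)

## Row chunks: optimal for random single-row access.
A_byrow <- writeHDF5Array(mat, "./byrow.h5", name = "m",
                          chunkdim = c(1, ncol(mat)))

## Column chunks: optimal for random single-column access.
A_bycol <- writeHDF5Array(mat, "./bycol.h5", name = "m",
                          chunkdim = c(nrow(mat), 1))
```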
If you don't know what your access pattern will be, then the next best approach is to hedge your bets and make something that is not-too-bad for any given access pattern. In most cases, this involves making rectangular chunks, which is suboptimal but not abominable for both row and column access. Conversely, if we had, say, row chunks and we wanted to access a single column, we would have to load in the ENTIRE dataset chunk-by-chunk and extract the desired column value from each chunk!
IIRC, HDF5Array's default chunking scheme is to use chunk dimensions proportional to the size of the dataset, which is a good approach for consecutive row and column access under memory constraints. In your case, though, a request for a single row or column will load in much more data than is strictly necessary. Reducing the chunk size mitigates this effect, resulting in improved performance; however, this only works up to a certain extent, as many small chunks require increasing overhead to manage and extract data from (and also increase the size of the file).
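You can ask HDF5Array directly what chunk geometry it would pick by default for a dataset of a given size, via getHDF5DumpChunkDim() (a sketch using the dimensions of the dataset in this thread):

```r
library(HDF5Array)

## Default chunk geometry HDF5Array would use for a 13833 x 723 dataset
getHDF5DumpChunkDim(c(13833L, 723L))
```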
tl;dr there is no single optimum chunking scheme for all access patterns. If you need fast row and column access, consider making two copies of the same file with different chunking schemes.
@LTLA Thanks very much for the updates; indeed, a square chunkdim shows balanced performance on the tested data.
P_logcounts <- "./logcounts.h5"
P_logcounts2 <- "./logcounts2.h5"
H_logcounts <- HDF5Array::writeTENxMatrix(D_s[["RNA"]]@data, P_logcounts, group="grlogc")
H_logcounts2 <- HDF5Array::writeHDF5Array(D_s[["RNA"]]@data, P_logcounts2, name="grlogc")
H_logcounts3 <- HDF5Array::writeHDF5Array(D_s[["RNA"]]@data, "./logcounts3.h5", name="grlogc", chunkdim = c(100, 100))
H_logcounts4 <- HDF5Array::writeHDF5Array(D_s[["RNA"]]@data, "./logcounts4.h5", name="grlogc", chunkdim = c(1, ncol(D_s[["RNA"]]@data)))
t1 <- sample(100, 10)
microbenchmark::microbenchmark(H_logcounts[100, ], H_logcounts[, 100],
H_logcounts2[100, ], H_logcounts2[, 100],
H_logcounts3[100, ], H_logcounts3[, 100],
H_logcounts4[100, ], H_logcounts4[, 100],
H_logcounts2[t1, ], H_logcounts2[, t1],
H_logcounts3[t1, ], H_logcounts3[, t1],
H_logcounts4[t1, ], H_logcounts4[, t1],
times = 10)
Unit: milliseconds
                expr        min         lq       mean     median         uq        max neval cld
  H_logcounts[100, ] 299.233826 301.272547 399.325006 308.756491 324.519017 771.595118    10   b
  H_logcounts[, 100]   4.549893   5.017393   6.428904   5.316441   5.578046  17.626441    10  a
 H_logcounts2[100, ]  66.295497  67.290718  71.745083  71.046112  74.346571  80.291808    10  a
 H_logcounts2[, 100]  65.660536  67.951340  71.904682  70.525152  74.636732  79.889627    10  a
 H_logcounts3[100, ]   3.698368   4.504492   4.877121   4.550175   4.640842   8.636092    10  a
 H_logcounts3[, 100]  23.120689  24.264654  27.139369  27.248005  28.179468  33.787042    10  a
 H_logcounts4[100, ]   2.711236   2.976416   3.458345   3.283627   3.560682   5.325188    10  a
 H_logcounts4[, 100] 390.424229 401.477971 409.067359 406.152292 414.989851 431.797458    10   b
  H_logcounts2[t1, ]   2.350678   2.541865   3.806603   2.699610   2.910438  13.741159    10  a
  H_logcounts2[, t1]   2.364793   2.526137   2.730014   2.632939   3.008850   3.232565    10  a
  H_logcounts3[t1, ]   2.252623   2.619572   2.853207   2.946617   3.122217   3.268262    10  a
  H_logcounts3[, t1]   2.432676   2.482984   2.689345   2.664224   2.850759   3.051320    10  a
  H_logcounts4[t1, ]   2.138416   2.243979   2.581604   2.723396   2.795297   2.907943    10  a
  H_logcounts4[, t1]   2.186947   2.552777   3.000285   2.740208   2.831533   5.723909    10  a
I met a quite strange problem using HDF5Array: taking a single row is extremely slow compared to taking 2 or more rows/columns. However, taking a single column is fast. Surprisingly, something like [c(100, 100), ], but not [100, ], is also fast. Taking a single row is about 3000 times slower than taking a single column, and 6000 times slower than taking at least 2 columns/rows.
Some tests on the TENxMatrix, with the latest HDF5Array from Bioconductor (1.14.3):
sessionInfo()