Closed PeteHaitch closed 5 years ago
Thanks Pete for pointing this out. Very useful to know. Any news from the DelayedArray developers on this?
I would also like to note that linear indexing causes trouble when an index is larger than `.Machine$integer.max`:

```
Error: subscript contains NAs
In addition: Warning message:
In NSBS(i, x, exact = exact, strict.upper.bound = !allow.append, :
  NAs introduced by coercion to integer range
```

I assume this is because indices larger than `.Machine$integer.max` get converted into NAs somewhere downstream (sorry, I have not yet been able to find out where exactly).
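For what it's worth, the NA seems consistent with plain integer coercion in base R: linear indices are stored as doubles, and coercing a value past `.Machine$integer.max` to integer yields NA with the same style of warning. A minimal illustration (this is not the actual code path inside DelayedArray):

```r
i <- .Machine$integer.max + 1   # 2147483648, stored as a double
as.integer(i)
#> Warning message:
#> NAs introduced by coercion to integer range
#> [1] NA
```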
@PeteHaitch Great analysis Pete. `as.matrix(x[i, ])` is intended to be the canonical way to load a random sample of rows into memory and, as such, is expected to be fast. Unfortunately your analysis shows that it's not as fast as it should be. I'll look into this.
@demuellae Good point about linear indexing not supporting an index larger than `.Machine$integer.max` yet. Adding this to the TODO list.
@PeteHaitch I just made a simple change to `h5mread()` that addresses this.
Here are the timings I was getting before that change (HDF5Array 1.11.10) on my laptop:
```
Unit: milliseconds
                                    expr       min        lq      mean    median        uq       max neval
                           loadAll(x, i)  118.5705  120.0589  126.6097  125.3347  128.1937  144.4930    10
                 loadAllThenSubset(x, i)  115.0670  119.5294  142.3957  123.7600  135.5741  294.1190    10
               singleBracketSubset(x, i) 4826.6771 4932.9701 5092.7289 4975.2277 5073.6786 5672.9008    10
       singleBracketWithSortSubset(x, i) 4780.0357 4893.8078 5126.6864 4992.6848 5250.4584 5891.6236    10
         linearSingleBracketSubset(x, i)  137.1954  142.8483  162.3793  145.2787  149.7091  320.9898    10
 linearSingleBracketSubsetWithSort(x, i)  140.8591  142.0944  152.5863  154.1817  156.8348  176.0262    10
                      extractArray(x, i) 4894.6306 4954.9751 5131.6610 5138.1265 5211.9729 5618.9588    10
```
Consistent with yours.
And here are the timings I get after this change (HDF5Array 1.11.11):
```
Unit: milliseconds
                                    expr      min       lq     mean   median       uq      max neval
                           loadAll(x, i) 129.6935 133.9096 196.8221 136.2655 145.1341 442.3373    10
                 loadAllThenSubset(x, i) 129.7976 132.2286 134.2168 133.6906 137.2730 138.7781    10
               singleBracketSubset(x, i) 111.3307 113.8825 117.2072 117.4472 120.6242 122.0970    10
       singleBracketWithSortSubset(x, i) 112.0190 113.0960 115.3418 114.2827 116.9427 120.6199    10
         linearSingleBracketSubset(x, i) 144.8725 154.0804 250.3987 159.3883 470.1006 479.0221    10
 linearSingleBracketSubsetWithSort(x, i) 147.2243 147.6458 154.1183 153.9714 158.0792 163.7452    10
                      extractArray(x, i) 110.1772 111.3872 114.4275 113.4713 114.8213 126.3528    10
```
So this brings back `as.matrix(x[i, j])` as the recommended way to load random subsets of data into memory.
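For anyone landing here later, a minimal sketch of that recommended pattern (the object below is a small toy matrix, not the 500,000 x 6 one from the benchmark):

```r
library(HDF5Array)

## Small disk-backed matrix for illustration.
x <- writeHDF5Array(matrix(runif(1e4 * 6), ncol = 6))

## Load a random sample of rows into memory as an ordinary matrix.
i <- sample(nrow(x), 1000)
m <- as.matrix(x[i, ])
stopifnot(is.matrix(m), identical(dim(m), c(1000L, 6L)))
```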
@demuellae I'm going to move the issue about linear indexing not supporting an index larger than `.Machine$integer.max` to a new issue.
That's great! Thanks @hpages!
Hi Hervé,
I'm looking to improve the performance of loading random subsets of disk-backed data into memory. My specific case is loading random rows of an HDF5Matrix into memory, but it may be a more general issue.
I use an example of loading 10,000 randomly sampled rows of a 500,000 x 6 HDF5Matrix into memory as an ordinary matrix. I initially compared 2 strategies: "single bracket subset" and "linear single bracket subset". For each of these strategies I also looked to see whether it helped to first sort the index (I thought this might help by improving the contiguity of data access). As a baseline, I compared these with the naive (and generally unviable) method of loading all the data into memory and then subsetting. I later thought that `extract_array()` might be what I'm really looking for and added it to the comparison. In my package and analysis code, I've been using "single bracket subset" (i.e. `as.matrix(x[i, ])`) quite a bit and found it to be a bit of a bottleneck, which is what prompted this investigation.

Remarkably, the "linear single bracket subset" method is only 3x slower than the "load all data then subset" strategy. In contrast, the "single bracket subset" and "extract array" methods are 60-70x slower than the "load all data then subset" strategy.
It looks like the "linear single bracket subset" method trades increased memory allocations for this impressive performance. But my understanding is that "linear single bracket subset" uses the block-processing machinery, so memory usage should be controllable with the tools that are already in place.
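For context on what "linear single bracket subset" amounts to: treating the matrix as a 1-D vector, a row subset becomes a single `x[idx]` call, where `idx` holds one column-major linear index per (row, column) pair, and the result is reshaped afterwards. A sketch of that index arithmetic (the helper name is mine, not from DelayedArray):

```r
## In column-major order, element (r, c) of an n-row matrix sits at
## linear position (c - 1) * n + r. That product is exactly where an
## index can exceed .Machine$integer.max for very large matrices.
linearRowIndices <- function(i, nrow, ncol) {
  as.vector(outer(i, (seq_len(ncol) - 1) * nrow, `+`))
}

m <- matrix(1:15, nrow = 5)
idx <- linearRowIndices(c(2, 4), nrow(m), ncol(m))
stopifnot(identical(matrix(m[idx], nrow = 2), m[c(2, 4), ]))
```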
I think this is a fair analysis, but I may have overlooked something.
What I'm left wondering is: What is (or should be) the canonical way to load a subset of a DelayedArray into memory as an ordinary array, especially row- or column-slices of an HDF5Matrix? Can existing functions benefit from leveraging "linear single bracket subset"? Might there be a need for explicit methods that offer different ways of handling the memory vs. performance tradeoff?
I look forward to hearing your thoughts.
Thanks, Pete
Created on 2018-03-28 by the reprex package (v0.2.0).
Session info
``` r
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value
#>  version  R Under development (unstable) (2018-03-05 r74359)
#>  system   x86_64, darwin15.6.0
#>  ui       X11
#>  language (EN)
#>  collate  en_AU.UTF-8
#>  tz       America/New_York
#>  date     2018-03-28
#> Packages -----------------------------------------------------------------
#>  package        * version date       source
#>  backports        1.1.2   2017-12-13 CRAN (R 3.5.0)
#>  base           * 3.5.0   2018-03-06 local
#>  BiocGenerics   * 0.25.3  2018-02-09 Bioconductor
#>  BiocParallel   * 1.13.3  2018-03-23 Bioconductor
#>  codetools        0.2-15  2016-10-05 CRAN (R 3.5.0)
#>  compiler         3.5.0   2018-03-06 local
#>  datasets       * 3.5.0   2018-03-06 local
#>  DelayedArray   * 0.5.22  2018-03-02 Bioconductor
#>  devtools         1.13.5  2018-02-18 CRAN (R 3.5.0)
#>  digest           0.6.15  2018-01-28 CRAN (R 3.5.0)
#>  evaluate         0.10.1  2017-06-24 CRAN (R 3.5.0)
#>  graphics       * 3.5.0   2018-03-06 local
#>  grDevices      * 3.5.0   2018-03-06 local
#>  grid             3.5.0   2018-03-06 local
#>  HDF5Array      * 1.7.9   2018-03-02 Bioconductor
#>  htmltools        0.3.6   2017-04-28 CRAN (R 3.5.0)
#>  IRanges        * 2.13.28 2018-02-24 Bioconductor
#>  knitr            1.20    2018-02-20 CRAN (R 3.5.0)
#>  lattice          0.20-35 2017-03-25 CRAN (R 3.5.0)
#>  magrittr         1.5     2014-11-22 CRAN (R 3.5.0)
#>  MASS             7.3-49  2018-02-23 CRAN (R 3.5.0)
#>  Matrix           1.2-12  2017-11-20 CRAN (R 3.5.0)
#>  matrixStats    * 0.53.1  2018-02-11 CRAN (R 3.5.0)
#>  memoise          1.1.0   2017-04-21 CRAN (R 3.5.0)
#>  methods        * 3.5.0   2018-03-06 local
#>  microbenchmark * 1.4-4   2018-01-24 CRAN (R 3.5.0)
#>  multcomp         1.4-8   2017-11-08 CRAN (R 3.5.0)
#>  mvtnorm          1.0-7   2018-01-25 CRAN (R 3.5.0)
#>  parallel       * 3.5.0   2018-03-06 local
#>  profmem        * 0.5.0   2018-01-30 CRAN (R 3.5.0)
#>  Rcpp             0.12.16 2018-03-13 CRAN (R 3.5.0)
#>  rhdf5          * 2.23.6  2018-02-15 Github (Bioconductor/rhdf5@f452f9e)
#>  Rhdf5lib         1.1.5   2018-01-11 Bioconductor
#>  rmarkdown        1.9     2018-03-01 CRAN (R 3.5.0)
#>  rprojroot        1.3-2   2018-01-03 CRAN (R 3.5.0)
#>  S4Vectors      * 0.17.38 2018-03-28 Bioconductor
#>  sandwich         2.4-0   2017-07-26 CRAN (R 3.5.0)
#>  splines          3.5.0   2018-03-06 local
#>  stats          * 3.5.0   2018-03-06 local
#>  stats4         * 3.5.0   2018-03-06 local
#>  stringi          1.1.7   2018-03-12 CRAN (R 3.5.0)
#>  stringr          1.3.0   2018-02-19 CRAN (R 3.5.0)
#>  survival         2.41-3  2017-04-04 CRAN (R 3.5.0)
#>  TH.data          1.0-8   2017-01-23 CRAN (R 3.5.0)
#>  tools            3.5.0   2018-03-06 local
#>  utils          * 3.5.0   2018-03-06 local
#>  withr            2.1.2   2018-03-15 CRAN (R 3.5.0)
#>  yaml             2.1.18  2018-03-08 CRAN (R 3.5.0)
#>  zoo              1.8-1   2018-01-08 CRAN (R 3.5.0)
```