Improving performance of loading random subsets of data into memory

PeteHaitch commented 6 years ago

Hi Hervé,

I'm looking to improve the performance of loading random subsets of disk-backed data into memory. My specific case is loading random rows of an HDF5Matrix into memory, but it may be a more general issue.

I use an example of loading 10,000 randomly sampled rows of a 500,000 x 6 HDF5Matrix into memory as an ordinary matrix. I initially compared 2 strategies: "single bracket subset" and "linear single bracket subset". For each of these strategies I also looked to see whether it helped to first sort the index (I thought this might help by improving the contiguity of data access). As a baseline, I compared these with the naive (and generally unviable) method of loading all the data into memory and then subsetting. I later thought that extract_array() might be what I'm really looking for and added it to the comparison. In my package and analysis code, I've been using "single bracket subset" (i.e. as.matrix(x[i, ])) quite a bit and found this to be a bit of a bottleneck, which is what prompted this investigation.

Remarkably, the "linear single bracket subset" method is only 3x slower than the "load all data then subset" strategy. In contrast, the "single bracket subset" and "extract array" methods are 60-70x slower than the "load all data then subset" strategy.

It looks like the "linear single brack subset" method trades off increased memory allocations for this impressive performance. But my understanding is "linear single bracket subset" is using the block-processing machinery, so memory usage should be controllable using the tools that are already in place.

I think this is a fair analysis, but I may have overlooked something.

What I'm left wondering is: What is (or should be) the canonical way for loading a subset of a DelayedArray into memory as an ordinary array, especially row- or column-slices of an HDF5Matrix? Can existing functions benefit by leveraging "linear single brack subset"? Might there be a need for explicit methods that offer different ways of handling the memory vs. performance tradeoff?

I look forward to hearing your thoughts.

Thanks, Pete

# Setup
suppressPackageStartupMessages(library(HDF5Array))
suppressPackageStartupMessages(library(microbenchmark))
suppressPackageStartupMessages(library(profmem))

# Simulate some data
nrow <- 500000
ncol <- 6
x <- writeHDF5Array(matrix(runif(nrow * ncol), nrow = nrow, ncol = ncol))
n <- 10000
i <- sample(nrow(x), n)

# Functions
loadAll <- function(x, i) {
    as.matrix(x)
}
loadAllThenSubset <- function(x, i) {
    as.matrix(x)[i, ]
}
singleBracketSubset <- function(x, i) {
    as.matrix(x[i, ])
}
singleBracketWithSortSubset <- function(x, i) {
    o <- order(i)
    as.matrix(x[i[o], ])[order(o), ]
}
linearSingleBracketSubset <- function(x, i) {
    i2 <- DelayedArray:::to_linear_index(list(i, NULL), dim(x))
    matrix(x[i2], ncol = ncol(x))
}
linearSingleBracketSubsetWithSort <- function(x, i) {
    o <- order(i)
    i2 <- DelayedArray:::to_linear_index(list(i, NULL), dim(x))
    o2 <- order(i2)
    matrix(x[i2[o2]], ncol = ncol(x))[order(o), ]
}
extractArray <- function(x, i) {
    Nindex <- list(i, NULL)
    extract_array(x, Nindex)
}

# Check all methods get same result
all.equal(loadAllThenSubset(x, i), singleBracketSubset(x, i))
#> [1] TRUE
all.equal(loadAllThenSubset(x, i), singleBracketWithSortSubset(x, i))
#> [1] TRUE
all.equal(loadAllThenSubset(x, i), linearSingleBracketSubset(x, i))
#> [1] TRUE
all.equal(loadAllThenSubset(x, i), linearSingleBracketSubsetWithSort(x, i))
#> [1] TRUE
all.equal(loadAllThenSubset(x, i), extractArray(x, i))
#> [1] TRUE

# Benchmark running times
microbenchmark(
    loadAll(x, i),
    loadAllThenSubset(x, i),
    singleBracketSubset(x, i),
    singleBracketWithSortSubset(x, i),
    linearSingleBracketSubset(x, i),
    linearSingleBracketSubsetWithSort(x, i),
    extractArray(x, i),
    times = 10)
#> Unit: milliseconds
#>                                     expr        min         lq       mean
#>                            loadAll(x, i)   210.8005   226.0236   267.8243
#>                  loadAllThenSubset(x, i)   229.3803   236.5944   281.0997
#>                singleBracketSubset(x, i) 15809.3548 15956.4365 16661.4532
#>        singleBracketWithSortSubset(x, i) 15948.9005 16023.4474 16956.4327
#>          linearSingleBracketSubset(x, i)   753.1962   794.5452   835.2245
#>  linearSingleBracketSubsetWithSort(x, i)   736.4124   753.4237   799.8785
#>                       extractArray(x, i) 15911.2468 16203.3846 16754.2316
#>      median         uq        max neval cld
#>    247.3354   276.3357   382.3206    10  a 
#>    243.5460   346.0734   391.9918    10  a 
#>  16098.7406 16440.0989 19399.0879    10   b
#>  16586.2331 17295.1056 19703.1299    10   b
#>    808.5406   893.2855   958.5585    10  a 
#>    786.1045   818.1867   929.4077    10  a 
#>  16395.2204 16658.6316 19342.9889    10   b

# Benchmark total memory allocations
total(profmem(loadAll(x, i)))
#> [1] 88592128
total(profmem(loadAllThenSubset(x, i)))
#> [1] 89069664
total(profmem(singleBracketSubset(x, i)))
#> [1] 5531216
total(profmem(singleBracketWithSortSubset(x, i)))
#> [1] 5547480
total(profmem(linearSingleBracketSubset(x, i)))
#> [1] 425103376
total(profmem(linearSingleBracketSubsetWithSort(x, i)))
#> [1] 422994840
total(profmem(extractArray(x, i)))
#> [1] 5888816

Created on 2018-03-28 by the reprex package (v0.2.0).

Session info

``` r devtools::session_info() #> Session info ------------------------------------------------------------- #> setting value #> version R Under development (unstable) (2018-03-05 r74359) #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_AU.UTF-8 #> tz America/New_York #> date 2018-03-28 #> Packages ----------------------------------------------------------------- #> package * version date source #> backports 1.1.2 2017-12-13 CRAN (R 3.5.0) #> base * 3.5.0 2018-03-06 local #> BiocGenerics * 0.25.3 2018-02-09 Bioconductor #> BiocParallel * 1.13.3 2018-03-23 Bioconductor #> codetools 0.2-15 2016-10-05 CRAN (R 3.5.0) #> compiler 3.5.0 2018-03-06 local #> datasets * 3.5.0 2018-03-06 local #> DelayedArray * 0.5.22 2018-03-02 Bioconductor #> devtools 1.13.5 2018-02-18 CRAN (R 3.5.0) #> digest 0.6.15 2018-01-28 CRAN (R 3.5.0) #> evaluate 0.10.1 2017-06-24 CRAN (R 3.5.0) #> graphics * 3.5.0 2018-03-06 local #> grDevices * 3.5.0 2018-03-06 local #> grid 3.5.0 2018-03-06 local #> HDF5Array * 1.7.9 2018-03-02 Bioconductor #> htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0) #> IRanges * 2.13.28 2018-02-24 Bioconductor #> knitr 1.20 2018-02-20 CRAN (R 3.5.0) #> lattice 0.20-35 2017-03-25 CRAN (R 3.5.0) #> magrittr 1.5 2014-11-22 CRAN (R 3.5.0) #> MASS 7.3-49 2018-02-23 CRAN (R 3.5.0) #> Matrix 1.2-12 2017-11-20 CRAN (R 3.5.0) #> matrixStats * 0.53.1 2018-02-11 CRAN (R 3.5.0) #> memoise 1.1.0 2017-04-21 CRAN (R 3.5.0) #> methods * 3.5.0 2018-03-06 local #> microbenchmark * 1.4-4 2018-01-24 CRAN (R 3.5.0) #> multcomp 1.4-8 2017-11-08 CRAN (R 3.5.0) #> mvtnorm 1.0-7 2018-01-25 CRAN (R 3.5.0) #> parallel * 3.5.0 2018-03-06 local #> profmem * 0.5.0 2018-01-30 CRAN (R 3.5.0) #> Rcpp 0.12.16 2018-03-13 CRAN (R 3.5.0) #> rhdf5 * 2.23.6 2018-02-15 Github (Bioconductor/rhdf5@f452f9e) #> Rhdf5lib 1.1.5 2018-01-11 Bioconductor #> rmarkdown 1.9 2018-03-01 CRAN (R 3.5.0) #> rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0) #> S4Vectors * 0.17.38 2018-03-28 Bioconductor #> sandwich 2.4-0 2017-07-26 CRAN (R 3.5.0) #> splines 3.5.0 2018-03-06 local #> stats * 3.5.0 2018-03-06 local #> stats4 * 3.5.0 2018-03-06 local #> stringi 1.1.7 2018-03-12 CRAN (R 3.5.0) #> stringr 1.3.0 2018-02-19 CRAN (R 3.5.0) #> survival 2.41-3 2017-04-04 CRAN (R 3.5.0) #> TH.data 1.0-8 2017-01-23 CRAN (R 3.5.0) #> tools 3.5.0 2018-03-06 local #> utils * 3.5.0 2018-03-06 local #> withr 2.1.2 2018-03-15 CRAN (R 3.5.0) #> yaml 2.1.18 2018-03-08 CRAN (R 3.5.0) #> zoo 1.8-1 2018-01-08 CRAN (R 3.5.0) ```

demuellae commented 5 years ago

Thanks Pete for pointing this out. Very useful to know. Any news from the DelayedArray developers on this?

I would also like to note, that the linear indexing causes trouble when an index is larger than .Machine$integer.max:

Error: subscript contains NAs In addition: Warning message: In NSBS(i, x, exact = exact, strict.upper.bound = !allow.append, : NAs introduced by coercion to integer range

I am assuming because indices larger than .Machine$integer.max get converted into NAs somewhere downstream (sorry, have not yet been able to find out where exactly)

hpages commented 5 years ago

@PeteHaitch Great analysis Pete. as.matrix(x[i, ]) is intended to be the canonical way to load a random sample of rows into memory, and, as such, is expected to be fast. Unfortunately your analysis shows that it's not as fast as it should be. I'll look into this.

@demuellae Good point about linear indexing not supporting an index larger than .Machine$integer.max yet. Adding this to the TODO list.

hpages commented 5 years ago

@PeteHaitch I just made a simple change to h5mread() that addresses this.

Here are the timings I was getting before that change (HDF5Array 1.11.10) on my laptop:

Unit: milliseconds
                                    expr       min        lq      mean
                           loadAll(x, i)  118.5705  120.0589  126.6097
                 loadAllThenSubset(x, i)  115.0670  119.5294  142.3957
               singleBracketSubset(x, i) 4826.6771 4932.9701 5092.7289
       singleBracketWithSortSubset(x, i) 4780.0357 4893.8078 5126.6864
         linearSingleBracketSubset(x, i)  137.1954  142.8483  162.3793
 linearSingleBracketSubsetWithSort(x, i)  140.8591  142.0944  152.5863
                      extractArray(x, i) 4894.6306 4954.9751 5131.6610
    median        uq       max neval
  125.3347  128.1937  144.4930    10
  123.7600  135.5741  294.1190    10
 4975.2277 5073.6786 5672.9008    10
 4992.6848 5250.4584 5891.6236    10
  145.2787  149.7091  320.9898    10
  154.1817  156.8348  176.0262    10
 5138.1265 5211.9729 5618.9588    10

Consistent with yours.

And here are the timings I get after this change (HDF5Array 1.11.11):

Unit: milliseconds
                                    expr      min       lq     mean   median
                           loadAll(x, i) 129.6935 133.9096 196.8221 136.2655
                 loadAllThenSubset(x, i) 129.7976 132.2286 134.2168 133.6906
               singleBracketSubset(x, i) 111.3307 113.8825 117.2072 117.4472
       singleBracketWithSortSubset(x, i) 112.0190 113.0960 115.3418 114.2827
         linearSingleBracketSubset(x, i) 144.8725 154.0804 250.3987 159.3883
 linearSingleBracketSubsetWithSort(x, i) 147.2243 147.6458 154.1183 153.9714
                      extractArray(x, i) 110.1772 111.3872 114.4275 113.4713
       uq      max neval
 145.1341 442.3373    10
 137.2730 138.7781    10
 120.6242 122.0970    10
 116.9427 120.6199    10
 470.1006 479.0221    10
 158.0792 163.7452    10
 114.8213 126.3528    10

So this brings back as.matrix(x[i, j]) as the recommended way to load random subsets of data into memory.

@demuellae I'm going to move the issue about linear indexing not supporting an index larger than .Machine$integer.max to a new issue.

PeteHaitch commented 5 years ago

That's great! Thanks @hpages!

Bioconductor / DelayedArray

Improving performance of loading random subsets of data into memory #13