Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array
9 stars 13 forks source link

1.5-2x speed-up for as.matrix(HDF5Matrix) #1

Closed PeteHaitch closed 6 years ago

PeteHaitch commented 6 years ago

Hi Hervé,

Adding an as.matrix,HDF5Matrix-method gives a nice little speed-up over the current method of calling as.matrix,DelayedArray-method via inheritance:

suppressPackageStartupMessages(library(HDF5Array))
library(microbenchmark)

setMethod("as.matrix", "HDF5Matrix", function(x, ...) {
  HDF5Array:::.extract_array_from_HDF5ArraySeed(seed(x), unname(x@index))
})

nrow <- 1e6
ncol <- 10
x <- writeHDF5Array(matrix(runif(nrow * ncol), 
                           ncol = ncol),
                   level = 0)
x2 <- as(x, "DelayedMatrix")

microbenchmark(
  HDF5Matrix = as.matrix(x),
  DelayedMatrix = as.matrix(x2),
  times = 10)
#> Unit: milliseconds
#>           expr      min       lq     mean   median       uq      max neval
#>     HDF5Matrix 311.7019 367.5142 391.0141 372.2218 424.1370 496.5871    10
#>  DelayedMatrix 619.2046 636.2662 702.1828 684.6771 761.7285 847.4368    10
#>  cld
#>   a 
#>    b

class(as.matrix(x))
#> [1] "matrix"
all.equal(as.matrix(x), as.matrix(x2))
#> [1] TRUE
Session info ``` r devtools::session_info() #> Session info ------------------------------------------------------------- #> setting value #> version R Under development (unstable) (2017-10-30 r73642) #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_AU.UTF-8 #> tz America/New_York #> date 2017-11-26 #> Packages ----------------------------------------------------------------- #> package * version date source #> backports 1.1.1 2017-09-25 CRAN (R 3.5.0) #> base * 3.5.0 2017-10-31 local #> BiocGenerics * 0.25.0 2017-10-31 Bioconductor #> codetools 0.2-15 2016-10-05 CRAN (R 3.5.0) #> colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0) #> compiler 3.5.0 2017-10-31 local #> datasets * 3.5.0 2017-10-31 local #> DelayedArray * 0.5.4 2017-11-24 Bioconductor #> devtools 1.13.4 2017-11-09 CRAN (R 3.5.0) #> digest 0.6.12 2017-01-27 CRAN (R 3.5.0) #> evaluate 0.10.1 2017-06-24 CRAN (R 3.5.0) #> ggplot2 2.2.1 2016-12-30 CRAN (R 3.5.0) #> graphics * 3.5.0 2017-10-31 local #> grDevices * 3.5.0 2017-10-31 local #> grid 3.5.0 2017-10-31 local #> gtable 0.2.0 2016-02-26 CRAN (R 3.5.0) #> HDF5Array * 1.7.1 2017-11-24 Bioconductor #> htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0) #> IRanges * 2.13.4 2017-11-22 Bioconductor #> knitr 1.17 2017-08-10 CRAN (R 3.5.0) #> lattice 0.20-35 2017-03-25 CRAN (R 3.5.0) #> lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0) #> magrittr 1.5 2014-11-22 CRAN (R 3.5.0) #> MASS 7.3-47 2017-02-26 CRAN (R 3.5.0) #> Matrix 1.2-12 2017-11-15 CRAN (R 3.5.0) #> matrixStats * 0.52.2 2017-04-14 CRAN (R 3.5.0) #> memoise 1.1.0 2017-04-21 CRAN (R 3.5.0) #> methods * 3.5.0 2017-10-31 local #> microbenchmark * 1.4-2.1 2015-11-25 CRAN (R 3.5.0) #> multcomp 1.4-8 2017-11-08 CRAN (R 3.5.0) #> munsell 0.4.3 2016-02-13 CRAN (R 3.5.0) #> mvtnorm 1.0-6 2017-03-02 CRAN (R 3.5.0) #> parallel * 3.5.0 2017-10-31 local #> plyr 1.8.4 2016-06-08 CRAN (R 3.5.0) #> Rcpp 0.12.14 2017-11-23 CRAN (R 3.5.0) #> rhdf5 * 2.23.0 2017-10-31 Bioconductor #> rlang 0.1.4 2017-11-05 CRAN (R 3.5.0) #> rmarkdown 1.8 2017-11-17 CRAN (R 3.5.0) #> rprojroot 1.2 2017-01-16 CRAN (R 3.5.0) #> S4Vectors * 0.17.10 2017-11-23 Bioconductor #> sandwich 2.4-0 2017-07-26 CRAN (R 3.5.0) #> scales 0.5.0 2017-08-24 CRAN (R 3.5.0) #> splines 3.5.0 2017-10-31 local #> stats * 3.5.0 2017-10-31 local #> stats4 * 3.5.0 2017-10-31 local #> stringi 1.1.6 2017-11-17 CRAN (R 3.5.0) #> stringr 1.2.0 2017-02-18 CRAN (R 3.5.0) #> survival 2.41-3 2017-04-04 CRAN (R 3.5.0) #> TH.data 1.0-8 2017-01-23 CRAN (R 3.5.0) #> tibble 1.3.4 2017-08-22 CRAN (R 3.5.0) #> tools 3.5.0 2017-10-31 local #> utils * 3.5.0 2017-10-31 local #> withr 2.1.0 2017-11-01 CRAN (R 3.5.0) #> yaml 2.1.14 2016-11-12 CRAN (R 3.5.0) #> zlibbioc 1.25.0 2017-10-31 Bioconductor #> zoo 1.8-0 2017-04-12 CRAN (R 3.5.0) ```

We know a HDF5Matrix doesn't have any delayed ops (otherwise it'd be a DelayedMatrix) so I think this is safe.

For the same reasons I think we could also have an as.array,HDF5Array-method.

Cheers, Pete

PeteHaitch commented 6 years ago

Now I think about it some more, just defining as.array,HDF5Array-method might give us as.matrix,HDF5Matrix-method, albeit with a tiny bit of S4 method dispatch overhead.

hpages commented 6 years ago

Hi Pete,

After some initial investigation, it seems that the as.array and as.matrix methods for DelayedArray objects are slowed down by R copy-on-change mechanism being triggered even when nothing is changed e.g. in situations like dimnames(ans) <- NULL when ans has no dimnames. It's possible to work around this by testing before modifying e.g.

new_dimnames <- .get_DelayedArray_dimnames_before_transpose(x)
if (!identical(dimnames(ans), new_dimnames))
    dimnames(ans) <- new_dimnames

By doing this kind of change in several places in the as.array and as.matrix methods for DelayedArray objects, I think I can make them almost as fast as your as.matrix method for HDF5Matrix objects. I prefer this over introducing a dedicated method for HDF5Matrix objects as it will benefit as.array() and as.matrix() on all DelayedArray objects, not just on HDF5Matrix objects.

I'll try to commit something today and will let you know.

H.

grimbough commented 6 years ago

Thanks for looking into this. Just wanted to add that I'd found the same copying behaviour when setting the dimnames and come to the same solution e.g. 5c6937f and d61695b

PeteHaitch commented 6 years ago

Ah, I'd forgotten that solution you'd found, @grimbough. It'd be great if that means all DelayedArray subclasses get a speed-up, @hpages

hpages commented 6 years ago

Thanks @grimbough for confirming the copying behaviour!

I added set_dim() and set_dimnames() to the DelayedArray package (https://github.com/Bioconductor/DelayedArray/commit/53530e22dc880d92910de975e24c2a999181f117). These are dim<- and dimnames<- wrappers that use a do-not-touch-if-no-op approach. By using them everywhere in place of dim<- and dimnames<- in DelayedArray, I get minor speedups in various places. In particular this makes as.array() and as.matrix() as fast as direct extraction with extract_array() on a pristine DelayedArray object. So all is good :-)

PeteHaitch commented 6 years ago

Thanks, @hpages! Confirmed that this speed-up follows through to other DelayedArray subclasses, like fstMatrix in fstArray