Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array

HDF5Matrix subsetting loads subset into memory #18

Closed avalcarcel9 closed 4 years ago

avalcarcel9 commented 4 years ago

When subsetting an HDF5Matrix the subset is loaded into memory and returned.

As a simple example you can use the code below. The results returned to me are commented out. When subsetting using [ it seems that the object is being loaded into memory rather than remaining an HDF5Matrix.

library(HDF5Array)  # provides the HDF5Array and HDF5Matrix classes
x = as(matrix(1:40, ncol = 4), "HDF5Array")
class(x)
# [1] "HDF5Matrix"
# attr(,"package")
# [1] "HDF5Array"
x1 = x[,1]
class(x1)
# [1] "integer"

I am using the development version of the package for Bioconductor 3.10. See session info below.

─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.1 (2019-07-05)
 os       macOS Mojave 10.14.5        
 system   x86_64, darwin15.6.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/New_York            
 date     2019-08-20                  

─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package      * version date       lib source        
 assertthat     0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
 BiocGenerics * 0.31.5  2019-07-03 [1] Bioconductor  
 BiocParallel * 1.19.2  2019-08-07 [1] Bioconductor  
 cli            1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
 clipr          0.7.0   2019-07-23 [1] CRAN (R 3.6.0)
 crayon         1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
 DelayedArray * 0.11.4  2019-07-03 [1] Bioconductor  
 HDF5Array    * 1.13.5  2019-08-06 [1] Bioconductor  
 IRanges      * 2.19.10 2019-06-11 [1] Bioconductor  
 lattice        0.20-38 2018-11-04 [1] CRAN (R 3.6.1)
 Matrix         1.2-17  2019-03-22 [1] CRAN (R 3.6.1)
 matrixStats  * 0.54.0  2018-07-23 [1] CRAN (R 3.6.0)
 packrat        0.5.0   2018-11-14 [1] CRAN (R 3.6.0)
 rhdf5        * 2.29.0  2019-05-02 [1] Bioconductor  
 Rhdf5lib       1.7.4   2019-07-30 [1] Bioconductor  
 rstudioapi     0.10    2019-03-19 [1] CRAN (R 3.6.0)
 S4Vectors    * 0.23.18 2019-08-16 [1] Bioconductor  
 sessioninfo  * 1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
 withr          2.1.2   2018-03-15 [1] CRAN (R 3.6.0)

[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

LTLA commented 4 years ago

This is consistent with the subsetting of ordinary matrices; see Bioconductor/DelayedArray#6. If you want the result to stay a delayed, matrix-like object rather than an in-memory vector, use drop=FALSE.
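
To make the comparison with ordinary matrices concrete, here is a minimal base R sketch. The variable names m, v, and k are illustrative only and do not come from the thread. Selecting a single column drops the dimensions, while drop = FALSE keeps them.

```r
# An ordinary in-memory matrix with the same shape as the example above.
m <- matrix(1:40, ncol = 4)

# Selecting one column drops the dim attribute: the result is a plain vector.
v <- m[, 1]
class(v)   # "integer"
dim(v)     # NULL

# drop = FALSE keeps both dimensions, so the result is still a matrix.
k <- m[, 1, drop = FALSE]
dim(k)     # 10 1
```

The same drop argument is what controls whether a single-row or single-column subset of an HDF5Matrix is returned as an ordinary vector.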

avalcarcel9 commented 4 years ago

Thanks @LTLA! This makes sense now. I agree with the discussion in the original post: at first it's confusing that the data are loaded into memory without explicitly forcing this, but I also agree that avid users will more often need the data loaded after subsetting.

I was following some documentation/instructions from a workshop found here in 15.1 Overview. You'll see that in that section the subset da_Rle[1:10,] is taken and the data remain a DelayedMatrix rather than being loaded into memory. Is this a property specific to RleMatrix and DelayedMatrix? Or maybe the user changed some profile settings to automatically use drop = FALSE? I've found this to be some of the more thorough documentation on using the package, and it made me think that what I observed was a bug.

hpages commented 4 years ago

@avalcarcel9

The result of subsetting is returned as an ordinary vector only when it is a mono-dimensional slice (e.g. x[ , 3]). In the example you are referring to, da_Rle[1:10, ] has dimensions 10 x 2, so nothing gets dropped, i.e. the result is still a DelayedMatrix object. But if you select a single row (e.g. with da_Rle[10, ]) or a single column (e.g. da_Rle[ , 2]), then the result will be loaded into memory and returned as an ordinary vector.
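
The rule above can be checked with a short sketch, assuming the HDF5Array package is installed. The object x mirrors the one from the original report; the names two_d, one_d, and kept are illustrative and not part of the thread.

```r
library(HDF5Array)

# Write a 10 x 4 integer matrix to a temporary HDF5 file and wrap it.
x <- as(matrix(1:40, ncol = 4), "HDF5Array")

# Several rows selected: the result is still 2-dimensional, so it stays delayed.
two_d <- x[1:3, ]
is(two_d, "DelayedMatrix")   # TRUE
dim(two_d)                   # 3 4

# A single column is a 1-dimensional slice, so it is loaded into memory
# and returned as an ordinary vector.
one_d <- x[, 2]
class(one_d)                 # "integer"

# drop = FALSE keeps the single column as a 10 x 1 delayed matrix.
kept <- x[, 2, drop = FALSE]
is(kept, "DelayedMatrix")    # TRUE
dim(kept)                    # 10 1
```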

Note that you can find a more recent version of the "Effectively using the DelayedArray framework to support the analysis of large datasets" workshop here: http://biocworkshops2019.bioconductor.org.s3-website-us-east-1.amazonaws.com/page/DelayedArrayWorkshop__Effectively_using_the_DelayedArray_framework_for_users/

H.

avalcarcel9 commented 4 years ago

Thanks @hpages for the clarification! I think this makes sense now! Also thanks for the updated documentation! I'm closing this!