Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array

Reshape array to Matrix without reading in #16

Closed (muschellij2 closed this issue 3 years ago)

muschellij2 commented 5 years ago

I would like to reshape a 3D or 4D array into a matrix without reading it into memory. This is for imaging data. A 3D array should become a matrix with a single column, and a 4D array should become a matrix with one column per entry of the 4th dimension, so in general the number of columns equals the size of the 4th dimension. Overall, I'd like to be able to "reshape" the array. Since you cannot assign dim to the array except to add "empty" dimensions of size 1, I'm not really sure what the procedure is.
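To make the target layout concrete, here is a minimal sketch using ordinary in-memory base R arrays. It is not an HDF5-backed or delayed solution; it only illustrates the shape being asked for, and the small dimensions and the names `arr3`, `arr4`, `mat3`, `mat4`, and `back` are made up for the illustration.

```r
# In-memory illustration only: plain base R arrays, nothing is delayed here.
arr3 <- array(seq_len(2 * 3 * 4), dim = c(2, 3, 4))
mat3 <- arr3
dim(mat3) <- c(prod(dim(arr3)), 1L)                  # 3D -> 24 x 1 matrix

arr4 <- array(seq_len(2 * 3 * 4 * 5), dim = c(2, 3, 4, 5))
mat4 <- arr4
dim(mat4) <- c(prod(dim(arr4)[1:3]), dim(arr4)[4])   # 4D -> 24 x 5 matrix

# Column k of mat4 is the k-th 3D volume, flattened in column-major order.
stopifnot(identical(mat4[, 2], c(arr4[, , , 2])))

# The reshape is reversible: restoring the original dim gives back the array.
back <- mat4
dim(back) <- dim(arr4)
stopifnot(identical(back, arr4))
```

Because R stores arrays in column-major order, this reshape only reinterprets the dimensions and never reorders the data, which is why a delayed equivalent seems plausible in principle.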

I have an example that works, but I'm not sure it's efficient in either memory or speed. I'd also like the H5 file to store the original array, not the matrix version. Overall, the goal is to have the data as 3D arrays (one per person), use DelayedMatrixStats from @PeteHaitch to work on the matrix form, and then convert back to an array.

@avalcarcel9

Any input would be helpful. Here is a toy example:

library(HDF5Array)
#> Loading required package: DelayedArray
#> Loading required package: stats4
#> Loading required package: matrixStats
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter,
#>     Find, get, grep, grepl, intersect, is.unsorted, lapply, Map,
#>     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
#>     setdiff, sort, table, tapply, union, unique, unsplit, which,
#>     which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> Loading required package: IRanges
#> Loading required package: BiocParallel
#> 
#> Attaching package: 'DelayedArray'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
#> The following objects are masked from 'package:base':
#> 
#>     aperm, apply, rowsum
#> Loading required package: rhdf5
arr = array(rnorm(10^3), dim = c(10, 10, 10))
filepath = tempfile()
res = writeHDF5Array(arr, filepath = filepath)

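dres = dim(res)
# For each slice along the 3rd dimension, stack its columns into one long column
rl = pbapply::pblapply(seq_len(dres[3]), function(i) {
  xx = res[,,i]  # 2D delayed slice i
  cols = lapply(seq_len(dres[2]), function(j) {
    xx[, j, drop = FALSE]
  })
  do.call(DelayedArray::arbind, cols)
})
# Stack the per-slice columns into a single one-column delayed matrix
rl = do.call(DelayedArray::arbind, rl)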

vec = c(rl)
vec_res = c(res)
corr_vec = c(arr)
all(vec_res == corr_vec)
#> [1] TRUE
all(vec == corr_vec)
#> [1] TRUE

Created on 2019-07-01 by the reprex package (v0.3.0)

hpages commented 5 years ago

Addressed here.