Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray

`NSBS` method for `DelayedArray` objects? #108

Closed: LTLA closed this issue 11 months ago

LTLA commented 11 months ago

While writing https://github.com/LTLA/ParquetDataFrame, it occurred to me that it would be nice to use a 1-dimensional DelayedArray (containing integer indices or logical filters) for subsetting BioC data structures:

library(SummarizedExperiment)
library(DelayedArray)

se <- SummarizedExperiment(matrix(rnorm(1000), 100, 10))
i <- DelayedArray(array(1:10))
se[i,]
## Error in (function (classes, fdef, mtable)  : 
##   unable to find an inherited method for function ‘NSBS’ for signature ‘"DelayedArray"’

df <- DataFrame(whee=1:1000)
df[i,]
## Error in (function (classes, fdef, mtable)  : 
##   unable to find an inherited method for function ‘NSBS’ for signature ‘"DelayedArray"’

Being able to do this would be convenient because my ParquetDataFrame returns 1-dimensional file-backed DelayedArrays representing the columnar data. So, if an NSBS method were available, it would allow users to do something like:

library(ParquetDataFrame)
df <- ParquetDataFrame(path_to_parquet_file)
keep <- df$foo > 1 & df$bar < -10 # this is a DelayedArray filter
sub <- df[keep,]

... without having to remember to call as.vector(keep) before it goes into NSBS via the DataFrame's [.
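
For reference, the explicit conversion described above can be sketched as follows. This is only an illustration that reuses the toy `df` and `i` from the earlier example; `drop=FALSE` is an addition here, included just so that the single-column result stays a DataFrame.

```r
library(S4Vectors)
library(DelayedArray)

df <- DataFrame(whee=1:1000)
i <- DelayedArray(array(1:10))   # 1-dimensional DelayedArray of integer indices

# Realize the delayed subscript as an ordinary integer vector, then subset.
sub <- df[as.vector(i), , drop=FALSE]
nrow(sub)
```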

(Technically, it seems most appropriate to define an NSBS method for a hypothetical DelayedVector class that can be used as a subscripting vector, rather than pretending to have a vector via a 1-dimensional array. Certainly, if a DelayedVector were available, my ParquetColumnVector would just derive from it. I could probably make a SQLColumnVector as well.)

Session information

```
R version 4.3.1 Patched (2023-08-28 r85047)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-3-branch/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-3-branch/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] DelayedArray_0.27.10        SparseArray_1.1.12
 [3] S4Arrays_1.1.6              abind_1.4-5
 [5] Matrix_1.6-1.1              SummarizedExperiment_1.31.1
 [7] Biobase_2.61.0              GenomicRanges_1.53.1
 [9] GenomeInfoDb_1.37.6         MatrixGenerics_1.13.1
[11] matrixStats_1.0.0           IRanges_2.35.2
[13] S4Vectors_0.39.2            BiocGenerics_0.47.0

loaded via a namespace (and not attached):
 [1] zlibbioc_1.47.0         lattice_0.21-9          GenomeInfoDbData_1.2.10
 [4] XVector_0.41.1          RCurl_1.98-1.12         bitops_1.0-7
 [7] grid_4.3.1              compiler_4.3.1          tools_4.3.1
[10] crayon_1.5.2
```

hpages commented 11 months ago

I would just add an NSBS,ANY method in S4Vectors that accepts any array-like subscript i with a single dimension, and replaces it with as.vector(i). That would also handle 1D subscripts of other types, such as a 1D SparseArray. Would that work?
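
The idea of collapsing a 1D array-like subscript can be illustrated in base R. The sketch below is not the S4Vectors implementation: `collapse_1d_subscript()` is a hypothetical helper, and it only assumes that the subscript supports `dim()` and `as.vector()`.

```r
# Hypothetical helper, for illustration only (not part of S4Vectors):
# turn a one-dimensional array-like subscript into a plain vector.
collapse_1d_subscript <- function(i) {
    i_dim <- dim(i)
    if (is.null(i_dim)) {
        return(i)  # no dimensions, so leave the subscript untouched
    }
    if (length(i_dim) != 1L) {
        stop("array-like subscript must have exactly one dimension")
    }
    as.vector(i)
}

x <- letters

i <- array(c(2L, 4L, 6L))            # 1D integer array used as indices
by_index <- x[collapse_1d_subscript(i)]

keep <- array(seq_along(x) > 23L)    # 1D logical array used as a filter
by_filter <- x[collapse_1d_subscript(keep)]
```

Subscripts with two or more dimensions are rejected rather than silently flattened, since flattening a matrix would change what the subscript means.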

Great that you're working on a ParquetDataFrame container!

LTLA commented 11 months ago

> Would that work?

Yes, I think that would be a good solution.

hpages commented 11 months ago

Added to S4Vectors 0.39.3: https://github.com/Bioconductor/S4Vectors/commit/15349ef40f141b16df6daf3e38f3782ef54eb60c