High memory usage when using assay() on large RaggedExperiments

Bioconductor / RaggedExperiment

Matrix-like representations of mutation and CN data

https://bioconductor.org/packages/RaggedExperiment

High memory usage when using assay() on large RaggedExperiments #25

Open biobenkj opened 4 years ago

biobenkj commented 4 years ago

RaggedExperiment continues to rule for all our 'omics-related work! I did notice something interesting yesterday while running compactSummarizedExperiment(): when I try to access the names of the assays in a large RE, the cost depends heavily on which accessor I use.

# RaggedExperiment in question
> aaml
class: RaggedExperiment
dim: 36019710 1401
assays(2): pc compartments
rownames: NULL
colnames(1401): 813584_Dx 814465_Dx ... RO02776B RO02815
colData names(99): Timepoint Gender ... MLLT10 KMT2A

# size
> object_size(aaml)
1.01 GB

# names access via assay(): high memory usage (100s of GB)
names(assay(aaml))

# names access via assayNames(): near instant
assayNames(aaml)

It is either near instantaneous with assayNames(), or requires 100s of GB of memory with names(assay(my_RE)). Do you know why this might be the case? I'll work on getting a smaller reproducible example if there is interest.

Thanks again for all that you do and RaggedExperiments!

mtmorgan commented 4 years ago

I believe, without actually checking, that the names are stored independently of the underlying data representation, and the cost is associated with adding names and hence duplicating the underlying data. If it's 'easy' to simulate the data for a reproducible example that would be great.

LiNk-NY commented 4 years ago

Hi Ben (@biobenkj), I'm glad to hear you are making use of this data representation! The trick behind RaggedExperiment is that it provides a matrix representation of a GRangesList object. In the background, the stored representation is a GRangesList, so accessing the metadata is relatively straightforward. When using assay(), the GRangesList representation has to be converted to a matrix. This involves creating quite a large sparse matrix from the mcols in the original GRangesList, which is a costly operation. I agree, a minimal and reproducible example would be helpful. We'll see what we can do to increase the efficiency of this conversion. Thank you.
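For a rough sense of scale, the sketch below estimates the size of a matrix with the dimensions reported above (36019710 x 1401) if every cell is stored explicitly. It assumes 8 bytes per numeric cell (4 for integer or logical) and that the conversion materializes the full matrix in memory, which has not been verified against the package internals here. `dense_matrix_gb` is a hypothetical helper, not part of RaggedExperiment.

```r
# Hypothetical helper (not part of RaggedExperiment): estimated size in GB
# of an nrow x ncol matrix with every cell stored, ignoring dimnames and
# object headers.
dense_matrix_gb <- function(nrow, ncol, bytes_per_cell = 8) {
    # as.numeric() avoids integer overflow when callers pass integer dims
    as.numeric(nrow) * as.numeric(ncol) * bytes_per_cell / 1e9
}

n_ranges  <- 36019710  # rows reported by show(aaml)
n_samples <- 1401      # columns reported by show(aaml)

double_gb  <- dense_matrix_gb(n_ranges, n_samples)     # numeric cells
integer_gb <- dense_matrix_gb(n_ranges, n_samples, 4)  # integer/logical cells

cat(sprintf("all cells as double:  ~%.0f GB\n", double_gb))
cat(sprintf("all cells as integer: ~%.0f GB\n", integer_gb))
```

Under these assumptions the estimate comes out at roughly 400 GB for doubles and 200 GB for integers, which is the same order of magnitude as the "100s of GB" reported in the original post.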

LiNk-NY commented 3 years ago

@biobenkj Any updates on this? Would a dgCMatrix representation help? Have you tested this? We can create additional functionality to return this data representation. If you can provide a reproducible example to help this move along, that would be great. Thanks!
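To illustrate why a dgCMatrix might help, here is a minimal sketch using only the Matrix package on simulated toy data, not the RaggedExperiment API. It assumes each range is observed in exactly one sample, which is an illustrative simplification; the toy dimensions and variable names are made up for this example. One caveat: a dgCMatrix treats unstored cells as 0 rather than NA, so a real implementation would need to decide how to tell missing values apart from true zeros.

```r
library(Matrix)  # recommended package shipped with most R installations

set.seed(1)
n_ranges  <- 20000L  # toy stand-in for the number of ranges
n_samples <- 50L     # toy stand-in for the number of samples

# Simulated ragged data: each range is observed in exactly one sample.
rows   <- seq_len(n_ranges)
cols   <- sample.int(n_samples, n_ranges, replace = TRUE)
values <- rnorm(n_ranges)

# Sparse representation: only the observed cells are stored.
sparse <- sparseMatrix(i = rows, j = cols, x = values,
                       dims = c(n_ranges, n_samples))
stopifnot(inherits(sparse, "dgCMatrix"))

# Dense representation of the same data, with every cell stored.
dense <- as.matrix(sparse)

# Compare the in-memory footprint of the two representations.
cat("sparse:", format(object.size(sparse), units = "MB"), "\n")
cat("dense: ", format(object.size(dense),  units = "MB"), "\n")
```

The sparse object grows with the number of observed values, while the dense one grows with ranges times samples, so the gap should widen as more samples are added.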