scottgigante opened this issue 4 years ago
Probably it's hitting beachmat code in C++, which doesn't know anything about `dgTMatrix` and `dgRMatrix`. All such "unknown" matrices are instead handled via block processing, which realizes blocks of the matrix as ordinary arrays with an upper memory usage defined by `DelayedArray::getAutoBlockSize()`. This defaults to 100 MB.
There are no plans to provide native support for `dgTMatrix` and `dgRMatrix`. The former can't be accessed efficiently, and R-level operations convert them to `dgCMatrix` anyway. As for the latter, I have never seen it used in real analyses.
You may have never seen it, but it's the reason why my jobs are failing, hence this issue report :) It seems like a fairly trivial fix to include a

```r
if (is(X, "sparseMatrix")) X <- as(X, "CsparseMatrix")
```

rather than silently coercing the matrix to dense and hoping the user notices and fixes it themselves.
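A minimal sketch of what that guard would do, using only the Matrix package (the matrix and its values here are made up for illustration):

```r
library(Matrix)

# A small sparse matrix built in triplet (dgTMatrix) form.
X <- sparseMatrix(i = c(1, 3, 2), j = c(1, 2, 3), x = c(1.5, 2.0, 3.5),
                  dims = c(3, 3), repr = "T")
class(X)[1]  # "dgTMatrix"

# The proposed guard: coerce any sparse input to column-compressed form.
if (is(X, "sparseMatrix")) X <- as(X, "CsparseMatrix")
class(X)[1]  # "dgCMatrix"
```

The same one-liner also covers `dgRMatrix` input, since it too inherits from `sparseMatrix`.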
It's not like the entire matrix is being coerced to dense form. It's bounded by a 100 MB limit, as defined by `getAutoBlockSize()`; your jobs should be able to handle that. The same code is used to handle all unknown matrices, e.g., `RleMatrix`, `ResidualMatrix`, and various other forms of `DelayedMatrix` representations: I don't see why `dgTMatrix`es and `dgRMatrix`es should get special treatment here.
In this toy example, the `dgTMatrix` also takes ~100x longer than the `dgCMatrix`. Seems like a good reason to me.
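The toy example itself isn't quoted in the thread, so the following is only a stand-in benchmark. Matrix-vector products are one operation where triplet matrices are typically coerced to column-compressed form before the work is done, which is one plausible source of the slowdown:

```r
library(Matrix)

set.seed(1)
Xc <- rsparsematrix(2000, 2000, density = 0.01)  # dgCMatrix by default
Xt <- as(Xc, "TsparseMatrix")                    # same data as a dgTMatrix
v  <- rep(1, ncol(Xc))

# Repeated matrix-vector products: the triplet form typically pays for a
# coercion to column-compressed storage, while the CSC form does not.
system.time(for (i in 1:200) Xc %*% v)
system.time(for (i in 1:200) Xt %*% v)
```

Both loops produce identical results; only the storage layout, and hence the time taken, differs.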
One could say that about many matrix formats. For example, I could store a sparse matrix in an `HDF5Matrix`, and under this reasoning, the function would be expected to convert it to a `dgCMatrix` for further processing. This is not a decision that the function should be allowed to make; for example, it would not be aware of the memory constraints that motivated the use of the `HDF5Matrix` in the first place.
In the specific case of the `dgTMatrix`, an automated conversion is unwise if there are specific reasons for storing it as a `dgTMatrix` instead of a `dgCMatrix`. The most obvious is that the latter fails due to integer overflow in its `p` vector when there are more than ~2e9 non-zero elements in the matrix; the former stays operational at the cost of speed and memory.
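That `p` slot limitation is easy to see directly; the sketch below just inspects the internal slots of a toy matrix (nothing here is scran-specific):

```r
library(Matrix)

m <- as(Matrix(c(0, 1, 0, 2, 3, 0), nrow = 2, sparse = TRUE), "CsparseMatrix")

# In CSC form, p holds cumulative non-zero counts per column as 32-bit
# integers, so total non-zeros cannot exceed .Machine$integer.max (~2.1e9).
typeof(m@p)
.Machine$integer.max

# The triplet form stores plain (i, j, x) vectors with no cumulative counter,
# which is why it survives past that limit, at a speed/memory cost.
mt <- as(m, "TsparseMatrix")
length(mt@i) == length(mt@x)
```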
Given that you're talking about toy examples, the easiest solution would just be to convert your matrix to the desired format. scran won't do any conversion; that is not its decision to make. For real analyses, all ingestion pipelines that I use produce a `dgCMatrix` directly, so I don't think this lack of automated conversion has much practical consequence.
Some sparse matrix formats are causing massive memory allocation. `dgCMatrix` works perfectly, using only a total of ~2 MB, while `dgTMatrix` allocates a whopping 65 MB for the same operation, and `dgRMatrix` shows the same issue, allocating ~40 MB.
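The code blocks behind those numbers did not survive the thread, so here is only a rough proxy: comparing the storage footprint of the same data in each representation, with the dense form being what block processing materializes piece by piece:

```r
library(Matrix)

set.seed(42)
Xc <- rsparsematrix(2000, 1000, density = 0.01)  # dgCMatrix

# Storage footprint of each representation of the same data:
print(object.size(Xc), units = "MB")                       # i + p + x
print(object.size(as(Xc, "TsparseMatrix")), units = "MB")  # full i and j per entry
print(object.size(as(Xc, "RsparseMatrix")), units = "MB")  # j + p + x
print(object.size(as.matrix(Xc)), units = "MB")            # dense: every cell
```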