LTLA / batchelor

Clone of the Bioconductor repository for the batchelor package.
https://bioconductor.org/packages/devel/bioc/html/batchelor.html

Problem with mixed DelayedMatrix & dgCMatrix data type #31

Open ycl6 opened 3 years ago

ycl6 commented 3 years ago

I have a list of SingleCellExperiment objects from 10X and non-10X experiments, and the counts and logcounts assays have a mixture of DelayedMatrix (10X) & dgCMatrix (non-10X) data types.

In this case, fastMNN appears to enter a never-ending loop and does not complete even after a very long time. However, fastMNN runs correctly and very quickly if I explicitly convert the logcounts assays from DelayedMatrix to dgCMatrix so that all inputs have the dgCMatrix data type.

R version 4.0.3 (2020-10-10)
batchelor_1.6.3

LTLA commented 3 years ago

I assume that you actually have TENxMatrix objects. (seed(assay(sce)) should tell you the type of the seed.)

If so, the difference you observe makes sense. The TENxMatrixSeed objects are file-backed; they do not hold any data in memory, but rather, fetch data from file on request. This is moderately-to-very slow depending on how fast your disk is. By comparison, the dgCMatrix is completely in memory so access is much faster, albeit at the cost of using more memory.

If you have memory to spare, then conversion to dgCMatrix objects is the correct approach. The file-backed representations are intended for very large matrices that do not fit easily into memory. In such cases, it is necessary to choose PCA algorithms that involve fewer reads from disk - for example, randomized PCA, as used here.
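As a minimal sketch of the conversion described above (not taken from the thread): the toy matrices and the helper name to_dgCMatrix are hypothetical, and an in-memory DelayedMatrix stands in for a file-backed TENxMatrix. The coercion as(x, "dgCMatrix") is the DelayedArray method for DelayedMatrix objects; for a real file-backed matrix it reads the whole dataset into memory.

```r
library(Matrix)
library(DelayedArray)

# Toy stand-ins for two batches' log-expression matrices (hypothetical data).
set.seed(1)
batch_mem <- rsparsematrix(200, 30, density = 0.1)                     # dgCMatrix
batch_delayed <- DelayedArray(rsparsematrix(200, 40, density = 0.1))   # DelayedMatrix

# Hypothetical helper: realize a DelayedMatrix as an in-memory dgCMatrix,
# and leave anything else untouched.
to_dgCMatrix <- function(x) {
    if (is(x, "DelayedMatrix")) as(x, "dgCMatrix") else x
}

batches <- lapply(list(batch_mem, batch_delayed), to_dgCMatrix)

# Every batch should now report the same in-memory class.
print(vapply(batches, function(x) class(x)[1], character(1)))
```

With SingleCellExperiment inputs, the same helper would presumably be applied to each logcounts assay before the objects are passed to fastMNN.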

Mind you, I don't use the 10X HDF5 matrices much, so they may also just be inherently slow. Column access should be fairly efficient but row access will be pretty painful; that's just how the compressed-sparse-column format works.
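To illustrate why column access is cheap and row access is expensive in the compressed-sparse-column layout, here is a small sketch using the slots of a dgCMatrix from the Matrix package. The helpers get_column and get_row are hypothetical and written only for illustration; real code would simply use m[, j] or m[i, ].

```r
library(Matrix)

# 3 x 3 sparse matrix with rows (1,0,4), (0,0,5), (2,3,0).
m <- sparseMatrix(i = c(1, 3, 3, 1, 2),
                  j = c(1, 1, 2, 3, 3),
                  x = c(1, 2, 3, 4, 5),
                  dims = c(3, 3))

# Column j occupies one contiguous run of the @i and @x slots,
# delimited by the column pointers m@p[j] and m@p[j + 1].
get_column <- function(m, j) {
    idx <- m@p[j] + seq_len(m@p[j + 1] - m@p[j])
    out <- numeric(nrow(m))
    out[m@i[idx] + 1] <- m@x[idx]   # @i holds zero-based row indices
    out
}

# Row r has no contiguous run: every stored row index must be scanned.
get_row <- function(m, r) {
    col_of_entry <- rep(seq_len(ncol(m)), diff(m@p))
    hit <- m@i == (r - 1)
    out <- numeric(ncol(m))
    out[col_of_entry[hit]] <- m@x[hit]
    out
}

print(get_column(m, 3))
print(get_row(m, 1))
```

For a file-backed matrix, the scan in get_row translates into reading essentially all of the stored values from disk, whereas get_column touches only one slice.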

ycl6 commented 3 years ago

Hi @LTLA

seed(assay(sce)) confirms that the objects from the 10X experiments are sparse matrices of class TENxMatrixSeed and type "integer". The SingleCellExperiment objects were created from HDF5 files using the read10xCounts function, so the assays are DelayedMatrix objects.

I manually put together the SingleCellExperiment object for the non-10X dataset using the SingleCellExperiment function. In this case, the read count matrix was converted to the dgCMatrix data type and passed to the assays argument.

The fastMNN function runs normally when all the input objects have the same data type, i.e. fastMNN runs smoothly when the logcounts assays (which fastMNN uses) are all DelayedMatrix or all dgCMatrix. It fails to complete only when the inputs are a mixture of DelayedMatrix and dgCMatrix types.