Open ycl6 opened 3 years ago
I assume that you actually have `TENxMatrix` objects. (`seed(assay(sce))` should tell you the type of the seed.)
If so, the difference you observe makes sense. The `TENxMatrixSeed` objects are file-backed; they do not hold any data in memory, but rather, fetch data from file on request. This is moderately-to-very slow depending on how fast your disk is. By comparison, the `dgCMatrix` is completely in memory so access is much faster, albeit at the cost of using more memory.
If you have memory to spare, then conversion to `dgCMatrix` objects is the correct approach. The file-backed representations are intended for very large matrices that do not fit easily into memory. In such cases, it is necessary to choose PCA algorithms that involve fewer reads from disk - for example, randomized PCA, as used here.
Mind you, I don't use the 10X HDF5 matrices much, so they may also just be inherently slow. Column access should be fairly efficient but row access will be pretty painful; that's just how the compressed-sparse-column format works.
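To make the column-versus-row point concrete, here is a minimal sketch using an in-memory `dgCMatrix` from the Matrix package. The helpers `get_column` and `get_row` are hypothetical names written for illustration only; they are not part of Matrix or of any 10X reader. They just show how the compressed-sparse-column slots (`p`, `i`, `x`) are laid out.

```r
library(Matrix)

# Small 4 x 3 sparse matrix, stored column by column (compressed sparse column).
m <- sparseMatrix(
    i = c(1, 3, 2, 1, 4),   # row indices (1-based)
    j = c(1, 1, 2, 3, 3),   # column indices (1-based)
    x = c(5, 2, 7, 9, 1),
    dims = c(4, 3)
)

# Column access: the pointer slot `p` gives the exact range of entries to read.
get_column <- function(mat, j) {
    start <- mat@p[j] + 1L      # first stored entry of column j (1-based position)
    end <- mat@p[j + 1L]        # last stored entry of column j
    out <- numeric(nrow(mat))
    if (end >= start) {
        idx <- start:end
        out[mat@i[idx] + 1L] <- mat@x[idx]
    }
    out
}

# Row access: no pointer exists for rows, so every stored row index is scanned.
get_row <- function(mat, i) {
    hits <- which(mat@i == i - 1L)          # scan all stored row indices (0-based)
    cols <- findInterval(hits - 1L, mat@p)  # map each hit back to its column
    out <- numeric(ncol(mat))
    out[cols] <- mat@x[hits]
    out
}

col2 <- get_column(m, 2)
row1 <- get_row(m, 1)
```

Fetching a column touches only a contiguous slice of `i` and `x`, whereas fetching a row has to look through all of the stored entries. With a file-backed matrix, that scan presumably turns into reads from disk, which would explain why row access is so much more painful.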
Hi @LTLA
`seed(assay(sce))` confirms that the objects from the 10X experiments are sparse matrices of class `TENxMatrixSeed` and type "integer". The `SingleCellExperiment` objects were created from HDF5 files using the `read10xCounts` function, so `DelayedMatrix` would be the data type.
I manually put together the `SingleCellExperiment` object for the non-10X dataset using the `SingleCellExperiment` function. In this case, the read count matrix was converted to the `dgCMatrix` data type and passed to the `assays` parameter.
The `fastMNN` function runs normally when all the input objects are of the same type, i.e. `fastMNN` runs smoothly when the `logcounts` (which `fastMNN` uses) are either all `DelayedMatrix` or all `dgCMatrix`. It fails to complete only when the inputs have a mixture of `DelayedMatrix` and `dgCMatrix` types.
I have a list of `SingleCellExperiment` objects from 10X and non-10X experiments, and the `counts` and `logcounts` assays have a mixture of `DelayedMatrix` (10X) and `dgCMatrix` (non-10X) data types. In this case, I noticed that `fastMNN` goes into some kind of never-ending loop and does not complete even after a very long time. However, `fastMNN` runs perfectly and very quickly if I specifically change the `logcounts` assays from `DelayedMatrix` to `dgCMatrix` so that all the inputs have the `dgCMatrix` data type.
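For reference, the workaround described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated, a `DelayedArray` wrapped around an ordinary matrix stands in for the HDF5-backed 10X assay, and `make_sce` and `to_dgCMatrix` are hypothetical helper names, not functions from any package.

```r
library(SingleCellExperiment)
library(DelayedArray)
library(Matrix)

set.seed(1)
# Toy log-expression values: 20 genes x 10 cells (simulated, not real data).
dense <- matrix(log2(rpois(200, 1) + 1), nrow = 20)

make_sce <- function(mat) SingleCellExperiment(assays = list(logcounts = mat))
sce_list <- list(
    tenx_like = make_sce(DelayedArray(dense)),          # stands in for a 10X HDF5-backed assay
    non_tenx  = make_sce(Matrix(dense, sparse = TRUE))  # in-memory sparse matrix
)

# Hypothetical helper: realize the logcounts assay in memory as a dgCMatrix.
to_dgCMatrix <- function(sce) {
    logcounts(sce) <- as(logcounts(sce), "dgCMatrix")
    sce
}

# Before: a mixture of assay classes.
classes_before <- vapply(sce_list, function(x) class(logcounts(x))[1], character(1))

sce_list <- lapply(sce_list, to_dgCMatrix)

# After: every logcounts assay has the same in-memory class.
classes_after <- vapply(sce_list, function(x) class(logcounts(x))[1], character(1))
```

The harmonized list can then be handed to `fastMNN`, for example with `do.call(batchelor::fastMNN, sce_list)`. Note that this realizes every assay in memory, so it is only an option when the matrices fit in RAM.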