Working with large matrices

malosreet commented 4 years ago

Hello,

I am interested in using scConsensus to combine supervised (Seurat label transfer) and unsupervised (Seurat graph-based clustering) results into a consensus clustering.

I have run into two issues. Firstly, when trying to run reclusterDEConsensus, I received a "problem too large" error. I suspect that this may be because of a step in the function which converts the sparse matrix to a dense matrix before differential gene expression analysis. I worked around this issue by inputting a subset of the most variable genes (10,000 out of ~30,000 genes) into the reclusterDEConsensus function. This solution seems to have circumvented this particular problem.
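A back-of-the-envelope check of the dense-conversion hypothesis, as a sketch only. The gene counts are the approximate figures from this post and the cell count of roughly 80,000 is the one reported further down. Attributing "problem too large" to R's 32-bit integer limit is an assumption here; it has not been confirmed against the scConsensus source.

```r
# Approximate dimensions from the post (assumed, not exact)
n_genes_all <- 30000
n_genes_hvg <- 10000
n_cells     <- 80000

# Number of entries a dense matrix would have to hold
entries_all <- n_genes_all * n_cells
entries_hvg <- n_genes_hvg * n_cells

# Largest 32-bit integer in R, 2^31 - 1
int_max <- .Machine$integer.max

entries_all > int_max  # does the full gene set exceed the limit?
entries_hvg > int_max  # does the 10,000-gene subset exceed it?

# Memory for the dense 10,000-gene subset at 8 bytes per double, in GiB
entries_hvg * 8 / 1024^3
```

Under these assumptions the full matrix crosses the limit while the 10,000-gene subset does not, which would be consistent with the workaround succeeding.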

Currently, I am getting another error generated later in the processing by hclust, as follows:

Error in stats::hclust(d, method = "ward.D2") :
  size cannot be NA nor exceed 65536

This is most likely because there are around 80,000 cells in my matrix, which is over the limit of 65536.
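For a sense of scale, the sketch below compares the approximate cell count with the limit quoted in the error message and estimates the size of the pairwise distance object that hierarchical clustering needs. The figure of 80,000 cells is the approximate count from this post, and the 8 bytes per entry assumes distances stored as doubles.

```r
n_cells      <- 80000   # approximate cell count from the post
hclust_limit <- 65536   # limit quoted in the stats::hclust error message

n_cells > hclust_limit  # the dataset is over the limit

# A 'dist' object stores one value per unordered pair: n * (n - 1) / 2
n_pairs <- n_cells * (n_cells - 1) / 2

# Approximate memory for those distances at 8 bytes each, in GiB
n_pairs * 8 / 1024^3
```

Even if the size check is bypassed, the distance object alone grows quadratically with the number of cells, so memory may remain a constraint.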

I was wondering if you have any suggestions for scaling up your functions for my dataset.

Thank you!

prabhakarlab commented 4 years ago

Hello @malosreet, thank you for trying out scConsensus! We are currently working on improving scalability of both the DE gene caller and the clustering function, and will be providing a graph-based alternative for larger datasets soon.

Meanwhile, could you downsample your dataset with the SubsetData() function in Seurat? Use the output of plotContingencyTable() as your identity class and set the max.cells.per.ident parameter to an appropriate value. The downsampled dataset can then be used to run reclusterDEConsensus().
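To illustrate what capping the number of cells per identity class amounts to, here is a minimal base-R sketch. It is not Seurat's implementation of SubsetData(). The labels, cell names, and the cap of 100 are all hypothetical stand-ins chosen for the example.

```r
set.seed(1)

# Hypothetical identity labels standing in for the consensus classes
ident <- sample(c("A", "B", "C"), size = 1000, replace = TRUE,
                prob = c(0.7, 0.2, 0.1))
names(ident) <- paste0("cell", seq_along(ident))

max_cells_per_ident <- 100  # illustrative cap, not a recommended value

# For each identity, keep at most max_cells_per_ident randomly chosen cells
keep <- unlist(lapply(split(names(ident), ident), function(cells) {
  if (length(cells) > max_cells_per_ident) {
    sample(cells, max_cells_per_ident)
  } else {
    cells
  }
}), use.names = FALSE)

# Class sizes after downsampling: none exceeds the cap
table(ident[keep])
```

Large classes are reduced to the cap while small classes are kept whole, so rare populations are not lost in the downsampled set.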

malosreet commented 4 years ago

Hello @prabhakarlab,

Thank you for your suggestion. I was able to move past the hclust error by modifying line #302 of scConsensus.R as follows:

cellTree = fastcluster::hclust(d, method = "ward.D2")

Simply using fastcluster's version of hclust seems to resolve the problem. I am still experiencing some issues with scalability in the following steps, but I am trying to resolve them.
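The substitution can be tried on toy data as follows. This is a sketch under stated assumptions: the random matrix, the choice of k = 4, and the helper name hclust_fun are illustrative and not part of scConsensus. It falls back to stats::hclust when the fastcluster package is not installed.

```r
set.seed(1)

# Toy data: 200 observations with 5 features each
x <- matrix(rnorm(200 * 5), nrow = 200)
d <- dist(x)

# Use fastcluster's hclust when available, otherwise the stats version
hclust_fun <- if (requireNamespace("fastcluster", quietly = TRUE)) {
  fastcluster::hclust
} else {
  stats::hclust
}

# Same call signature as in the modified line of scConsensus.R
cellTree <- hclust_fun(d, method = "ward.D2")

# Both versions return an 'hclust' object, so downstream code such as
# cutree() works unchanged
clusters <- cutree(cellTree, k = 4)
table(clusters)
```

Because both functions return an object of class "hclust", the swap needs no changes to the code that consumes cellTree.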

Downsampling does not seem like an appropriate option for me, since I want final cluster labels for all of the cells in my dataset. I look forward to seeing your solutions for scalability to larger datasets.

Malosree

bbbranjan / scConsensus