carmonalab / UCell

Gene set scoring for single-cell data

Breaking for very large objects #40

Open edridgedsouza opened 1 month ago

edridgedsouza commented 1 month ago

Hi, I have an object with 1.1 million cells across 200 samples. I've used Seurat's BPCells integration to minimize the amount of processing done in-memory. However, when I try to run AddModuleScore_UCell, I get the following error:

```
Error in (function (cond) : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Error converting IterableMatrix to dgCMatrix
• dgCMatrix objects cannot hold more than 2^31 non-zero entries
• Input matrix has 2736736780 entries
```
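For reference, the cap comes from the Matrix package indexing dgCMatrix entries with 32-bit integers, which is easy to confirm in R:

```r
# dgCMatrix stores its non-zero indices as 32-bit integers,
# so the hard cap is .Machine$integer.max = 2^31 - 1
.Machine$integer.max                  # 2147483647
2736736780 > .Machine$integer.max     # TRUE: the conversion must fail
```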

Is there a way to bypass this conversion so that it doesn't break on conversion to an in-memory dgCMatrix? In other words, is there a way for UCell to work with very large data stored on disk in the BPCells format?

The alternative solution for me is to split my object into its 200 samples, run UCell on each sample individually, and then combine the metadata results. While UCell is more robust than the default method to changes in dataset composition, the obvious downside is that errors may snowball as the subsets deviate more and more from the full dataset.

What are the options for calculating UCell scores on massive datasets with on-disk processing?

mass-a commented 1 month ago

Hello and thanks for your message.

Currently we don't have an implementation for on-disk processing, such as the one from BPCells. I agree that we should start looking into supporting these kinds of strategies in UCell.

In these cases, what I would do is process one sample at a time (or batches of samples), as you also suggested. Note that because UCell scores are calculated individually for each cell, the results should be identical whether you load one sample or all of them into memory at once; there should therefore be no concern about diverging results.
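A rough sketch of that per-sample loop (assuming a Seurat object `obj` with a metadata column `sample` identifying the samples, and `signatures` as a named list of gene sets; names and details are illustrative, not a tested recipe) could look like this:

```r
library(Seurat)
library(UCell)

# Hypothetical sketch: score each sample separately, then recombine.
# Assumes `obj` has a metadata column "sample" and `signatures` is a
# named list of gene sets, e.g. list(Tcell = c("CD3D", "CD3E", "CD2")).
per_sample_scores <- lapply(
  SplitObject(obj, split.by = "sample"),
  function(chunk) {
    # each chunk should stay below the 2^31 non-zero dgCMatrix limit,
    # so UCell's internal matrix conversion succeeds
    chunk <- AddModuleScore_UCell(chunk, features = signatures)
    # keep only the UCell score columns (suffixed "_UCell" by default)
    chunk@meta.data[, grepl("_UCell$", colnames(chunk@meta.data)), drop = FALSE]
  }
)

# scores are per-cell, so row-binding and re-ordering by barcode is safe
scores <- do.call(rbind, per_sample_scores)
obj <- AddMetaData(obj, metadata = scores[colnames(obj), , drop = FALSE])
```

With 200 samples this keeps each in-memory conversion small; batching a few samples per chunk works the same way.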

Best -m