edridgedsouza opened 1 month ago
Hello and thanks for your message.
Currently we don't have an implementation for on-disk processing, such as the one from BPCells. I agree we should start looking into supporting this kind of strategy in UCell.
In these cases, what I would do is process one sample at a time (or batches of samples), as you also suggested. Note that, because UCell scores are calculated individually for each cell, the results should be identical whether you load one sample or all samples into memory at the same time - so there should be no concern about diverging results.
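A minimal sketch of this per-sample strategy, assuming a Seurat object `obj` with a `sample` metadata column and a placeholder list of gene signatures (both names are assumptions, not from the original thread):

```r
library(Seurat)
library(UCell)

# Placeholder signatures - replace with your own gene sets
signatures <- list(
  Tcell   = c("CD3D", "CD3E", "CD2"),
  Myeloid = c("LYZ", "CD14")
)

# Split the object by sample (assumes a "sample" metadata column)
obj.list <- SplitObject(obj, split.by = "sample")

# Score each sample independently; UCell scores are per-cell,
# so per-sample results should match a single full-object run
obj.list <- lapply(obj.list, function(x) {
  AddModuleScore_UCell(x, features = signatures)
})

# Collect the per-cell score columns and attach them to the full object
score.cols <- paste0(names(signatures), "_UCell")
scores <- do.call(rbind, unname(lapply(obj.list, function(x) x[[score.cols]])))
obj <- AddMetaData(obj, metadata = scores[colnames(obj), , drop = FALSE])
```

Because only the score columns are carried back as metadata, the per-sample Seurat objects can be discarded after each iteration, keeping peak memory to roughly one sample's worth of expression data.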
Best -m
Hi, I have an object with 1.1 million cells across 200 samples. I've used Seurat's BPCells method to minimize the amount of processing done in memory. However, when I try to run `AddModuleScore_UCell`, I get the following error:

Is there a way to bypass this conversion so that it doesn't break when the data gets converted to an in-memory dgCMatrix? i.e., is there a way for UCell to work alongside very large data stored on disk in the BPCells format?
The alternative solution for me is to split my object into 200 different samples, run UCell on each sample individually, and then combine the metadata results. While UCell is more robust than the default method to changes in dataset composition, the obvious downside to this approach is that errors may snowball as the per-sample results deviate more and more from those on the full dataset.
What are the options for calculating UCell scores on massive datasets with on-disk processing?