bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

incorporate with batch-effect correction methods #92

Open Feilijiang opened 5 months ago

Feilijiang commented 5 months ago

Hi Ben,

Thank you for the awesome tool. I have recently started using it to analyze my datasets. However, I am encountering strong batch effects in my data. I was wondering if you have any plans to incorporate Harmony or other batch-effect correction methods into the tool. This feature would be incredibly helpful, especially for large datasets.

Please forgive me if I have overlooked something. Many thanks, and I look forward to your reply.

bnprks commented 5 months ago

Hi @Feilijiang, this is a good question!

Many batch-correction methods, such as Harmony, operate on the PCA matrix rather than the full RNA counts matrix. Disk-backed calculations with BPCells can be quite helpful for calculating the PCA matrix, but once you have the PCA matrix, disk-backed operations are usually not required and you can use tools that work fully in-memory. (Memory usage in R should be about 400MB per million cells assuming 50 PCs stored as 8-byte doubles, meaning you could hold the PCs for a 20M cell dataset in just 8GB of RAM.)
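The memory estimate above can be checked with a few lines of base R. This is only a back-of-the-envelope sketch: the helper name `pca_matrix_bytes` is made up for illustration, and it assumes a dense cells x PCs matrix of doubles, ignoring the small fixed overhead R adds per object.

```r
# Approximate size in bytes of a dense cells x PCs matrix.
# R stores numeric values as 8-byte doubles.
# (pca_matrix_bytes is a hypothetical helper, not part of any package.)
pca_matrix_bytes <- function(n_cells, n_pcs = 50) {
  n_cells * n_pcs * 8
}

# Decimal megabytes for one million cells at 50 PCs
mb_per_million <- pca_matrix_bytes(1e6) / 1e6

# Decimal gigabytes for a 20 million cell dataset at 50 PCs
gb_for_20m <- pca_matrix_bytes(20e6) / 1e9
```

Here `mb_per_million` works out to 400 and `gb_for_20m` to 8, matching the figures quoted above. Harmony and downstream tools will need additional working memory on top of this.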

The good news here is that Harmony already accepts a PCA matrix as input. The examples in the Harmony docs show you can run Harmony directly on a PCA matrix as follows:

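# do_pca = FALSE: the input is already a PCA embedding, so Harmony skips its own PCA step
# return_object = TRUE: return the full Harmony object rather than only the corrected embedding
harmony_object <- HarmonyMatrix(pca_matrix, meta_data, 'dataset',
                                do_pca = FALSE, return_object = TRUE)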

BPCells is unlikely to implement wrapper functions around Harmony since I want to keep the BPCells functionality focused on disk-backed operations, but I'd definitely consider putting up tutorials showing how to use Harmony at the end of a BPCells workflow.

In summary, I'd suggest that you do normalization + PCA using BPCells, then use the Harmony package directly for batch correction once you have the PCA matrix.

If you need help getting to a PCA, I'd suggest either following the steps in the BPCells tutorial or using Seurat's wrappers around BPCells.
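For readers who want to see the suggested workflow end to end, here is a rough sketch that loosely follows the steps of the BPCells introductory tutorial and then hands the PCA matrix to Harmony. Treat it as an illustration under assumptions rather than an official recipe: `pca_then_harmony` and its arguments are hypothetical names, the variable-gene selection is deliberately naive, the scaling divides by the standard deviation, and `meta_data` is assumed to have one row per cell in the same order as the columns of the matrix. Depending on the installed Harmony version, `HarmonyMatrix` may be deprecated in favor of `RunHarmony`, so check the package documentation before relying on it.

```r
# Hypothetical helper, not part of BPCells or Harmony.
# mat:       BPCells matrix of raw counts (genes x cells)
# meta_data: data frame with one row per cell, ordered like the columns of mat
# batch_col: name of the column in meta_data that identifies the batch
pca_then_harmony <- function(mat, meta_data, batch_col,
                             n_genes = 1000, n_pcs = 50) {
  stopifnot(ncol(mat) == nrow(meta_data),
            batch_col %in% colnames(meta_data))

  # Normalize each cell by its total counts, then log-transform
  mat <- BPCells::multiply_cols(mat, 1 / Matrix::colSums(mat))
  mat <- log1p(mat * 10000)

  # Naive variable-gene selection: keep the genes with the highest variance
  stats <- BPCells::matrix_stats(mat, row_stats = "variance")
  gene_order <- order(stats$row_stats["variance", ], decreasing = TRUE)
  top <- sort(head(gene_order, n_genes))

  # Keep the selected genes in memory so the repeated passes made by the
  # SVD do not re-read the full matrix from disk
  mat <- BPCells::write_matrix_memory(mat[top, ], compress = FALSE)

  # Center and scale each gene (assumption: scale by standard deviation)
  gene_means <- stats$row_stats["mean", top]
  gene_sds <- sqrt(stats$row_stats["variance", top])
  mat <- (mat - gene_means) / gene_sds

  # Truncated SVD; multiplying the right singular vectors by the singular
  # values gives a cells x PCs embedding
  svd <- BPCells::svds(mat, k = n_pcs)
  pca <- sweep(svd$v, 2, svd$d, "*")

  # Batch correction on the in-memory PCA matrix, as in the call quoted above
  harmony::HarmonyMatrix(pca, meta_data, batch_col, do_pca = FALSE)
}
```

A call might look like `pca_then_harmony(mat, meta_data, "dataset")`, where `mat` could come from `BPCells::open_matrix_dir()`. Only the PCA step touches the disk-backed matrix; the Harmony step works on an ordinary dense matrix in memory.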