Open Feilijiang opened 5 months ago
Hi @Feilijiang, this is a good question!
Many batch-correction methods such as Harmony operate on the PCA matrix, not the full RNA counts matrix. Disk-backed calculations with BPCells can be quite helpful for calculating the PCA matrix, but once you have the PCA matrix, disk-backed operations are usually not required and you can use tools that work fully in-memory. (Memory usage in R should be about 400MB per million cells assuming 50 PCs, meaning you could store the PCs for a 20M cell dataset in just 8GB of RAM.)
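As a rough check of that estimate, here is a small sketch of the arithmetic. It assumes the PCA matrix is stored as a dense matrix of double-precision values (8 bytes each, R's default numeric type); `pca_memory_gb` is a hypothetical helper name used only for illustration, and R's small per-object overhead is ignored.

```r
# Approximate RAM needed for a dense cells x PCs matrix of doubles.
# Assumption: 8 bytes per value; object header overhead is ignored.
pca_memory_gb <- function(n_cells, n_pcs = 50, bytes_per_value = 8) {
  n_cells * n_pcs * bytes_per_value / 1e9
}

per_million <- pca_memory_gb(1e6)    # about 0.4 GB per million cells
twenty_million <- pca_memory_gb(20e6) # about 8 GB for 20M cells
```

The same function can be used to see how the footprint changes with a different number of PCs, e.g. `pca_memory_gb(20e6, n_pcs = 30)`.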
The good news here is that Harmony already accepts a PCA matrix as input. The examples in the Harmony docs show you can run Harmony directly on a PCA matrix as follows:
    harmony_object <- HarmonyMatrix(pca_matrix, meta_data, 'dataset',
                                    do_pca=FALSE, return_object=TRUE)
BPCells is unlikely to implement wrapper functions around Harmony since I want to keep the BPCells functionality focused on disk-backed operations, but I'd definitely consider putting up tutorials showing how to use Harmony at the end of a BPCells workflow.
In summary, I'd suggest that you do normalization + PCA using BPCells, then use the Harmony package directly for batch correction once you have the PCA matrix.
If you need help getting to a PCA matrix, I'd suggest either following the steps in the BPCells tutorial or using Seurat's wrappers around BPCells.
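For illustration, the suggested workflow (normalization + PCA in BPCells, then Harmony on the PCA matrix) might look roughly like the sketch below. This is not an official recipe: the toy count matrix, the batch labels, the choice of 100 variable genes and 20 PCs are all made-up assumptions so the example is self-contained, and the normalization steps are loosely modeled on the BPCells tutorial. In a real analysis the matrix would typically come from disk (e.g. via `open_matrix_dir()`), and `meta_data` would hold your actual batch column.

```r
library(BPCells)
library(Matrix)

set.seed(1)
# Hypothetical toy data: 200 genes x 300 cells with Poisson-like counts
counts <- rsparsematrix(200, 300, density = 0.2,
                        rand.x = function(n) rpois(n, 3) + 1)
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))
colnames(counts) <- paste0("cell", seq_len(ncol(counts)))
# Hypothetical batch labels, one per cell
meta_data <- data.frame(dataset = rep(c("batch1", "batch2"), each = 150))

# Wrap as a BPCells matrix (a stand-in for a disk-backed matrix)
mat <- as(as(counts, "dgCMatrix"), "IterableMatrix")

# Normalize by counts per cell, then log-transform
mat <- multiply_cols(mat, 1 / colSums(mat))
mat <- log1p(mat * 10000)

# Keep the most variable genes (100 is an arbitrary choice for the toy data)
stats <- matrix_stats(mat, row_stats = "variance")
variable_genes <- sort(head(order(stats$row_stats["variance", ],
                                  decreasing = TRUE), 100))
# Save the normalized subset so later passes do not redo the normalization
mat_norm <- write_matrix_dir(mat[variable_genes, ], tempfile("mat"))

# Center and scale each gene; still evaluated lazily by BPCells
gene_means <- stats$row_stats["mean", variable_genes]
gene_sds <- sqrt(stats$row_stats["variance", variable_genes])
mat_norm <- (mat_norm - gene_means) / gene_sds

# PCA via truncated SVD; the result is an ordinary in-memory cells x PCs matrix
svd <- BPCells::svds(mat_norm, k = 20)
pca_matrix <- multiply_cols(svd$v, svd$d)

# From here on everything is in-memory: run Harmony on the PCA matrix
corrected <- harmony::HarmonyMatrix(pca_matrix, meta_data, "dataset",
                                    do_pca = FALSE)
```

The only disk-backed steps are the ones up to `svds()`; the Harmony call sees a plain dense matrix, which is why no BPCells-specific wrapper is needed.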
Hi Ben,
Thank you for the awesome tool. I have recently started using it to analyze my datasets. However, I am encountering strong batch effects in my data. I was wondering if you have any plans to incorporate Harmony or other batch-effect correction methods into the tool. This feature would be incredibly helpful, especially for large datasets.
Please forgive me if I have overlooked something. Many thanks, and I look forward to your reply.