dylkot / cNMF

Code and example data for running Consensus Non-negative Matrix Factorization on single-cell RNA-Seq data
MIT License

Batch correction #88

Closed: RunyuXia closed this issue 1 month ago

RunyuXia commented 1 month ago

Hi!

I've noticed that the preprocessing step of the updated batch correction method outputs only a counts h5ad file containing the highly variable genes. For my NMF analysis, I want to include as many genes as possible. Could you advise whether it is possible to retain batch-corrected counts for all genes?

Thanks!

dylkot commented 1 month ago

Hi @RunyuXia, I think in theory it would work to set the number of highly variable genes equal to the number of genes in your dataset. Note that the data are variance-normalized as part of batch correction, so this won't work if you want to keep the data in units of counts rather than variance-normalized counts.

Overall, though, I don't think this is recommended. If you variance-normalize low-variance genes, they contribute the same amount of signal to the principal components as the high-variance genes do, so they swamp out the real signal in the data. If you are looking for a general batch correction approach that operates directly on counts without normalization, perhaps mutual nearest neighbors (MNN) or ComBat is what you are looking for.
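
To illustrate the swamping effect described above, here is a minimal NumPy sketch on simulated data. It does not use cNMF or its preprocessing code; the matrix, the gene counts, and the helper `signal_share` are all made up for illustration. It assumes one informative high-variance gene plus many low-variance noise genes, and compares how much of the total variance the informative gene accounts for before and after every gene is scaled to unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_noise_genes = 500, 200  # arbitrary sizes chosen for the illustration

# One informative gene: two cell groups with clearly different expression.
groups = np.repeat([0, 1], n_cells // 2)
signal = 10.0 * groups + rng.normal(0.0, 1.0, n_cells)

# Many low-variance genes that carry only noise.
noise = rng.normal(0.0, 0.1, (n_cells, n_noise_genes))

# Cells in rows, genes in columns; column 0 is the informative gene.
X = np.column_stack([signal, noise])


def signal_share(matrix):
    """Fraction of the total variance contributed by column 0 (hypothetical helper)."""
    var = matrix.var(axis=0)
    return var[0] / var.sum()


# Before scaling, the informative gene dominates the total variance.
share_raw = signal_share(X)

# Scale every gene to unit variance (no mean-centering).
X_scaled = X / X.std(axis=0)

# After scaling, every gene has variance 1, so the informative gene
# accounts for only 1 / (number of genes) of the total.
share_scaled = signal_share(X_scaled)

print(f"share of variance from the informative gene, raw:    {share_raw:.3f}")
print(f"share of variance from the informative gene, scaled: {share_scaled:.3f}")
```

In this toy setup the informative gene goes from carrying most of the variance to carrying the same share as each noise gene, which is why keeping only the highly variable genes tends to give cleaner components than scaling all genes.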