Data centering before running NMF

ysbioinfo commented 5 months ago

Hi GeneNMF developers, thanks for creating such an awesome tool. I am now trying to use it on my single-cell data, and I have a question about data processing before NMF. According to Gavish et al (2022 Nature), the count matrix was first log-normalized by library size and then centered on each gene (i.e. df - rowMeans(df)). The negative values produced by centering were set to zero, and then NMF was performed. This process seems quite odd to me, because the centering step will make ~50% of the values negative, which means ~50% of the information will be lost in downstream analysis. Later I found that almost all of Itay Tirosh's work runs NMF this way, e.g. PMID: 38653236, PMID: 29198524, PMID: 37258682. I know you are experts on NMF, so what I want to know is:

Does GeneNMF center the data before running NMF?
Why does all of Tirosh's work prefer centering before NMF? What is the benefit of centering? It also seems strange that they only center without scaling.
Centering or not: which approach do you recommend?

Thanks in advance. Yang
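
For readers following the thread, the preprocessing described in the question (log-normalize by library size, center each gene, set negatives to zero) can be sketched as below. This is a minimal, dependency-free illustration and not the code used by GeneNMF or by the cited papers: the function names, the `scale_factor` of 10,000, and the toy count matrix are all assumptions made for the example. The matrix is laid out as genes x cells, matching the `df - rowMeans(df)` expression above.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Divide each cell (column) by its library size, rescale, then log1p.

    counts: genes x cells matrix given as a list of rows (one row per gene).
    scale_factor is an assumed value, not one stated in the thread.
    Assumes every cell has at least one count (no zero library sizes).
    """
    n_cells = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_cells)]
    return [
        [math.log1p(row[j] / lib_sizes[j] * scale_factor) for j in range(n_cells)]
        for row in counts
    ]


def center_and_clip(mat):
    """Subtract each gene's mean (row mean), then set negative values to zero."""
    out = []
    for row in mat:
        mean = sum(row) / len(row)
        out.append([max(v - mean, 0.0) for v in row])
    return out


# Toy counts: 3 genes (rows) x 4 cells (columns), made up for illustration.
counts = [
    [0, 3, 0, 10],
    [5, 5, 5, 5],
    [1, 0, 0, 0],
]

# Non-negative matrix that would then be handed to an NMF routine.
nmf_input = center_and_clip(log_normalize(counts))
```

The result keeps the original shape and contains no negative entries, which is the only hard requirement NMF places on its input.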

mass-a commented 5 months ago

Hello Yang, thank you for your comments and interest in the tool. The short answer is yes: GeneNMF also centers the data by default (see the center and scale parameters of multiNMF()). We decided to implement it this way in the first version of GeneNMF simply to align it with what appears to be the standard practice in several seminal papers (not only those you mentioned, but also e.g. PMID: 35931863). My intuition is that centering and then removing negative values dampens the effect of very lowly expressed genes, increasing the sparsity of the data matrix and in a way forcing the factorization to focus on highly expressed genes. But I definitely agree with you that this is not necessarily optimal, and there are probably better ways to prepare the data prior to NMF. This is one of several aspects where we plan to improve GeneNMF in the next iterations of the method.

Best -m

ysbioinfo commented 5 months ago

Hi Mass, thanks for your answer. Your intuition makes a lot of sense. I also just realized that single-cell data is so sparse that centering and removing negative values might not lose too much information. Looking forward to further improvements to GeneNMF.
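
The sparsity point can be checked on toy data. The sketch below is only an illustration under assumed values (both gene vectors and all function names are invented for this example): it centers a single gene, clips negatives to zero, and reports what fraction of the originally nonzero entries is lost.

```python
def center_clip_gene(values):
    """Subtract the gene's mean across cells, then set negatives to zero."""
    mean = sum(values) / len(values)
    return [max(v - mean, 0.0) for v in values]


def lost_nonzero_fraction(values):
    """Fraction of originally nonzero entries that are zero after clipping."""
    clipped = center_clip_gene(values)
    kept_flags = [c != 0 for v, c in zip(values, clipped) if v != 0]
    return kept_flags.count(False) / len(kept_flags)


# Hypothetical sparse gene: detected in only 2 of 20 cells.
sparse_gene = [0.0] * 18 + [1.2, 2.0]

# Hypothetical densely expressed gene: nonzero in every cell.
dense_gene = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0] * 2

print(lost_nonzero_fraction(sparse_gene))  # prints 0.0
print(lost_nonzero_fraction(dense_gene))   # prints 0.5
```

In this toy case the sparse gene loses none of its nonzero entries, because the entries pushed below zero were already zeros, while the dense gene loses half. Real data will fall somewhere in between depending on how each gene is distributed.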

Best Yang

carmonalab / GeneNMF

Data centering before running NMF #7