carmonalab / GeneNMF

Methods to discover gene programs on single-cell data

Pre-selecting genes prior to NMF decomposition #9

Open tjbencomo opened 3 months ago

tjbencomo commented 3 months ago

Hi - great piece of software, and glad this NMF approach, now used in many papers, has finally been made into an easy-to-use API.

I noticed that the multiNMF function has an argument, nfeatures, which specifies how many genes to use when running NMF. It looks like this function finds a set of HVGs using Seurat's FindVariableFeatures function and then filters the HVGs down to the number set by the user in nfeatures.

This was interesting to me because I believe that in the Tirosh lab paper (Gavish et al. 2024), and in others from that group, they do not use HVGs to pre-select genes before NMF. Instead, they retain only the top 7,000 genes with the highest average expression (see the Gene Filtering section in Gavish 2024).

When I previously played around with Gavish's approach, picking a subset of genes for NMF analysis based on HVGs vs. average expression seemed to produce significantly different programs. To the best of my knowledge, HVG selection usually uses some variance metric rather than an average expression metric.
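To make the difference concrete, here is a minimal sketch on a toy matrix. The gene names and values are made up for illustration, and raw variance stands in for a real HVG method (Seurat's FindVariableFeatures standardizes variance against a mean-variance trend, which this does not reproduce). It is not the filtering code from GeneNMF or from Gavish et al.

```python
from statistics import mean, pvariance

# Toy expression matrix (hypothetical values): gene -> expression across 6 cells
expr = {
    "HOUSEKEEPING": [9.0, 9.1, 8.9, 9.0, 9.2, 8.8],  # high in every cell
    "RIBOSOMAL":    [7.0, 7.2, 6.9, 7.1, 7.0, 6.8],  # high in every cell
    "MARKER_A":     [0.0, 0.0, 0.0, 5.0, 6.0, 5.5],  # on in one subset of cells
    "MARKER_B":     [4.0, 4.5, 0.0, 0.0, 0.0, 0.0],  # on in another subset
    "SILENT":       [0.0, 0.1, 0.0, 0.0, 0.1, 0.0],  # barely expressed
}

def top_genes(scores, n):
    """Return the n gene names with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Criterion 1: average expression (the style of filter described for Gavish et al.)
mean_scores = {gene: mean(values) for gene, values in expr.items()}
# Criterion 2: raw variance across cells (a crude stand-in for HVG selection)
var_scores = {gene: pvariance(values) for gene, values in expr.items()}

by_mean = top_genes(mean_scores, 2)
by_var = top_genes(var_scores, 2)

print("top by mean:    ", by_mean)
print("top by variance:", by_var)
```

On this toy data the two criteria pick disjoint gene sets: the mean-based filter keeps the uniformly high genes, while the variance-based filter keeps the genes that differ between groups of cells. Real data will overlap more, but this is the kind of divergence I mean.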

Can you comment on your decision to use an HVG approach to subset the expression matrix rather than average expression? I would be curious to know if you compared the two approaches, or even considered not filtering out any genes before NMF analysis (I know the RcppML paper mentions that their speed improvements mean the entire gene set (36k genes) can be analyzed).

mass-a commented 3 months ago

Hello Tomas, thanks a lot for your interest in our tool.

I admit we didn't thoroughly evaluate different strategies for feature selection before NMF decomposition, so it's possible that the current approach based on highly variable genes (HVG) may not be the optimal solution. The rationale for going with HVGs comes from other basic analyses for single-cell data (e.g. dimensionality reduction, clustering, ...) where HVGs have been shown to be more informative than just the most highly expressed genes. The intuition is that if a gene is highly expressed in all of your cells, it won't help you distinguish interesting biological variability in the data. Ideally one would want to provide the full data matrix and let the method choose the relevant genes, but in our empirical experience it's beneficial to put the algorithm on the right track with a reasonable feature selection step prior to factorization. Again, this (and other parameters) should be carefully evaluated as we continue to develop GeneNMF. Thanks again for your input!