dylkot / cNMF

Code and example data for running Consensus Non-negative Matrix Factorization on single-cell RNA-Seq data
MIT License

Discrepancies in 'k' Output: Command Line vs. Python Environment #75

Open mluciarr opened 11 months ago

mluciarr commented 11 months ago

Hi Dylan!

Yesterday I ran the cNMF in the terminal (Mac M1) and everything went smoothly until the final step, where I encountered an unusual error:

(py37) Mac-Studio-de-Maria:tables marialuciaromerorivero$ cnmf k_selection_plot --output-dir ./1_cNFM_trials --name 1_cNFM_trials

(py37) Mac-Studio-de-Maria:tables marialuciaromerorivero$ cnmf consensus --output-dir ./1_cNFM_trials --name 1_cNFM_trials --components 11 --local-density-threshold 2 --show-clustering

Traceback (most recent call last):
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/bin/cnmf", line 8, in <module>
    sys.exit(main())
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/cnmf/cnmf.py", line 1050, in main
    close_clustergram_fig=True)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/cnmf/cnmf.py", line 765, in consensus
    spectra_tpm = self.refit_spectra(tpm.X, norm_usages.astype(tpm.X.dtype))
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/cnmf/cnmf.py", line 680, in refit_spectra
    return(self.refit_usage(X.T, usage.T).T)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/cnmf/cnmf.py", line 658, in refit_usage
    _, rf_usages = self._nmf(X, nmf_kwargs=refit_nmf_kwargs)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/cnmf/cnmf.py", line 551, in _nmf
    (usages, spectra, niter) = non_negative_factorization(X, **nmf_kwargs)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py", line 1107, in non_negative_factorization
    W, H, n_iter = est._fit_transform(X, W=W, H=H, update_H=update_H)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py", line 1597, in _fit_transform
    W, H = self._check_w_h(X, W, H, update_H)
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py", line 1470, in _check_w_h
    _check_init(H, (self._n_components, n_features), "NMF (input H)")
  File "/Users/marialuciaromerorivero/miniconda3/envs/py37/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py", line 58, in _check_init
    % (whom, shape, np.shape(A))
ValueError: Array with the wrong shape passed to NMF (input H). Expected (11, 224758), but got (11, 224759) 
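The ValueError comes from scikit-learn's shape check on the fixed spectra matrix H during the refit step: with `update_H=False`, H must be exactly (n_components, n_features), and here the gene dimension is off by one (224758 vs. 224759), which usually means the TPM matrix and the spectra ended up with slightly different gene sets. A minimal sketch that reproduces the same class of error with random data and made-up dimensions (exact message wording may vary across scikit-learn versions):

```python
import numpy as np
from sklearn.decomposition import non_negative_factorization

rng = np.random.default_rng(0)
k, n_cells, n_genes = 11, 50, 100

X = rng.random((n_cells, n_genes))
# One extra gene column, mimicking the off-by-one mismatch above.
H_bad = rng.random((k, n_genes + 1))

err_msg = ""
try:
    # update_H=False fixes H, so sklearn validates its shape up front.
    non_negative_factorization(X, H=H_bad, n_components=k, update_H=False)
except ValueError as e:
    err_msg = str(e)

print(err_msg)
```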

Because of that, I ran cNMF from a Python session instead, which worked, except that the k_selection_plot used to select the optimal 'k' is completely different from the one I obtained in the terminal. I used exactly the same parameters in both runs, yet the plots look completely different. Here they are:

Terminal k_selection_plot: shows the optimal 'k' as 11 [k-selection plot attached]

Python environment k_selection_plot: shows the optimal 'k' as 7 [k-selection plot attached]

1) Why is there such a significant difference between the results?
2) Which of them should I trust?

Thank you very much in advance! Looking forward to your reply :)

Lucia

LiuCanidk commented 5 months ago

@mluciarr Sorry to interrupt, but could you tell me how to determine the optimal value from the k-selection plot? Is it the first peak in the stability curve? [k-selection plot attached] Weighing the trade-off between error and stability, should I choose 5, 6, or 7 as the optimal value? The attached plot is from a bulk RNA-seq dataset.

mluciarr commented 5 months ago

Hi @LiuCanidk ,

Well, looking at your results I wouldn't be confident in any of the three options you mention. If I were you, I would run cNMF at all three resolutions and see which one best fits what you expect, or which makes the most sense for your data. I would start with 5, since it has slightly higher stability, and then see what happens as you increase the number; I suspect the results won't change much, since the three values are consecutive. I may not be helping much, but in your case it is quite tricky to pin down the optimal number of factors. Let me know if you try it and see large differences; I'm intrigued.

Regards.

Lucia

LiuCanidk commented 5 months ago

Hi @mluciarr, thanks for your reply. I didn't try many k values; I picked 7 as the final choice and found it worked fine. Although I can't show those results here, I noticed other differences in another single-cell project using cNMF. As the selection plot shows, the error always decreases as k grows, but the trend in stability is not monotonic. So I think the key is to select for high stability while keeping the error low.

[k-selection plot attached] The selection plot shows that k=3 has the highest stability and also lower error than k=2; k=5 is the first point where stability rises again, again with lower error; and k=7 and k=10 each show an uptick in stability relative to the previous value. So I believe the key is to pick the points where stability increases (the curve is non-monotonic). I then picked 3, 5, 7, and 10 and compared their clustering plots.
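A tiny sketch of that heuristic: keep every k where stability rises relative to the previous k. The stability numbers below are made up purely for illustration, not real cNMF output:

```python
import numpy as np

def candidate_ks(ks, stability):
    """Return the k values where stability increases relative to the
    previous k (local upticks in a non-monotonic stability curve)."""
    ks = np.asarray(ks)
    stability = np.asarray(stability)
    upticks = np.diff(stability) > 0  # True where stability went up
    return ks[1:][upticks].tolist()

# Made-up stability values shaped like the curve described above.
ks = [2, 3, 4, 5, 6, 7, 8, 9, 10]
stability = [0.60, 0.75, 0.55, 0.70, 0.62, 0.68, 0.58, 0.52, 0.61]
print(candidate_ks(ks, stability))  # → [3, 5, 7, 10]
```

In practice you would still inspect the error curve and the consensus clustergrams for each candidate, as described above; the uptick rule only narrows the shortlist.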

[consensus clustering plots for k=3, 5, 7, and 10 attached] As you can see, k=3 formed the most robust clusters. k=5 also worked fine (after adjusting the filtering-threshold parameter). k=7 shows slight discrepancies across iterations/spectra (n_iter=100; here 19% of spectra were filtered, although the tutorial mentions a default of filtering 30%), and k=10 has the worst clustering result, which is clearly unacceptable (I changed the local-density-threshold parameter but saw no change in this discrepancy pattern, though it can appear in different clusters). So I chose k=5 and k=7, and will run GSEA to annotate the resulting gene expression programs and check whether they make sense.

I haven't shown the downstream results here because the GSEA results and the program-usage distribution across cells were not ideal, which I suspect reflects single-cell quality issues. Hope this helps.

Thank you again for your advice.

dylkot commented 4 months ago

Yes, I'll just add that choosing K is hard and I recommend to look at the results for a few values of K (as you would do with clustering). Usually only one or two GEPs change at the margin while the majority remain pretty stable. So I recommend exploring what GEPs are changing with the different values of K. I also think GSEA can only help to some extent because often the gene sets available to analyze the programs don't actually tell us what the programs are. So I also recommend looking at the top weighted genes in the gene_spectra_score output.