mluciarr opened this issue 1 year ago
@mluciarr Sorry to interrupt, but could you tell me how to determine the optimal value from the plotting result? Is it the first peak in the stability curve? Considering the trade-off between error and stability, should I choose 5, 6, or 7 as the optimal value? The plot above is my result from a bulk RNA-seq dataset.
Hi @LiuCanidk,
Well, looking at your results I wouldn't commit to just one of the three options you mention. If I were you, I would run it at all three values and see which one best fits what you want to see, or which makes the most sense for your data. I would start with 5, since its stability looks slightly higher, and then see what happens as you increase the number; I bet the results won't change significantly, since the three values are consecutive. I may not be helping you much, but in your case it is quite tricky to confirm the optimal number of factors. Let me know if you try it and see many differences. I'm intrigued.
Regards.
Lucia
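For reference, once the factorization iterations are finished, re-running only the consensus step for several candidate K values is cheap. A minimal sketch using the Python API from the cNMF tutorial; output_dir, name and the density threshold are placeholders:

```python
from cnmf import cNMF

# Assumes prepare(), factorize() and combine() have already been run for this
# output_dir/name; paths and the density threshold below are placeholders.
cnmf_obj = cNMF(output_dir="./cnmf_run", name="my_dataset")

# Build the consensus solution for each candidate K and compare the resulting
# clustergrams, usages and spectra downstream.
for k in (5, 6, 7):
    cnmf_obj.consensus(k=k, density_threshold=0.1, show_clustering=True)
```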
Hi @mluciarr, thanks for your reply. I didn't try many K values; I picked 7 in the end and found that it works fine. Although I can't show you those results, I did find other differences in another single-cell project using cNMF. As the selection plot shows, error definitely decreases as K grows, but the stability trend is not monotonic. So I think the key is to pick the K with the best stability while keeping the error low.
The selection plot above shows that k=3 has the highest stability and a lower error than k=2; k=5 is the first point where stability increases again, also with a lower error; and k=7 and k=10 each show an increase in stability relative to the preceding value. So I believe the key is to look for these local increases in stability (since the trend is non-monotonic). I then picked 3, 5, 7 and 10 to compare their clustering plots.
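To make that heuristic concrete, here is a minimal sketch of flagging candidate K values where stability rises relative to the previous K while the error keeps falling. The numbers in the DataFrame are purely illustrative (chosen to mimic the shape of the curve described above), not real cNMF output:

```python
import pandas as pd

# Illustrative stability/error values read off a k_selection_plot (not real cNMF output).
stats = pd.DataFrame({
    "k":         [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "stability": [0.60, 0.75, 0.55, 0.62, 0.58, 0.61, 0.57, 0.54, 0.56],
    "error":     [1.00, 0.90, 0.85, 0.80, 0.77, 0.74, 0.72, 0.70, 0.69],
}).sort_values("k")

# Candidate K: stability increases vs. the previous K while error still decreases.
candidates = stats[(stats["stability"].diff() > 0) & (stats["error"].diff() < 0)]
print(candidates["k"].tolist())  # -> [3, 5, 7, 10] for these illustrative values
```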
[Consensus clustering plots for k=3, k=5, k=7 and k=10]
As you can see, k=3 formed the most robust clusters. k=5 also worked fine (after adjusting the filtering threshold). k=7 showed slight discrepancies across iterations/spectra (n_iter=100; about 19% of spectra were filtered here, whereas the tutorial mentions filtering around 30% by default), and k=10 gave the worst clustering result, which is clearly unacceptable (I did change the --local-density-threshold parameter, but the discrepancy pattern persisted, just in different clusters). So I chose k=5 and k=7 and ran GSEA to annotate these gene expression programs and check whether they are sensible.
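On the filtering threshold mentioned above: in cNMF the fraction of outlier spectra removed before consensus clustering is controlled by the local density threshold. A small sketch of scanning a few thresholds for one K (same hypothetical paths as the earlier sketch; threshold values chosen only for illustration):

```python
from cnmf import cNMF

# Same hypothetical run as above; thresholds are chosen only for illustration.
cnmf_obj = cNMF(output_dir="./cnmf_run", name="my_dataset")

# A lower local density threshold discards more outlier spectra before the
# consensus clustering; check how the filtered fraction and clustergram change.
for dt in (0.01, 0.05, 0.1, 0.2):
    cnmf_obj.consensus(k=7, density_threshold=dt, show_clustering=True)
```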
I haven't shown the downstream results here because neither the GSEA output nor the program usage distribution across cells was ideal; I suspect the problem lies in the quality of the single cells. Hope this helps.
Thank you again for your advice.
Yes, I'll just add that choosing K is hard, and I recommend looking at the results for a few values of K (as you would with clustering). Usually only one or two GEPs change at the margin while the majority remain pretty stable, so I recommend exploring which GEPs change across the different values of K. I also think GSEA can only help to some extent, because the gene sets available for analyzing the programs often don't actually tell us what the programs are. So I also recommend looking at the top-weighted genes in the gene_spectra_score output.
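Following up on that last suggestion, here is a small sketch of pulling the top-weighted genes per program out of a gene_spectra_score table with pandas. The file path is a placeholder (cNMF derives the name from the run name, K and density threshold), and the orientation check is there because I am assuming, not asserting, that programs are rows and genes are columns:

```python
import pandas as pd

# Path is a placeholder: cNMF writes one gene_spectra_score table per (K, density
# threshold) inside the run directory, named from the run name, K and threshold.
score_fn = "cnmf_run/my_dataset/my_dataset.gene_spectra_score.k_7.dt_0_1.txt"

scores = pd.read_csv(score_fn, sep="\t", index_col=0)
# Assumed orientation: programs (GEPs) as rows, genes as columns; if the file turns
# out to be genes x programs, transpose so rows are programs.
if scores.shape[0] > scores.shape[1]:
    scores = scores.T

# Top 30 weighted genes for each program, as a quick annotation aid alongside GSEA.
top_genes = {gep: scores.loc[gep].nlargest(30).index.tolist() for gep in scores.index}
for gep, genes in top_genes.items():
    print(gep, ", ".join(genes[:10]))
```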
Hi Dylan!
Yesterday I ran cNMF from the terminal (Mac M1) and everything went smoothly until the final step, where I encountered an unusual error:
Because of that, I ran it in Python instead, which worked perfectly, except that the k_selection_plot used to select the optimal K is completely different from the one I obtained in the terminal. I used exactly the same parameters in both approaches, yet the plots look completely different. Here they are:
Terminal k_selection_plot: shows the optimal 'k' as 11
Python environment k_selection_plot: shows the optimal 'k' as 7
1) Why is there such a significant difference in the results? 2) Which of them should I trust?
Thank you very much in advance! Looking forward to your reply :)
Lucia
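As a general note on comparing the two routes: before trusting either plot, it is worth confirming that every parameter, including the random seed, matches exactly between the terminal and Python runs. Below is a rough, parameter-matched sketch; the Python calls follow the cNMF tutorial API, the commented CLI lines are based on the cNMF README (flag names may differ between versions), and all paths and filenames are placeholders:

```python
from cnmf import cNMF
import numpy as np

# Python-API run; the commented lines show the assumed-equivalent CLI calls
# (flag names taken from the cNMF README and may differ between versions), so
# every parameter -- especially the seed -- can be matched across the two routes.
cnmf_obj = cNMF(output_dir="./cnmf_run", name="my_dataset")  # placeholder paths

# CLI: cnmf prepare --output-dir ./cnmf_run --name my_dataset -c counts.h5ad \
#          -k 5 6 7 8 9 10 11 12 --n-iter 100 --seed 14 --numgenes 2000
cnmf_obj.prepare(counts_fn="counts.h5ad", components=np.arange(5, 13),
                 n_iter=100, seed=14, num_highvar_genes=2000)

# CLI: cnmf factorize --output-dir ./cnmf_run --name my_dataset --worker-index 0 --total-workers 1
cnmf_obj.factorize(worker_i=0, total_workers=1)

# CLI: cnmf combine --output-dir ./cnmf_run --name my_dataset
cnmf_obj.combine()

# CLI: cnmf k_selection_plot --output-dir ./cnmf_run --name my_dataset
cnmf_obj.k_selection_plot()
```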