chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
109 stars 37 forks source link

Issue with SNPGenie_sliding_windows.R #72

Closed OmonkeyGOD closed 1 year ago

OmonkeyGOD commented 1 year ago

Hello,

Thank you for creating this program. I am using it to calculate the dN/dS of all genes between populations. I ran snpgenie_between_group.pl for each chromosome and generated the following files between_group_codon_results.txt and between_group_product_results.txt. And I saw in the manual you suggested running SNPGenie_sliding_windows.R after. I have the following questions:

  1. Whether the results from between_group_product_results.txt can be used directly?
  2. From between_group_product_results.txt, I have several products with very big dN/dS values, such as 557.8 and 375.7, for which, the dS values were extremely small, around 8.87E-06 and 3.43E-0.5. Does this normal?
  3. When I run SNPGenie_sliding_windows.R, I received the following error. Could you please let me know if you have any insights into this issue?
Rscript error

Looking forward to your reply. Thank you very much! Yue

singing-scientist commented 1 year ago

Greetings, Yue!

  1. Yes, the same sliding window and bootstrapping procedures can be used for the between-group results.
  2. Very large dN/dS resulting from very small dS is very common, and is more likely for shorter sequences with less overall variation. These values are typically the result of chance fluctuation rather than meaningful biological signal. See for example Austin Hughes and Robert Friedman, “Codon-based tests of positive selection, branch lengths, and the evolution of mammalian immune system genes,” Immunogenetics 60, no. 9 (2008): 495–506.
  3. This looks like an issue with an R update. Could you pleasure provide a minimal example (input files and exact code used) to reproduce this error so I may troubleshoot it?

Thanks so much! Chase

OmonkeyGOD commented 1 year ago

Hi Chase,

Thank you so much for your reply! That helps a lot. I have attached the example files for question 3. Meanwhile, I still have some following concerns regarding questions 1 & 2.

  1. My purpose is to find which genes are under positive selection by comparing two groups with different genotypes, and the between_group_product_results.txt generated by command perl snpgenie_between_group.pl --gtf_file=chrom.gtf --num_bootstraps=100 already contains dN/dS for each gene. Can I use the dN/dS values to infer whether the genes are under positive selection without directly? Since I didn't do sliding window analysis, I am wondering whether sliding window analysis is necessary?
  2. If I understand it correctly, when the overall variation of a gene is small, the gene is unlikely to be under positive selection, regardless of whether dN/dS is larger or not. Do you have any suggested criteria, e.g., dS, N_sites, or dN/dS ranges, to determine whether a gene is likely under positive selection?
  3. The code I use is Rscript SNPGenie_sliding_windows.R between_group_codon_results_onegene.txt N S 100 1 1000 40 NONE 6 > onegene.out. I tried it on the two files below and got errors: between_group_codon_results_onegene.txt between_group_codon_results.txt

Thanks in advance.

singing-scientist commented 1 year ago

Thank you!

  1. Detecting positive selection is a difficult problem with lots of pitfalls and debate. Statistically significant dN/dS > 1 is one piece of evidence suggestive of positive selection, but other issues (alignment error, stochastic error, etc.) can also cause it. Thus, dN/dS alone can be thought of as a mere candidate generator. Moreover, it usually doesn't make sense for a whole GENE to be under positive selection — if it were, then the entire sequence would be rapidly scrambled, maybe not even alignable. Instead, certain sites or small regions within genes may be under positive selection against a backdrop of purifying selection. This is why sliding windows can be so helpful. I recommend reading background and studies that address similar situations to yours to see what others may have done in similar circumstances.

  2. Again, I'm not sure how to understand a whole gene being under positive selection. Certainly a whole gene can be under purifying selection, causing overall low variation, while one codon might be under positive selection. This situation may or may not be possible/impossible to detect (i.e., differentiate from chance), depending on e.g., dS. What is your biological system, such that you expect an entire gene's length to be under positive selection?

  3. Thanks for the example! I have fixed the R version bugs that were leading to this problem; please download and use the new script. Additionally, please note that requiring a minimum codon count of 40 will eliminate ALL the input data from consideration, resulting in NAs, because your number of defined codons in group 2 is only 18. Thus, I suggest taking some time to consider which parameter values might make sense for your data. For example, you may also wish to reconsider your sliding window size of 100 codons.

Let me know if that helps! Chase

OmonkeyGOD commented 1 year ago

Hi Chase,

Thank you very much for your reply and updated script. Your explanation helped a lot.

Yue