chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
109 stars 37 forks source link

Sliding window analysis for the spliced protein #53

Closed Chongli-90 closed 2 years ago

Chongli-90 commented 3 years ago

Hi Chase,

I am now running the "SNPGenie_sliding_windows.R" for a spliced protein whose CDS is shown below:

M CALLER CDS 1 26 . + 0 gene_id "M2"; M CALLER CDS 715 982 . + 0 gene_id "M2"; M CALLER start_codon 1 3 . + 0 gene_id "M2"; M CALLER stop_codon 980 982 . + 0 gene_id "M2";

The codon results of this protein generated from the "snpgenie.pl" couldn't go through the "SNPGenie_sliding_windows.R" and shows the error message "ERROR: the input seems to contain codons for more than one gene/product. If this is not the case, your gene may contain multiple exons" which make sense since this is a spliced protein.

However, I found I can split the protein codons into two parts and run the sliding window analysis separately for each part (for example, I run the analysis firstly on CDS: 1-26 + 715 and then run the same analysis on CDS: 716-982). I don't know if this is the right way to do the sliding window analysis for this situation. Please advise. Thank you so much!

Regards, Chong

singing-scientist commented 3 years ago

Dear @Chongli-90: very sorry for the delay, and thanks for your patience. Let's see... okay, yes, the code that numbers the codons does so using the sites, which won't work for genes with multiple segments so I coded it to die. If you can provide what is needed to reproduce the error (exact command and input file[s]) then I can update the script tomorrow. Alternatively, if you're in a real pinch, you could comment out lines 118-123 and 125 of the bootstrap script, but know that the codons won't be numbered right. Let me know!

Chongli-90 commented 3 years ago

Dear Chase,

Thank you for your feedback! I have three spliced proteins (PA-X, M2, NS2) whose input documents are attached below. The commands I use to run the script are the same: Rscript SNPGenie_sliding_windows.R input_file N S 2 1 10000 100 NONE 6 > SNPGenie_sliding_windows_oneProduct.out. Please let me know if you need anything else.

Thank you so much for your help! Regards, Chong PA-X.txt M2.txt NS2.txt

singing-scientist commented 3 years ago

Greetings, @Chongli-90, and apologies again for the delay. I have update the script to deal with your case. Note that the numerically ordered site numbers must correspond to the numerically ordered codons for your gene product. This allows exons, trans-splicing, and frameshifted sequences. However, it would NOT be appropriate if (for example) exon1 was present at sites 1-9, exon2 at sites 101-109, and exon3 at sites 51-59 — in that case, the order of the sites (1-9, 51-59, 101-109) would NOT correspond to the order of the codons. Does this make sense?

Please give it a try and let me know if it works as expected. Please also note that 2 is almost never a sufficient window size. I recommend a minimum of 10 codons. A good rule is that the window size should be large enough so that dS is > 0 is most or every window.

Let me know! Chase

Chongli-90 commented 3 years ago

Dear Chase,

Thank you so much for updating the script for my cases! I also really appreciate your comments on the window size and it's very helpful for my research. I run the new R script with the command : SNPGenie_sliding_windows.R Input.txt N S 20 5 10000 100 NONE 6 > SNPGenie_sliding_windows_oneProduct.out. However, It seems the parameter of step size in the command is not working. No matter which number I typed as the step size in the command, the script will run as one codon per step size. I also attached my input and output files below. I don't know if it is the issue during I run the script or in the code. Please let me know how to handle it.

I also have a question regards on how to choose the window size and step size. I found when the window size becomes bigger, it's less likely to find the significant window because the non-mutated sites kind of diluting the N diff/ S diff within a window and there are much more non-mutated sites than the mutated sites. Since my goal is to identify the selective pressure at the codon level, do you think I can achieve it by doing the sliding window analysis? Thank you so much! Input.txt Output.tsv.zip

Regards,

singing-scientist commented 3 years ago

Dear @Chongli-90 :

You are quite right! Apologies, I coded it assuming a step size of 1. I have uploaded a new version that allows the user to specify any step size. A few notes:

  1. As much as possible, all decisions (e.g., step size, window size, selection) should be informed by a biological hypothesis. For example, purifying negative selection is by far the most common type of selection. Thus, there is no reason to expect to find positive selection. Even when a 'signal' of positive selection is 'detected', other factors besides selection (including, simply, random genetic drift or relaxation of purifying selection) could be the real cause. Thus, even when one finds evidence of positive selection for a particular window or codon, one should always be careful to remember there are other possibilities. Such a window or codon can be safely referred to as a "candidate" for positive selection, but there is no certainty.

  2. The power to detect either negative or positive selection depends upon the number of codons with variation observed in the data. It's very common that datasets do not have a sufficient amount of diversity to detect selection using counting methods or π, even at the whole gene level! A good rule is that there should be few or no windows with dS=0. Thus, if there's not enough diversity, it may be necessary to a) use larger windows; b) only analyze whole genes; or c) add additional data or samples or more powerful variant calling methods. Also, if you do analyze smaller window, be sure to use a false-discovery rate correction.

Hope that helps! Please let me know if the new version works.

Yours, Chase