chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
102 stars 37 forks source link

gtf error, SNPGenie terminated #10

Closed binshuangli closed 5 years ago

binshuangli commented 5 years ago

Hi, I tried to run SNPGenie on a whole genome to calculate the piN piS. I noticed for some genes SNPGenine will report something like this:

"The CDS coordinates for gene g25564 in the gtf file do not yield a set of complete codons, or are absent from the file. The number of nucleotides must be a multiple of 3."

It seems something wrong with the gtf file. I tried both the gtf file from the official reference of our species and one made by myself using augustus. It always have the same issue. Is there any way to fix this? Thank you!

singing-scientist commented 5 years ago

Hello @binshuangli — thanks very much for using SNPGenie! I am sorry for this trouble.

Have you made sure that the total nucleotide length of g25564 (and the other genes) is indeed evenly divisible by 3 (i.e., a complete codon set)?

Unfortunately, I am not capable of pinpointing the problem without the input files. Please feel free to send me a link to the problematic data (just a subset is fine), and I'll be happy to take a look.

Yours, Chase

binshuangli commented 5 years ago

Hi Chase,

Glad to hear from you! I have subset one scaffold of the files in the following folder. https://drive.google.com/drive/folders/1puaakJB1SZH9lpgYi7CWwJDtFrP6EeZp

I tried myself to correct the potential error in the gtf file as you mentioned, making sure the codon is is divisible by 3. But there're always some issues which I do not quite understand. Please let me know if you need more information. Thank you so much for looking into this!

singing-scientist commented 5 years ago

Many thanks for sharing the example data, @binshuangli !

I had a chance to try this an encountered the same error:

WARNING: The CDS coordinates for gene ACYPI087937 in the gtf file do not yield a set of complete codons, or are absent from the file. The number of nucleotides must be a multiple of 3. SNPGenie terminated.

Thus, I examined the gene ACYPI087937 in the GTF file, and found the following three records:

GL349621 AphidBase_OGS2.1 CDS 1040681 1041076 . + 1 transcript_id "ACYPI064727-RA"; gene_id "ACYPI064727"; gene_name "acyp2eg0000097"; GL349621 AphidBase_OGS2.1 CDS 1042856 1043063 . + 0 transcript_id "ACYPI087937-RA"; gene_id "ACYPI087937"; gene_name "acyp2eg0000098"; GL349621 AphidBase_OGS2.1 CDS 1043161 1043218 58.00 + 2 transcript_id "ACYPI087937-RA"; gene_id "ACYPI087937"; gene_name "acyp2eg0000098";

Summing the lengths of the three gene segments results in:

(1041076-1040681+1) + (1043063-1042856+1) + (1043218-1043161+1) = 662 nucleotides

Unfortunately, 662 nucleotides is not a complete codon set, as 662/3 = 220.6666666667.

Thus, SNPGenie's error is correct, namely, that this (and potentially other) CDS annotations in the GTF file do not form complete codon sets. Unfortunately, dN/dS and πN/πS analyses are impossible without precisely aligned, complete codon sets. Generally, CDS segments should sum to a nucleotide length evenly divisible by 3. If you feel this might represent just a few isolated annotation errors, you could try manually removing the relevant lines of the GTF file each time SNPGenie gives you an error. You could also consider researching the exact genes to see whether they are known to exhibit some sort of translational slippage (but they'll still need to be removed). Whatever you decide, every gene_id annotation will need to fulfill the divisible-by-3 criterion for SNPGenie to work (see the Documentation).

Yours, Chase

binshuangli commented 5 years ago

Hi Chase,

Thank you for your suggestions! I have written a script to modify the gtf file and it seems to be working now. My script is here: https://github.com/binshuangli/miscellaneous/blob/994ee0e52c10ad44da19ea568f0b3bc035a53a87/correct_gtf.py

Basically I adjusted the start/end position (depends on it's + or minus strand) if it has a remainder after dividing by 3. I also used the phase information (0,1,2) to adjust the start position because it seems won't work if I left it as it is. For example supposedly phase = 2 would mean the the codon start at the third position, but the sequence length won't be divided by 3 evenly. So I need to also adjust the start position and change phase to 0.

I'm closing this ticket now. I do have some other questions regarding multifasta files which I will open a new ticket.