Closed DRL closed 8 years ago
Looks like the CDS phase is off in this particular example: the first CDS should have a phase of 0, but in your example it has a phase of 2. Does the EVM script correct for phase? If you have GenomeTools installed, you can check your input GFF for correct phase and other problems by using gt gff3, which will print warnings and can fix your GFF file.
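For a quick look without any extra tools, the check described above can be sketched in a few lines of Python. This is only an illustration and not part of GAG or GenomeTools: the function name `first_cds_phases` is made up here, and it assumes a tab-separated GFF3 in which every CDS line carries a single `Parent` attribute.

```python
from collections import defaultdict


def first_cds_phases(gff_text):
    """Return {transcript_id: phase} for the 5'-most CDS segment of each transcript.

    Hypothetical helper: assumes tab-separated GFF3 and one Parent per CDS line.
    """
    cds_by_parent = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        start, end, strand = int(cols[3]), int(cols[4]), cols[6]
        phase = int(cols[7]) if cols[7].isdigit() else None
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        cds_by_parent[attrs.get("Parent")].append((start, end, strand, phase))

    result = {}
    for parent, segments in cds_by_parent.items():
        if segments[0][2] == "-":
            # On the minus strand the 5'-most segment has the highest end.
            first = max(segments, key=lambda s: s[1])
        else:
            first = min(segments, key=lambda s: s[0])
        result[parent] = first[3]
    return result
```

Any transcript whose value is not 0 either starts mid-codon (a 5'-partial model) or has a wrong phase, so those are the ones worth inspecting by hand.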
On May 8, 2016, at 6:47 PM, Dominik R Laetsch notifications@github.com wrote:
Hi, when trying to convert a gff to tbl for NCBI submission, I discovered that some proteins were parsed erroneously.
The GFF lines from genome.gff for one of the proteins (GROS_g00189) in question:
GROS_00002 AUGUSTUS gene 1 2817 . + . ID=GROS_g00189
GROS_00002 AUGUSTUS mRNA 1 2817 . + . ID=GROS_g00189.t1;Parent=GROS_g00189
GROS_00002 AUGUSTUS exon 1 197 . + . ID=GROS_g00189.t1.exon1;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 253 378 . + . ID=GROS_g00189.t1.exon2;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 429 586 . + . ID=GROS_g00189.t1.exon3;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 689 827 . + . ID=GROS_g00189.t1.exon4;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 960 1044 . + . ID=GROS_g00189.t1.exon5;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 1104 1226 . + . ID=GROS_g00189.t1.exon6;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 1267 1328 . + . ID=GROS_g00189.t1.exon7;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 1437 1573 . + . ID=GROS_g00189.t1.exon8;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 1636 1807 . + . ID=GROS_g00189.t1.exon9;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 1930 2064 . + . ID=GROS_g00189.t1.exon10;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 2110 2163 . + . ID=GROS_g00189.t1.exon11;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 2423 2491 . + . ID=GROS_g00189.t1.exon12;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 2572 2709 . + . ID=GROS_g00189.t1.exon13;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 2767 2817 . + . ID=GROS_g00189.t1.exon14;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1 197 . + 2 ID=GROS_g00189.t1.CDS1;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 253 378 . + 0 ID=GROS_g00189.t1.CDS2;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 429 586 . + 0 ID=GROS_g00189.t1.CDS3;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 689 827 . + 1 ID=GROS_g00189.t1.CDS4;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 960 1044 . + 0 ID=GROS_g00189.t1.CDS5;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1104 1226 . + 2 ID=GROS_g00189.t1.CDS6;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1267 1328 . + 2 ID=GROS_g00189.t1.CDS7;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1437 1573 . + 0 ID=GROS_g00189.t1.CDS8;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1636 1807 . + 1 ID=GROS_g00189.t1.CDS9;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1930 2064 . + 0 ID=GROS_g00189.t1.CDS10;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 2110 2163 . + 0 ID=GROS_g00189.t1.CDS11;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 2423 2491 . + 0 ID=GROS_g00189.t1.CDS12;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 2572 2709 . + 0 ID=GROS_g00189.t1.CDS13;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 2767 2817 . + 0 ID=GROS_g00189.t1.CDS14;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS stop_codon 2815 2817 . + . ID=GROS_g00189.t1:stop;Parent=GROS_g00189.t1
The protein as it is written to genome.proteins.fasta:
GROS_g00189.t1 protein
GGVVVSV_WERVNCHEKNSAKFSSCAAF_RSPGCTRASSPTWQCPIWSTRTTSSSRSTTNGPYK_RLFGFH_TDLPGSPSTASSWTTSTANALRTSRSQFCAKLSKLKCAANVQLPMLPYHRTMIVQYHPQA_TLSPRLPCRPRRSSSQRRLVPMEDKWKSPRARRCQKFLPIPRHRKRKNRMRQTKSANPTAVSLFTAPTICPQTLPMICSQRKFATHFWSLTASN_PPTGRSK_VGRRRNGWPNSEKMALENGKNCGWN_TVRWRRKSGRSF_RKIAINWSNLWTLFLSDKICRGLCLIAMAIKRAKCARNLQNYWTN_KKKVRPKGRTAANVPAFAWHLGFPCVVPSFPKRTMSKH_IGTRSPYCSTPAWPSPSKRRVC_IRCTRLA_HRKQKRKRRTRH_PLGFPRG_TRRRWCWNCPRSACCRSARRTNKRSQNALDGRRYANFSRRQAAADVLNTMRFVRIGTPGTSG_AMRTPKQSNTRSIGQLPDGWAAC_YARWTTTTRATRAKRAPFRCCTPCTRHAAG_RNPMWSCAI_WACT_AVA
How the protein should look (generated with EVidenceModeler's gff3_file_to_proteins.pl using the genome.gff):
GROS_g00189.t1
GGGGFSLMGKGELSRKEFGQILKLCGILKEPRVHSRFVSDLAVPYLVNKDNEFIAFDNKR
SIQIKTVWISLNGFAGIALHGVELDNVDGECPQDESFPILRQIVETQMCSKCSIANVTVP
SDHDSTVSSTSVNTKPATSMSSQEEFIAEAISAYGGQMEKPPGAAMPKVSANSEAPQTQK
SDAPNEKCQSNGRFSLYCAHNLSTNIANDLLTTEICDSLLVFDRVELTSDGTIKVSGEAE
EWMAKFGKNGIGKRQKLWVELNCSMASEEWAKLLKENRHKLVESLDTFFVRQNLSGIVLN
CDGHQTGEVREEFTKLLDELKEKSEAKRADSGECAGIRLAFRLPLRRSVLSEAYNVKALN
RHSVTVLLDAGMAFSKQKTRLLNPLYAVGIAQKAETQTQNTTLAAWLSEGLNAAQVVLEL
PAFGLLQKRETDEQAEPKRIGRAEICKFQQKAGGSGRTQYDAVCSYWDTGDEWVSNENAE
TVKYKVHWAIARRLGGVLIRALDDDDPSNACKKGAFPLLHAMHEARCRLKKSHVVMRDLM
GMHMSRG*
Could you explain the difference to me?
cheers,
dom
— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/genomeannotation/GAG/issues/165
Yes, it starts with phase 2, so the first complete codon begins at base 3.
Surprisingly, Genometools does not complain about this. The GFF has been passed through Genometools previously and I just tested it again.
GROS_00002 AUGUSTUS gene 1 2817 0.68 + . ID=GROS_g00189
GROS_00002 AUGUSTUS mRNA 1 2817 0.68 + . ID=GROS_g00189.t1;Parent=GROS_g00189
GROS_00002 AUGUSTUS internal 1 197 0.68 + 2 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1 197 0.68 + 2 ID=GROS_g00189.t1.CDS1;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 1 197 . + . ID=GROS_g00189.t1.exon1;Parent=GROS_g00189.t1
GROS_00002 . intron 198 252 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 253 378 . + . ID=GROS_g00189.t1.exon2;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 253 378 1 + 0 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 253 378 1 + 0 ID=GROS_g00189.t1.CDS2;Parent=GROS_g00189.t1
GROS_00002 . intron 379 428 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 429 586 1 + 0 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 429 586 . + . ID=GROS_g00189.t1.exon3;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 429 586 1 + 0 ID=GROS_g00189.t1.CDS3;Parent=GROS_g00189.t1
GROS_00002 . intron 587 688 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 689 827 . + . ID=GROS_g00189.t1.exon4;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 689 827 1 + 1 ID=GROS_g00189.t1.CDS4;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 689 827 1 + 1 Parent=GROS_g00189.t1
GROS_00002 . intron 828 959 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 960 1044 1 + 0 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 960 1044 . + . ID=GROS_g00189.t1.exon5;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 960 1044 1 + 0 ID=GROS_g00189.t1.CDS5;Parent=GROS_g00189.t1
GROS_00002 . intron 1045 1103 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1104 1226 1 + 2 ID=GROS_g00189.t1.CDS6;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 1104 1226 . + . ID=GROS_g00189.t1.exon6;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 1104 1226 1 + 2 Parent=GROS_g00189.t1
GROS_00002 . intron 1227 1266 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 1267 1328 . + . ID=GROS_g00189.t1.exon7;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 1267 1328 1 + 2 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1267 1328 1 + 2 ID=GROS_g00189.t1.CDS7;Parent=GROS_g00189.t1
GROS_00002 . intron 1329 1436 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 1437 1573 . + . ID=GROS_g00189.t1.exon8;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 1437 1573 1 + 0 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1437 1573 1 + 0 ID=GROS_g00189.t1.CDS8;Parent=GROS_g00189.t1
GROS_00002 . intron 1574 1635 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 1636 1807 1 + 1 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 1636 1807 . + . ID=GROS_g00189.t1.exon9;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1636 1807 1 + 1 ID=GROS_g00189.t1.CDS9;Parent=GROS_g00189.t1
GROS_00002 . intron 1808 1929 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 1930 2064 1 + 0 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 1930 2064 . + . ID=GROS_g00189.t1.exon10;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 1930 2064 1 + 0 ID=GROS_g00189.t1.CDS10;Parent=GROS_g00189.t1
GROS_00002 . intron 2065 2109 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 2110 2163 1 + 0 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 2110 2163 1 + 0 ID=GROS_g00189.t1.CDS11;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 2110 2163 . + . ID=GROS_g00189.t1.exon11;Parent=GROS_g00189.t1
GROS_00002 . intron 2164 2422 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 2423 2491 1 + 0 ID=GROS_g00189.t1.CDS12;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 2423 2491 1 + 0 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 2423 2491 . + . ID=GROS_g00189.t1.exon12;Parent=GROS_g00189.t1
GROS_00002 . intron 2492 2571 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS internal 2572 2709 1 + 0 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 2572 2709 1 + 0 ID=GROS_g00189.t1.CDS13;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 2572 2709 . + . ID=GROS_g00189.t1.exon13;Parent=GROS_g00189.t1
GROS_00002 . intron 2710 2766 . + . Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS terminal 2767 2817 1 + 0 Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS CDS 2767 2817 1 + 0 ID=GROS_g00189.t1.CDS14;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS exon 2767 2817 . + . ID=GROS_g00189.t1.exon14;Parent=GROS_g00189.t1
GROS_00002 AUGUSTUS stop_codon 2815 2817 . + 0 Parent=GROS_g00189.t1
EVM corrects for the phase and prints the correct protein.
So, would you suggest that I should modify the GFF file myself? E.g.:
if phase-of-first-CDS != 0:
fix start-position
Or is there a hidden flag somewhere in the code so that it takes into account weird phases ...? Or can you guys easily implement this?
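The "fix start-position" idea above could look roughly like the following. This is a sketch under stated assumptions, not anything GAG provides: `trim_to_first_codon` is a hypothetical helper, CDS segments are given as `(start, end, phase)` tuples with 1-based inclusive coordinates, and the first segment is assumed to be longer than its phase. Only the CDS is trimmed; exon, mRNA and gene coordinates are left alone.

```python
def trim_to_first_codon(cds_segments, strand="+"):
    """Trim the 5'-most CDS segment so that it starts on a complete codon.

    cds_segments: list of (start, end, phase) tuples, 1-based inclusive.
    Returns the segments in transcription order, first segment at phase 0.
    Hypothetical helper for illustration only.
    """
    ordered = sorted(cds_segments, key=lambda s: s[0], reverse=(strand == "-"))
    start, end, phase = ordered[0]
    if phase == 0:
        return ordered
    if strand == "-":
        end -= phase    # the 5' end of a minus-strand segment is its end coordinate
    else:
        start += phase  # skip the leading bases of the incomplete codon
    return [(start, end, 0)] + ordered[1:]


# First two CDS segments of GROS_g00189.t1 from this thread:
print(trim_to_first_codon([(1, 197, 2), (253, 378, 0)]))
```

The trimmed model drops the one or two leading bases of the partial codon from the CDS, so whether this is acceptable depends on how the partial gene should be represented in the submission.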
cheers,
dom
I'm surprised GenomeTools doesn't flag it. I guess that's because the model is 'partial' in the sense that it doesn't start with 'ATG', so the phase doesn't have to start at 0? I don't know what GAG uses internally to validate (or maybe it doesn't), but eventually, in the NCBI .tbl format, I think it would need to be classified as partial. What does GAG output for this gene model in the genome.tbl file? The only thing that gets passed to tbl2asn is the genome.tbl file, so perhaps comparing that to your GFF model will at least identify where the error is? I feel like I've run into problems like this before, but most of them were of my own doing, i.e. I manually edited a gene model and the CDS phase wasn't corrected, and thus the gene model in GBK format had internal stops, etc.
Yes, I think that is because it does not start properly.
Entry in genome.tbl is:
locus_tag GROS_g00189
<1 197 mRNA
253 378
429 586
689 827
960 1044
1104 1226
1267 1328
1437 1573
1636 1807
1930 2064
2110 2163
2423 2491
2572 2709
2767 2817
product hypothetical protein
protein_id gnl|ncbi|GROS_g00189.t1
transcript_id gnl|ncbi|GROS_g00189.t1_mrna
<1 197 CDS
253 378
429 586
689 827
960 1044
1104 1226
1267 1328
1437 1573
1636 1807
1930 2064
2110 2163
2423 2491
2572 2709
2767 2817
codon_start 3
product hypothetical protein
protein_id gnl|ncbi|GROS_g00189.t1
transcript_id gnl|ncbi|GROS_g00189.t1_mrna
Does "codon_start 3" refer to this? Could it be that the genome.tbl is actually printed correctly?
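As far as I understand the feature table definition, codon_start is the 1-based offset, counted from the first base of the CDS feature as a whole, at which the first complete codon begins. A GFF3 phase of 2 on the first CDS segment would then correspond to codon_start 3. The small sketch below spells out that arithmetic; both function names are made up for illustration.

```python
def codon_start_from_phase(phase):
    """Map the GFF3 phase of the 5'-most CDS segment to a codon_start value."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    return phase + 1


def first_codon_position(start, end, strand, phase):
    """Genomic coordinate of the first base of the first complete codon."""
    return start + phase if strand == "+" else end - phase


# First CDS segment of GROS_g00189.t1: 1..197 on the plus strand, phase 2.
print(codon_start_from_phase(2))              # 3
print(first_codon_position(1, 197, "+", 2))   # 3
```

Under that reading, the `<1 197 CDS` line combined with `codon_start 3` describes the same reading frame as the phase-2 CDS in the GFF.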
The problem in this genome is that we are quite sure about some genes at the beginning and end of scaffolds, since we did manual annotations. So they are partial, but we still want them to show up.
cheers,
dom
Here are some examples of partial codons; however, I don't see any examples with multiple exons/CDSs. I wonder if setting codon_start 3 in your example is interpreted as setting the codon start of every exon in the gene model to 3, when really what you want is for the first codon start to be 3 and the rest to be 1? I'm really not sure, just a guess and trying to help. I know how frustrating submission is....
What if you try changing the gene model to start from position 3, i.e. so its CDS phase in the GFF would be 0? If you run that through GAG, it might make the codon_start 1. I double-checked my tbl files that have passed NCBI and all of them have codon_start set to 1. Note that I think if you change the gene model, the rest of the phases will change, as they are dependent on the previous phase, so perhaps run it through GenomeTools again before GAG?
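Whether the downstream phases change can be checked by recomputing them from the segment lengths. The sketch below assumes the usual GFF3 meaning of phase (the number of bases to skip at the start of a segment to reach the next codon); `recompute_phases` is a hypothetical helper, and the coordinates are the CDS segments of GROS_g00189.t1 from this thread.

```python
def recompute_phases(cds_segments, first_phase=0):
    """Derive the phase of each CDS segment from the one before it.

    cds_segments: (start, end) tuples, 1-based inclusive, in transcription order.
    Hypothetical helper for illustration only.
    """
    phases = []
    phase = first_phase
    for start, end in cds_segments:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after the last complete codon decide how many bases
        # of the next segment are still needed to finish that codon.
        phase = (3 - (length - phase) % 3) % 3
    return phases


segments = [(1, 197), (253, 378), (429, 586), (689, 827), (960, 1044),
            (1104, 1226), (1267, 1328), (1437, 1573), (1636, 1807),
            (1930, 2064), (2110, 2163), (2423, 2491), (2572, 2709),
            (2767, 2817)]

original = recompute_phases(segments, first_phase=2)
trimmed = recompute_phases([(3, 197)] + segments[1:], first_phase=0)
print(original)
print(trimmed)
```

With these coordinates, the recomputed phases match the ones in the GFF, and trimming the first segment to start at position 3 changes only the first phase (from 2 to 0). The downstream phases stay the same, because the reading frame itself has not moved.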
Will try to fix the phasing in those genes and see what comes out at the other end when passing it through GAG.
Thanks for the help, will report back if it works.
However, it might be cool if this gets solved at some point at your end, since there are not many pieces of software that can generate NCBI-submittable files (i.e. I really want your software to work for me), and if I hadn't checked my proteins I would have submitted garbage. At the least, maybe list in the stats file how many proteins have insane amounts of stop codons.
Keep up the good work!
cheers,
dom
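Until something like that exists in the stats output, the check for proteins full of stop codons can be done on the FASTA file directly. This is a rough sketch, not GAG functionality: `internal_stop_counts` is a made-up name, and because the mistranslated sequence pasted earlier in this thread shows `_` where stops would be expected, both `*` and `_` are counted as stop symbols here (an assumption).

```python
def internal_stop_counts(fasta_text, stop_chars="*_"):
    """Count stop symbols inside each protein, ignoring trailing ones.

    Hypothetical helper; treating "_" as a stop symbol is an assumption.
    """
    sequences = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = (line[1:].split() or [""])[0]
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)

    counts = {}
    for name, parts in sequences.items():
        seq = "".join(parts).rstrip(stop_chars)  # a terminal stop is expected
        counts[name] = sum(seq.count(c) for c in stop_chars)
    return counts


example = ">ok protein\nMKV*\n>suspicious\nGG_W*RK_\n"
print(internal_stop_counts(example))
```

Any protein with a non-zero count is worth comparing against the GFF before submission.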
I'm just a GAG user like you dom! Just trying to help. Hopefully @bruab or one of the other GAG developers will see the thread and respond.
Alright! I thought you were one of the developers :)
Thanks for stepping in then!
I'm one of the original developers, but it was part of a student job that is coming to a close very soon (I'm graduating this week :tada: ).
After going through the code, it does look like GAG just completely ignores the phase when it comes to protein translations. It used to check each possible phase for the most sane looking sequence (i.e. with start and stop codons or without internal stops). Not sure what happened to that code.
I'll try to get this resolved when I go in on Friday.
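The "check each possible phase for the most sane looking sequence" approach mentioned above can be sketched as follows. This is not GAG's code, only an illustration of the idea: it uses the standard genetic code, translates the spliced CDS in all three frames, and picks the frame with the fewest internal stops. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds, phase=0):
    """Translate a spliced CDS, skipping `phase` leading bases."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(phase, len(cds) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def best_phase(cds):
    """Return the phase (0, 1 or 2) whose translation has the fewest internal stops."""
    def internal_stops(protein):
        return protein.rstrip("*").count("*")
    return min((0, 1, 2), key=lambda p: internal_stops(translate(cds, p)))


toy_cds = "TTATGACTAACTGA"  # two extra leading bases, then ATG ACT AAC TGA
print(best_phase(toy_cds), translate(toy_cds, best_phase(toy_cds)))
```

When the GFF already carries a phase for the first CDS, using that value directly is probably safer than guessing; a frame search like this seems more useful as a fallback or as a sanity check on the annotated phase.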
Congratulations, man! Well done! :)
Thanks for looking into this!
cheers,
dom
Hey @DRL let me know if the patch I just pushed fixes your issue. Thanks!