Problems with proteins generated from GFF file

DRL commented 8 years ago

Hi, when trying to convert a gff to tbl for NCBI submission, I discovered that some proteins were parsed erroneously.

The GFF lines from genome.gff for one of the proteins (GROS_g00189) in question:

GROS_00002      AUGUSTUS        gene    1       2817    .       +       .       ID=GROS_g00189
GROS_00002      AUGUSTUS        mRNA    1       2817    .       +       .       ID=GROS_g00189.t1;Parent=GROS_g00189
GROS_00002      AUGUSTUS        exon    1       197     .       +       .       ID=GROS_g00189.t1.exon1;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    253     378     .       +       .       ID=GROS_g00189.t1.exon2;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    429     586     .       +       .       ID=GROS_g00189.t1.exon3;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    689     827     .       +       .       ID=GROS_g00189.t1.exon4;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    960     1044    .       +       .       ID=GROS_g00189.t1.exon5;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    1104    1226    .       +       .       ID=GROS_g00189.t1.exon6;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    1267    1328    .       +       .       ID=GROS_g00189.t1.exon7;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    1437    1573    .       +       .       ID=GROS_g00189.t1.exon8;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    1636    1807    .       +       .       ID=GROS_g00189.t1.exon9;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    1930    2064    .       +       .       ID=GROS_g00189.t1.exon10;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    2110    2163    .       +       .       ID=GROS_g00189.t1.exon11;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    2423    2491    .       +       .       ID=GROS_g00189.t1.exon12;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    2572    2709    .       +       .       ID=GROS_g00189.t1.exon13;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    2767    2817    .       +       .       ID=GROS_g00189.t1.exon14;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1       197     .       +       2       ID=GROS_g00189.t1.CDS1;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     253     378     .       +       0       ID=GROS_g00189.t1.CDS2;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     429     586     .       +       0       ID=GROS_g00189.t1.CDS3;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     689     827     .       +       1       ID=GROS_g00189.t1.CDS4;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     960     1044    .       +       0       ID=GROS_g00189.t1.CDS5;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1104    1226    .       +       2       ID=GROS_g00189.t1.CDS6;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1267    1328    .       +       2       ID=GROS_g00189.t1.CDS7;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1437    1573    .       +       0       ID=GROS_g00189.t1.CDS8;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1636    1807    .       +       1       ID=GROS_g00189.t1.CDS9;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1930    2064    .       +       0       ID=GROS_g00189.t1.CDS10;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     2110    2163    .       +       0       ID=GROS_g00189.t1.CDS11;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     2423    2491    .       +       0       ID=GROS_g00189.t1.CDS12;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     2572    2709    .       +       0       ID=GROS_g00189.t1.CDS13;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     2767    2817    .       +       0       ID=GROS_g00189.t1.CDS14;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        stop_codon      2815    2817    .       +       .       ID=GROS_g00189.t1:stop;Parent=GROS_g00189.t1

The protein as written to genome.proteins.fasta:

>GROS_g00189.t1 protein
GGVVVSV*WERVNCHEKNSAKFSSCAAF*RSPGCTRASSPTWQCPIWSTRTTSSSRSTTNGPYK*RLFGFH*TDLPGSPSTASSWTTSTANALRTSRSQFCAKLSKLKCAANVQLPMLPYHRTMIVQYHPQA*TLSPRLPCRPRRSSSQRRLVPMEDKWKSPRARRCQKFLPIPRHRKRKNRMRQTKSANPTAVSLFTAPTICPQTLPMICSQRKFATHFWSLTASN*PPTGRSK*VGRRRNGWPNSEKMALENGKNCGWN*TVRWRRKSGRSF*RKIAINWSNLWTLFLSDKICRGLCLIAMAIKRAKCARNLQNYWTN*KKKVRPKGRTAANVPAFAWHLGFPCVVPSFPKRTMSKH*IGTRSPYCSTPAWPSPSKRRVC*IRCTRLA*HRKQKRKRRTRH*PLGFPRG*TRRRWCWNCPRSACCRSARRTNKRSQNALDGRRYANFSRRQAAADVLNTMRFVRIGTPGTSG*AMRTPKQSNTRSIGQLPDGWAAC*YARWTTTTRATRAKRAPFRCCTPCTRHAAG*RNPMWSCAI*WACT*AVA

What the protein should look like (generated with EVidenceModeler's gff3_file_to_proteins.pl using the genome.gff):

>GROS_g00189.t1
GGGGFSLMGKGELSRKEFGQILKLCGILKEPRVHSRFVSDLAVPYLVNKDNEFIAFDNKR
SIQIKTVWISLNGFAGIALHGVELDNVDGECPQDESFPILRQIVETQMCSKCSIANVTVP
SDHDSTVSSTSVNTKPATSMSSQEEFIAEAISAYGGQMEKPPGAAMPKVSANSEAPQTQK
SDAPNEKCQSNGRFSLYCAHNLSTNIANDLLTTEICDSLLVFDRVELTSDGTIKVSGEAE
EWMAKFGKNGIGKRQKLWVELNCSMASEEWAKLLKENRHKLVESLDTFFVRQNLSGIVLN
CDGHQTGEVREEFTKLLDELKEKSEAKRADSGECAGIRLAFRLPLRRSVLSEAYNVKALN
RHSVTVLLDAGMAFSKQKTRLLNPLYAVGIAQKAETQTQNTTLAAWLSEGLNAAQVVLEL
PAFGLLQKRETDEQAEPKRIGRAEICKFQQKAGGSGRTQYDAVCSYWDTGDEWVSNENAE
TVKYKVHWAIARRLGGVLIRALDDDDPSNACKKGAFPLLHAMHEARCRLKKSHVVMRDLM
GMHMSRG*

Could you explain the difference to me?

cheers,

dom

nextgenusfs commented 8 years ago

It looks like the CDS phase is off in this particular example: the first CDS should have a phase of 0, but in your example it has a phase of 2. Does the EVM script correct for phase? If you have GenomeTools installed, you can check your input GFF for correct phase and other problems with gt gff3, which will print warnings and can fix your GFF file.

On May 8, 2016, at 6:47 PM, Dominik R Laetsch notifications@github.com wrote:

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/genomeannotation/GAG/issues/165

DRL commented 8 years ago

Yes, it starts with phase 2, so the first base of the first complete codon is at position 3.

Surprisingly, GenomeTools does not complain about this. The GFF has been passed through GenomeTools previously and I just tested it again.

GROS_00002      AUGUSTUS        gene    1       2817    0.68    +       .       ID=GROS_g00189
GROS_00002      AUGUSTUS        mRNA    1       2817    0.68    +       .       ID=GROS_g00189.t1;Parent=GROS_g00189
GROS_00002      AUGUSTUS        internal        1       197     0.68    +       2       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1       197     0.68    +       2       ID=GROS_g00189.t1.CDS1;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    1       197     .       +       .       ID=GROS_g00189.t1.exon1;Parent=GROS_g00189.t1
GROS_00002      .       intron  198     252     .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    253     378     .       +       .       ID=GROS_g00189.t1.exon2;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        253     378     1       +       0       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     253     378     1       +       0       ID=GROS_g00189.t1.CDS2;Parent=GROS_g00189.t1
GROS_00002      .       intron  379     428     .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        429     586     1       +       0       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    429     586     .       +       .       ID=GROS_g00189.t1.exon3;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     429     586     1       +       0       ID=GROS_g00189.t1.CDS3;Parent=GROS_g00189.t1
GROS_00002      .       intron  587     688     .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    689     827     .       +       .       ID=GROS_g00189.t1.exon4;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     689     827     1       +       1       ID=GROS_g00189.t1.CDS4;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        689     827     1       +       1       Parent=GROS_g00189.t1
GROS_00002      .       intron  828     959     .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        960     1044    1       +       0       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    960     1044    .       +       .       ID=GROS_g00189.t1.exon5;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     960     1044    1       +       0       ID=GROS_g00189.t1.CDS5;Parent=GROS_g00189.t1
GROS_00002      .       intron  1045    1103    .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1104    1226    1       +       2       ID=GROS_g00189.t1.CDS6;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    1104    1226    .       +       .       ID=GROS_g00189.t1.exon6;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        1104    1226    1       +       2       Parent=GROS_g00189.t1
GROS_00002      .       intron  1227    1266    .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    1267    1328    .       +       .       ID=GROS_g00189.t1.exon7;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        1267    1328    1       +       2       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1267    1328    1       +       2       ID=GROS_g00189.t1.CDS7;Parent=GROS_g00189.t1
GROS_00002      .       intron  1329    1436    .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    1437    1573    .       +       .       ID=GROS_g00189.t1.exon8;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        1437    1573    1       +       0       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1437    1573    1       +       0       ID=GROS_g00189.t1.CDS8;Parent=GROS_g00189.t1
GROS_00002      .       intron  1574    1635    .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        1636    1807    1       +       1       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    1636    1807    .       +       .       ID=GROS_g00189.t1.exon9;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1636    1807    1       +       1       ID=GROS_g00189.t1.CDS9;Parent=GROS_g00189.t1
GROS_00002      .       intron  1808    1929    .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        1930    2064    1       +       0       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    1930    2064    .       +       .       ID=GROS_g00189.t1.exon10;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     1930    2064    1       +       0       ID=GROS_g00189.t1.CDS10;Parent=GROS_g00189.t1
GROS_00002      .       intron  2065    2109    .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        2110    2163    1       +       0       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     2110    2163    1       +       0       ID=GROS_g00189.t1.CDS11;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    2110    2163    .       +       .       ID=GROS_g00189.t1.exon11;Parent=GROS_g00189.t1
GROS_00002      .       intron  2164    2422    .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     2423    2491    1       +       0       ID=GROS_g00189.t1.CDS12;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        2423    2491    1       +       0       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    2423    2491    .       +       .       ID=GROS_g00189.t1.exon12;Parent=GROS_g00189.t1
GROS_00002      .       intron  2492    2571    .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        internal        2572    2709    1       +       0       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     2572    2709    1       +       0       ID=GROS_g00189.t1.CDS13;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    2572    2709    .       +       .       ID=GROS_g00189.t1.exon13;Parent=GROS_g00189.t1
GROS_00002      .       intron  2710    2766    .       +       .       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        terminal        2767    2817    1       +       0       Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        CDS     2767    2817    1       +       0       ID=GROS_g00189.t1.CDS14;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        exon    2767    2817    .       +       .       ID=GROS_g00189.t1.exon14;Parent=GROS_g00189.t1
GROS_00002      AUGUSTUS        stop_codon      2815    2817    .       +       0       Parent=GROS_g00189.t1

EVM corrects for the phase and prints the correct protein.

So, would you suggest that I should modify the GFF file myself? E.g.:

if phase-of-first-CDS != 0:
     fix start-position

Or is there a hidden flag somewhere in the code so that it takes into account weird phases ...? Or can you guys easily implement this?
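
A minimal sketch of the "fix start-position" idea above, assuming GFF3 phase semantics (the phase is the number of bases to skip at the start of a CDS segment to reach the first complete codon; on the minus strand it is counted from the segment's end coordinate). The `Cds` class and `trim_first_cds` function are hypothetical names for illustration and are not part of GAG. Only CDS coordinates are touched; exon, mRNA and gene features would stay as they are.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cds:
    start: int   # 1-based, inclusive (GFF convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"
    phase: int   # 0, 1 or 2


def trim_first_cds(segments):
    """Return the segments in translation order, with the first one
    trimmed so that it begins on a complete codon (phase 0)."""
    if not segments:
        return []
    strand = segments[0].strand
    # Translation order: ascending start on "+", descending on "-".
    ordered = sorted(segments, key=lambda c: c.start, reverse=(strand == "-"))
    first = ordered[0]
    if first.phase == 0:
        return ordered
    if strand == "+":
        fixed = replace(first, start=first.start + first.phase, phase=0)
    else:
        # On the minus strand the 5' end of the CDS is its `end` coordinate.
        fixed = replace(first, end=first.end - first.phase, phase=0)
    return [fixed] + ordered[1:]
```

For GROS_g00189.t1, the first CDS (1 to 197, phase 2) would become 3 to 197 with phase 0. Whether GAG then still marks the model as 5' partial is not something this sketch can answer.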

cheers,

dom

nextgenusfs commented 8 years ago

I'm surprised GenomeTools doesn't flag it. I guess since the model is 'partial' in the sense that it doesn't start with 'ATG', the phase doesn't have to start with 0? I don't know what GAG uses internally to validate (or maybe it doesn't), but eventually in the NCBI .tbl format I think it would need to be classified as partial. What does GAG output for this gene model in the genome.tbl file? The only thing that gets passed to tbl2asn is the genome.tbl file, so perhaps comparing that to your GFF model will at least identify where the error is. I feel like I've run into problems like this before, but most of them were of my own doing, i.e. I manually edited a gene model, the CDS phase wasn't corrected, and the gene model in GBK format ended up with internal stops, etc.

DRL commented 8 years ago

Yes, I think that is because it does not start properly.

Entry in genome.tbl is:

                        locus_tag       GROS_g00189
<1      197     mRNA
253     378
429     586
689     827
960     1044
1104    1226
1267    1328
1437    1573
1636    1807
1930    2064
2110    2163
2423    2491
2572    2709
2767    2817
                        product hypothetical protein
                        protein_id      gnl|ncbi|GROS_g00189.t1
                        transcript_id   gnl|ncbi|GROS_g00189.t1_mrna
<1      197     CDS
253     378
429     586
689     827
960     1044
1104    1226
1267    1328
1437    1573
1636    1807
1930    2064
2110    2163
2423    2491
2572    2709
2767    2817
                        codon_start     3
                        product hypothetical protein
                        protein_id      gnl|ncbi|GROS_g00189.t1
                        transcript_id   gnl|ncbi|GROS_g00189.t1_mrna

Does "codon_start 3" refer to this? Could it be that the genome.tbl is actually printed correctly?
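
As far as I understand the NCBI feature table format, codon_start is the 1-based position, within the joined CDS, of the first base of the first complete codon, so it applies once to the whole CDS and not to each interval. Under that assumption, a GFF3 phase of 2 on the first CDS segment corresponds to codon_start 3. A small sketch with hypothetical helper names:

```python
def codon_start_from_phase(phase: int) -> int:
    """Map the GFF3 phase of the first CDS segment (0, 1, 2)
    to a feature-table codon_start value (1, 2, 3)."""
    if phase not in (0, 1, 2):
        raise ValueError(f"invalid phase: {phase!r}")
    return phase + 1


def trim_to_frame(cds_nt: str, codon_start: int) -> str:
    """Drop the leading partial codon of a joined CDS sequence and any
    trailing bases that do not form a complete codon."""
    seq = cds_nt[codon_start - 1:]
    return seq[: len(seq) - len(seq) % 3]
```

If that reading is right, the genome.tbl entry above is consistent with the GFF, and the garbled protein would come from the translation step ignoring the offset.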

The problem in this genome is that we are quite sure about some genes at the beginning and end of scaffolds, since we did manual annotations. So they are partial, but we still want them to show up.

cheers,

dom

nextgenusfs commented 8 years ago

Here are some examples of partial codons; however, I don't see any examples with multiple exons/CDSs. I wonder if codon_start 3 in your example gets interpreted as setting the codon start of every exon in the gene model to 3, when what you really want is for the first codon start to be 3 and the rest to be 1? I'm really not sure, it's just a guess and I'm trying to help. I know how frustrating submission is...

What if you try changing the gene model to start from position 3, i.e. so that its CDS phase in the GFF would be 0? If you run that through GAG, it might make the codon_start 1. I double-checked my tbl files that have passed NCBI and all of them have codon_start set to 1. Note that if you change the gene model, I think the rest of the phases will change as they are dependent on the previous phase, so perhaps run it through GenomeTools again before GAG?
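
To see how downstream phases depend on the first one, here is a rough sketch that recomputes the phase of every CDS segment from the segment lengths, assuming plus-strand segments listed in translation order and GFF3 phase semantics. `expected_phases` is a hypothetical helper; the coordinates are the CDS intervals of GROS_g00189.t1 from the GFF above.

```python
def expected_phases(cds_coords, first_phase=0):
    """Given (start, end) pairs in translation order, return the phase
    each segment should carry if the first segment has `first_phase`."""
    phases = []
    phase = first_phase
    for start, end in cds_coords:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after the last complete codon of this segment
        # determine how many bases the next segment must skip.
        phase = (3 - (length - phase) % 3) % 3
    return phases


GROS_G00189_CDS = [
    (1, 197), (253, 378), (429, 586), (689, 827), (960, 1044),
    (1104, 1226), (1267, 1328), (1437, 1573), (1636, 1807),
    (1930, 2064), (2110, 2163), (2423, 2491), (2572, 2709), (2767, 2817),
]

original = expected_phases(GROS_G00189_CDS, first_phase=2)
trimmed = expected_phases([(3, 197)] + GROS_G00189_CDS[1:], first_phase=0)
```

For this model, `original` reproduces the phases in the GFF, and `trimmed` differs only in the first segment. That suggests moving the start to position 3 would leave the downstream phases unchanged here, although re-validating the file afterwards is still a sensible precaution.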

DRL commented 8 years ago

Will try to fix the phasing in those genes and see what comes out at the other end when passing it through GAG.

Thanks for the help, will report back if it works.

However, it would be cool if this got solved at some point on your end, since there are not many pieces of software that can generate NCBI-submittable files (i.e. I really want your software to work for me), and if I hadn't checked my proteins I would have submitted garbage. At the very least, maybe list in the stats file how many proteins have insane numbers of stop codons.
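
Until something like that exists in the stats file, a check along these lines could be run on genome.proteins.fasta before submission. This is only a sketch with hypothetical function names, assuming stop codons are written as `*` and that a single trailing `*` is the legitimate terminal stop.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier, dropping any description text.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def internal_stop_report(lines, max_internal_stops=0):
    """Return {protein_id: count} for proteins whose number of internal
    stop codons exceeds `max_internal_stops`."""
    flagged = {}
    for name, seq in read_fasta(lines):
        count = seq.rstrip("*").count("*")
        if count > max_internal_stops:
            flagged[name] = count
    return flagged
```

Passing an open file handle for genome.proteins.fasta as `lines` would list every protein with at least one internal stop, which would have caught the mistranslated GROS_g00189.t1 above.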

Keep up the good work!

cheers,

dom

nextgenusfs commented 8 years ago

I'm just a GAG user like you dom! Just trying to help. Hopefully @bruab or one of the other GAG developers will see the thread and respond.

DRL commented 8 years ago

Alright! I thought you were one of the developers :)

Thanks for stepping in then!

tedsta commented 8 years ago

I'm one of the original developers, but it was part of a student job that is coming to a close very soon (I'm graduating this week :tada: ).

After going through the code, it does look like GAG just completely ignores the phase when it comes to protein translations. It used to check each possible phase for the most sane looking sequence (i.e. with start and stop codons or without internal stops). Not sure what happened to that code.
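
For readers wondering what "check each possible phase" could look like, here is an illustrative sketch. It is not GAG's actual code, old or patched; the function names are made up, and it assumes the standard genetic code and a CDS sequence that is already spliced and oriented 5' to 3'.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt, frame=0):
    """Translate `nt` starting at offset `frame`; unknown codons become X."""
    nt = nt.upper()
    codons = (nt[i:i + 3] for i in range(frame, len(nt) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def internal_stops(protein):
    """Count stop codons other than one in the final position."""
    return protein[:-1].count("*")


def best_frame(nt):
    """Return (frame, protein) for the frame with the fewest internal
    stops; ties go to the lowest frame."""
    candidates = [(frame, translate(nt, frame)) for frame in range(3)]
    return min(candidates, key=lambda fp: internal_stops(fp[1]))
```

Guessing the frame this way is a heuristic. When the GFF carries a phase, using it directly is probably the more reliable choice, with the frame search kept as a sanity check.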

I'll try to get this resolved when I go in on Friday.

DRL commented 8 years ago

Congratulations, man! Well done! :)

Thanks for looking into this!

cheers,

dom

tedsta commented 8 years ago

Hey @DRL let me know if the patch I just pushed fixes your issue. Thanks!

genomeannotation / GAG

Problems with proteins generated from GFF file #165