Feature positions on the negative strand are not calculated correctly

thobrock commented 8 years ago

This issues seems to be related to this previously closed one: https://github.com/genomeannotation/GAG/issues/165

Several protein sequences reported in the genome.proteins.fasta are not correct and have a frameshift sometimes 2 and sometimes 3). I checked randomly some sequences and it seems the incorrect genes are only located on the negative strand.

I'm wondering, if this problem only occurs during the producing of the protein file or if there are also errors in the final tbl file.

Here are two examples, I hope this helps to investigate the problem:

Example 1

GFF:

chr07   SPP2 gene    6704623 6710341 .       -       .       ID=GENE_g22032;Name=XXXXX
chr07   SPP2 mRNA    6704623 6710341 .       -       .       ID=GENE_g22032.t1;Parent=GENE_g22032;product=XXXXX
770
chr07   SPP2 exon    6704623 6704842 .       -       .       ID=GENE_g22032.t1.exon17;Parent=GENE_g22032.t1
chr07   SPP2 exon    6704933 6705074 .       -       .       ID=GENE_g22032.t1.exon16;Parent=GENE_g22032.t1
chr07   SPP2 exon    6705239 6705343 .       -       .       ID=GENE_g22032.t1.exon15;Parent=GENE_g22032.t1
chr07   SPP2 exon    6705451 6705525 .       -       .       ID=GENE_g22032.t1.exon14;Parent=GENE_g22032.t1
chr07   SPP2 exon    6705607 6705668 .       -       .       ID=GENE_g22032.t1.exon13;Parent=GENE_g22032.t1
chr07   SPP2 exon    6705855 6705939 .       -       .       ID=GENE_g22032.t1.exon12;Parent=GENE_g22032.t1
chr07   SPP2 exon    6706710 6706793 .       -       .       ID=GENE_g22032.t1.exon11;Parent=GENE_g22032.t1
chr07   SPP2 exon    6706921 6706977 .       -       .       ID=GENE_g22032.t1.exon10;Parent=GENE_g22032.t1
chr07   SPP2 exon    6707303 6707473 .       -       .       ID=GENE_g22032.t1.exon9;Parent=GENE_g22032.t1
chr07   SPP2 exon    6707588 6707679 .       -       .       ID=GENE_g22032.t1.exon8;Parent=GENE_g22032.t1
chr07   SPP2 exon    6707772 6707826 .       -       .       ID=GENE_g22032.t1.exon7;Parent=GENE_g22032.t1
chr07   SPP2 exon    6708088 6708181 .       -       .       ID=GENE_g22032.t1.exon6;Parent=GENE_g22032.t1
chr07   SPP2 exon    6708746 6708839 .       -       .       ID=GENE_g22032.t1.exon5;Parent=GENE_g22032.t1
chr07   SPP2 exon    6708917 6709004 .       -       .       ID=GENE_g22032.t1.exon4;Parent=GENE_g22032.t1
chr07   SPP2 exon    6709195 6709320 .       -       .       ID=GENE_g22032.t1.exon3;Parent=GENE_g22032.t1
chr07   SPP2 exon    6709457 6709549 .       -       .       ID=GENE_g22032.t1.exon2;Parent=GENE_g22032.t1
chr07   SPP2 exon    6709767 6710341 .       -       .       ID=GENE_g22032.t1.exon1;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6704769 6704842 .       -       2       ID=GENE_g22032.t1.cds17;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6704933 6705074 .       -       0       ID=GENE_g22032.t1.cds16;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6705239 6705343 .       -       0       ID=GENE_g22032.t1.cds15;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6705451 6705525 .       -       0       ID=GENE_g22032.t1.cds14;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6705607 6705668 .       -       2       ID=GENE_g22032.t1.cds13;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6705855 6705939 .       -       0       ID=GENE_g22032.t1.cds12;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6706710 6706793 .       -       0       ID=GENE_g22032.t1.cds11;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6706921 6706977 .       -       0       ID=GENE_g22032.t1.cds10;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6707303 6707473 .       -       0       ID=GENE_g22032.t1.cds9;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6707588 6707679 .       -       2       ID=GENE_g22032.t1.cds8;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6707772 6707826 .       -       0       ID=GENE_g22032.t1.cds7;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6708088 6708181 .       -       1       ID=GENE_g22032.t1.cds6;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6708746 6708839 .       -       2       ID=GENE_g22032.t1.cds5;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6708917 6709004 .       -       0       ID=GENE_g22032.t1.cds4;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6709195 6709320 .       -       0       ID=GENE_g22032.t1.cds3;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6709457 6709549 .       -       0       ID=GENE_g22032.t1.cds2;Parent=GENE_g22032.t1
chr07   SPP2 CDS     6709767 6710327 .       -       0       ID=GENE_g22032.t1.cds1;Parent=GENE_g22032.t1

Incorrect protein sequence:

>GENE_g22032.t1 protein
GGNVSFSKLLLPQRRTDKSWKLCKQPQFFRLNLYPILVEI**TIW*TT*KDSPA*SANATD*VP*KSRYKWSPS*NGPYE*SHKEKSPVCKWC*DSQWFHTKN*WRKYS*ASCCSGPCEETKI*RTSAR*RSKGFTFR*RFQLG**ELQLSPKKYRCVVVCPFITGSCILGQCKMDICGRLH*R*TAWSYFYQTWAVILY*IRSISSRIC*RAC*VAGQSTCLLPEESKRVY*KGAGSSS*YFV*RV*RSTHCSS*SWSGASGYLT*W*ESCCKSSETRPKETI*H*SEKFEVDCGVLSEK*NSRRSNSGLDRNI*RMC*DTV*GNRLYK*REEC**ISSRLSQYKVGPSASGILGLYSNEGFDSGICPRDKDQSIRSNRYTGL*PIPNCFTCH*SLLNSDTEDWFFPC*STSGKSCN*H*RSFNLL*LWYDGRHQEFHKREIIGSFLCCL*KGREKGYARAY*SGSTSAYRRHVSGEEIYSVFLG*LVESKTRSTADFVCNW*GFVCYSN*STIPFSVNFYICDQGIFHT*RNRLYIGS*FLLP*DCSTICTGALGLETETAQWSSACSGNKETSR*CSDIHYFHALQSSADRRDR*TA*IWRLETQS*GP*V*ESS*KSYNSSNGNHVHSYRGHPPKSWCHI*QSRKSTICKWIFYWCRCFSYSLY*VHATSEQT**I*ENDL

Correct CDS (produced with a different tool):

ATGGCGGCAATGTTAGCTTCTCAAAGCTGTTATTGCCGCAGAGGAGAACTGATAAATCATGGAAGCTATGTAAACAACCTCAGTTCTTCAGGCTCAATCTCTATCCTATTCTTGTCGAAATTTGATAGACAATATGGTAAACCACCTAGAAAGATTCACCGGCTTAAAGTGCAAATGCAACAGACTGAGTCCCCTGAAAAAGTAGGTATAAATGGTCGCCCAGTTAAAATGGTCCCTACGAGTGAAGTCACAAAGAGAAAAGTCCAGTCTGTAAATGGTGCTGAGATAGTCAATGGTTCCACACAAAGAATTAATGGCGCAAGTATAGTTAAGCGAGTTGCTGCTCCGGCCCCTGTGAAGAAACAAAAATCTGAAGAACTTCCGCCCGTTGAAGGTCTAAAGGTTTTACCTTCAGATGAAGGTTTCAGCTGGGCTAATGAGAACTACAACTCAGTCCAAAGAAGTATAGATGTGTGGTCGTTTGTCCTTTCATTACGGGTTCGTGTATTCTTGGACAATGCAAAATGGACATATGTGGGAGGCTTCACTGAAGATAAACAGCTTGGTCCTACTTTTATCAAACTTGGGCAGTTATCCTCTACTAGATCAGATCTATTTCCTCGAGAATTTGTTGACGAGCTTGCTAAGTTGCAGGACAGAGTACCTGCCTTCTCCCCGAAGAAAGCAAAAGAGTTTATTAAAAAGGAGCTGGGAGTTCCAGTTGATATTTTGTATAAAGAGTTTGAAGATCAACCCATTGCAGCAGCTAGTCTTGGTCAGGTGCATCGGGCTATCTTACATAATGGTGAGAAAGTTGTTGTAAAAGTTCAGAGACCCGGCCTAAAGAAACTATTTGACATTGATCTGAAAAATTTGAAGTTGATTGCGGAGTACTTTCAGAAAAGTGAAACTCTCGGCGGTCCAACTCGGGACTGGATAGGAATATATGAAGAATGTGCTAAGATACTGTATGAGGAAATCGATTATATAAATGAAGGGAAGAATGCTGATAGATTTCGTCGAGACTTTCGCAATATAAAGTGGGTCCGAGTGCCTCTGGTATACTGGGACTATACAGCAACGAAGGTTTTGACTCTGGAATATGTCCCAGGGATAAAGATCAATCAATTAGATCTAATAGGTACACGGGGTTATAGCCGATCCCGAATTGCTTCACGTGCCATTGAAGCCTACTTAATTCAGATACTGAGGACTGGTTTTTTCCATGCTGATCCACATCCGGGAAATCTTGCAATTGACACTGACGAAGCTTTAATCTATTATGACTTTGGTATGATGGGAGACATCAAGAGTTTCACAAGAGAGAGATTATTGGGTCTTTTTTATGCTGTTTATGAAAAGGACGCGAAAAAGGTTATGCAAGGGCTTATTGATCTGGGAGCACTTCAGCCTACAGGAGACATGTCAGCGGTGAGGAGATCTATTCAGTTTTTCTTGGATAACTTGTTGAATCAAAGACCAGATCAACAGCAGACTTTGTCTGCAATTGGTGAGGATTTGTTTGCTATAGCAACTGATCAACCATTCCGTTTTCCGTCAACTTTTACATTTGTGATCAGGGCATTTTCCACACTTGAAGGAATCGGCTATATATTGGATCCTGATTTCTCCTTCCCTAAGATTGCAGCACCATATGCACAGGAGCTCTTGGACTTGAGACAGAAACAGCGCAGTGGTCCTCAGCTTGTTCAGGAAATAAGGAAACAAGCCGATGATGCTCGGACATACACTATTTCCATGCCTTACAGAGTTCAGCGGATAGAAGAGATCGTTAAACAGCTTGAATCTGGAGACTTGAAACTCAGAGTTAGGGTCCTTGAGTCTGAGAGAGCAGCTAGAAAAGCTACAATTCTTCAAATGGCAACCATGTACACAGTTATAGGGGGCACCCTCCTAAATCTTGGTGTCACATTTAGCAATCAAGGAAGTCAACTATTTGCAAATGGATCTTTTATTGGTGCAGGTGTTTTTCTTACTCTCTTTATTAGGTCCATGCAACGAGTGAACAAACTTGATAAATTTGAGAAAATGATTTAA

Example 2

GFF:

chr07   SPP2 gene    45754474        45761252        .       -       .       ID=GENE_g18130;Name=XXXXX
chr07   SPP2 mRNA    45754474        45761252        .       -       .       ID=GENE_g18130.t1;Parent=GENE_g18130;product=XXXXX
chr07   SPP2 exon    45754474        45754640        .       -       .       ID=GENE_g18130.t1.exon1;Parent=GENE_g18130.t1
chr07   SPP2 exon    45754728        45754786        .       -       .       ID=GENE_g18130.t1.exon2;Parent=GENE_g18130.t1
chr07   SPP2 exon    45755000        45755101        .       -       .       ID=GENE_g18130.t1.exon3;Parent=GENE_g18130.t1
chr07   SPP2 exon    45755222        45755301        .       -       .       ID=GENE_g18130.t1.exon4;Parent=GENE_g18130.t1
chr07   SPP2 exon    45756691        45756775        .       -       .       ID=GENE_g18130.t1.exon5;Parent=GENE_g18130.t1
chr07   SPP2 exon    45756977        45757060        .       -       .       ID=GENE_g18130.t1.exon6;Parent=GENE_g18130.t1
chr07   SPP2 exon    45757399        45757484        .       -       .       ID=GENE_g18130.t1.exon7;Parent=GENE_g18130.t1
chr07   SPP2 exon    45757804        45757877        .       -       .       ID=GENE_g18130.t1.exon8;Parent=GENE_g18130.t1
chr07   SPP2 exon    45758590        45758674        .       -       .       ID=GENE_g18130.t1.exon9;Parent=GENE_g18130.t1
chr07   SPP2 exon    45761099        45761252        .       -       .       ID=GENE_g18130.t1.exon10;Parent=GENE_g18130.t1
chr07   SPP2 CDS     45754535        45754640        .       -       1       ID=GENE_g18130.t1.cds1;Parent=GENE_g18130.t1
chr07   SPP2 CDS     45754728        45754786        .       -       0       ID=GENE_g18130.t1.cds2;Parent=GENE_g18130.t1
chr07   SPP2 CDS     45755000        45755101        .       -       0       ID=GENE_g18130.t1.cds3;Parent=GENE_g18130.t1
chr07   SPP2 CDS     45755222        45755301        .       -       2       ID=GENE_g18130.t1.cds4;Parent=GENE_g18130.t1
chr07   SPP2 CDS     45756691        45756775        .       -       0       ID=GENE_g18130.t1.cds5;Parent=GENE_g18130.t1
chr07   SPP2 CDS     45756977        45757060        .       -       0       ID=GENE_g18130.t1.cds6;Parent=GENE_g18130.t1
chr07   SPP2 CDS     45757399        45757484        .       -       2       ID=GENE_g18130.t1.cds7;Parent=GENE_g18130.t1
chr07   SPP2 CDS     45757804        45757877        .       -       1       ID=GENE_g18130.t1.cds8;Parent=GENE_g18130.t1
chr07   SPP2 CDS     45758590        45758672        .       -       0       ID=GENE_g18130.t1.cds9;Parent=GENE_g18130.t1

Incorrect protein sequence:

>GENE_g18130.t1 protein
*LILLYHGISCL*I**NKRRGNQVLKGRKGHM*FLYPQEVSILLLTCTCAVLSWQEMH*LQKGFA*LEVICHQ*MMHIRRRVLYLLSIVLQCAN*LVKAQNSL*QIPGRQPRIAINEH*QFSPE*SPLSVVEVWHPVKPSWSCLCVVLIC*NLSAHLEFGYLSRSGPYVETLAWFVSGEVVRMLKRSLRVMIF*TNIRKISKLWMK*CLMESVQRD*GTASRKGYQ*ST*QPMK*LII*NNITFIEGNVLTT

Correct CDS (produced with a different tool):

ATGACTGATATTGCTTTACCATGGGATAAGTTGTCTTTAGATTTAATAAAACAAGAGGAGGGGCAATCAAGTGCTGAAAGGAAGAAAAGGACATATGTAGTTCTTGTATCCACAGGAAGTTTCAATCCTCCTACTTACATGCACTTGCGCTGTTTTGAGTTGGCAAGAGATGCATTGACTTCAGAAGGGCTTTGCGTAATTGGAGGTTATATGTCACCAGTAAATGATGCATATAAGAAGAAGGGTCTTATATCTGCTGAGCATCGTGTTGCAATGTGCCAATTAGCTTGTAAAAGCTCAGAATTCGTTATGACAGATCCCTGGGAGGCAACCCAGGATAGCTATCAACGAACATTGACAGTTCTCTCCAGAATAAAGTCCGCTCTCTGTGGTGGAAGTTTGGCATCCAGTGAAGCCCTCATGGTCATGCTTGTGTGTGGTTCTGATCTGCTAGAATCTTTCAGCACACCTGGAGTTTGGATACCTGAGCAGGTCAGGACCATATGTAGAGACTTTGGCTTGGTTTGTGTCCGGCGAGGTGGTCAGGATGTTGAAAAGATCATTGCGGGTGATGATATTTTGAACGAATATAAGAAAAATATCAAAGTTGTGGATGAAGTAGTGCCTAATGGAATCAGTTCAACGGGATTAAGGGACTGCATCTCGAAAGGGTTATCAGTGAAGTACTTGACAGCCGATGAAGTAATTGATTATATGAAACAACATAACCTTTATAGAGGGCAATGTTCTAACAACTGA

nextgenusfs commented 8 years ago

I'm just another user and not a developer, but if there are errors in the .tbl file they should be found by the discrepancy report produced by tbl2asn when converting to sqn/gbk flatfile.

smg283 commented 8 years ago

We will take a look on Monday. I think this only impacts the printing of the .pep file and not the .tbl but will track it down. scott

On Wed, Aug 3, 2016 at 2:22 AM, Jon Palmer notifications@github.com wrote:

I'm just another user and not a developer, but if there are errors in the .tbl file they should be found by the discrepancy report produced by tbl2asn when converting to sqn/gbk flatfile.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/genomeannotation/GAG/issues/166#issuecomment-237221944, or mute the thread https://github.com/notifications/unsubscribe-auth/AFgKekzHuOMat1t1p6Udj0WvpMMAbplMks5qcIfqgaJpZM4JbZKx .

thobrock commented 8 years ago

Thanks for investigating this problem.

tedsta commented 8 years ago

It looks like GAG was pulling the sequences in increasing order of coordinates regardless of the strand. We made it so it pulls sequences in decreasing order of coordinates for the '-' strand. Could you check to see if that fixed your problem? If not, could you provide a small subset of your data?

thobrock commented 8 years ago

Thanks for the update. But it seems it introduced an additional error (without using your last commit, this error doesn't appear).

Command: python gag.py --fasta Athaliana_167_TAIR9.fa --gff Athaliana_167_TAIR10.gene_exons.gff3 --fix_terminal_ns --fix_start_stop --remove_introns_shorter_than 10 --out outputFolder.GAG

Error message:

Traceback (most recent call last):
  File "/data4/Programs/GAG/gag.py", line 45, in <module>
    main()
  File "/data4/Programs/GAG/gag.py", line 42, in main
    controller.execute(args)
  File "/data4/Programs/GAG/src/controller.py", line 50, in execute
    self.stats_mgr.update_ref(seq.stats())
  File "/data4/Programs/GAG/src/sequence.py", line 486, in stats
    stats["Shortest intron"] = int(self.get_shortest_intron())
  File "/data4/Programs/GAG/src/sequence.py", line 383, in get_shortest_intron
    length = gene.get_shortest_intron()
  File "/data4/Programs/GAG/src/gene.py", line 192, in get_shortest_intron
    length = mrna.get_shortest_intron()
  File "/data4/Programs/GAG/src/xrna.py", line 315, in get_shortest_intron
    raise Exception("Intron with negative length on "+self.name)
AttributeError: XRNA instance has no attribute 'name'

(I used the files 'Athaliana_167_TAIR9.fa' and 'Athaliana_167_TAIR10.gene_exons.gff3' downloaded from Phytozome11 for testing.)

thobrock commented 8 years ago

The latest commit doesn't work - the script crashed before producing the files. So I still used the version without the latest commit, because I also can produce the protein sequence with a different script.

But I also realized that the produced GFF file contains wrong positions of the detected Start- and Stop-codons (using -fix_start_stop). Maybe this error is also related to this issue, because it seems it only happens on the negative strand. Does this also influences the final tba file?

PS1: Thanks for creating this script. And I hope this problem can be fixed that I can use it for my files.

PS2: @tedsta Do you still need a subset or example files to investigate this problem?

fbremer commented 8 years ago

I'll take a look at this today. Example files would certainly help. Thank you for the report

fbremer commented 8 years ago

Alright, I fixed the error in get_shortest_intron which wasn't handling indices ordered for the '-' strand. @bioinf-ice, please run with your data and let me know if you are getting expected output.

Thank you for bringing this to our attention

thobrock commented 8 years ago

@Woods26 Thanks for your response. But I can't see a new commit. Did you forget to push your edits to github?

fbremer commented 8 years ago

Sorry, I forgot to push. Should be visible now: 0baef1e

P.S. looks like some of the internals are assuming order, so tomorrow I will explore an alternate method of solving #165

let me know what your testing reveals if you get a chance.

thobrock commented 8 years ago

Sorry to say this, but it still doesn't solve the problem. Several protein sequences seem not correct and also the statistics for the CDS changed drastically (in genome.stats). ~~I will try to provide a small test file soon.~~

Edit: Here is an full example using two transcipts from the Athaliana_167_TAIR10 from Phytozome11. This example produces incorrect protein sequences and also the GAG script does not detect the stop codon at the end and therefor it reports them incorrect in the gff file.

Input files: example.gff3.zip example_genome.fa.zip

Output files produced by GAG: GAG_output.zip

Corrrect protein sequences: genome.proteins.correct.fasta.zip

I used this command to run the GAG script: python gag.py --fasta Athaliana_167_TAIR9.fa --gff Athaliana_167_TAIR10.gene_exons.gff3 --fix_terminal_ns --fix_start_stop --remove_introns_shorter_than 10 --out GAG

I hope this helps to investigate the problem further.

And thanks for your time for improving and correcting this tool. I really appreciate it.

PS: I hope zip files are OK. But otherwise github doesn't accept the original file format.

fbremer commented 8 years ago

zips should be fine. Thank you for these resources

thobrock commented 8 years ago

Does the most recent commit https://github.com/genomeannotation/GAG/commit/98da78e2a531ab4711e335e3e313a701c134370d works as expected? Sorry, but I hadn't time to test it by myself so far.

The previous version also had some errors regarding detecting the correct position of the start/stop codons and the correct length of introns (to remove the short introns). This errors were introduced in the processed GFF file and therefor still present in the final tbl file.

In the end, I used my own scripts to perform this steps to get a working tbl file for submission.

PS: I edited the title from 'Protein sequences are not correct' to 'Feature positions on the negative strand are not calculated correctly', because I think all this errors are related and the title is more suitable.

fbremer commented 8 years ago

No, not working yet. Thank you, that seems like a good name change.

nextgenusfs commented 7 years ago

Hi all, The protein translation is still not right on the reverse strand, here is an example I just ran with the development branch. The positions in the tbl file are all correct, but the translation is wrong.

GAG genome.gff file

##gff-version 3
chr1    EVM gene    3964916 3966867 .   -   .   ID=AFUI_01436
chr1    EVM mRNA    3964916 3966867 .   -   .   ID=AFUI_01436-T1;Parent=AFUI_01436
chr1    EVM exon    3964916 3965828 .   -   .   ID=evm.model.unplaced_39.51.exon4;Parent=AFUI_01436-T1
chr1    EVM exon    3965895 3966091 .   -   .   ID=evm.model.unplaced_39.51.exon3;Parent=AFUI_01436-T1
chr1    EVM exon    3966145 3966588 .   -   .   ID=evm.model.unplaced_39.51.exon2;Parent=AFUI_01436-T1
chr1    EVM exon    3966649 3966867 .   -   .   ID=evm.model.unplaced_39.51.exon1;Parent=AFUI_01436-T1
chr1    EVM CDS 3964916 3965828 .   -   1   ID=cds.evm.model.unplaced_39.51;Parent=AFUI_01436-T1
chr1    EVM CDS 3965895 3966091 .   -   0   ID=cds.evm.model.unplaced_39.51;Parent=AFUI_01436-T1
chr1    EVM CDS 3966145 3966588 .   -   0   ID=cds.evm.model.unplaced_39.51;Parent=AFUI_01436-T1
chr1    EVM CDS 3966649 3966867 .   -   0   ID=cds.evm.model.unplaced_39.51;Parent=AFUI_01436-T1
chr1    EVM start_codon 3966865 3966867 .   -   .   ID=AFUI_01436-T1:start;Parent=AFUI_01436-T1
chr1    EVM stop_codon  3964916 3964918 .   -   .   ID=AFUI_01436-T1:stop;Parent=AFUI_01436-T1
chr1    EVM gene    3967096 3969745 .   +   .   ID=AFUI_01437
chr1    EVM mRNA    3967096 3969745 .   +   .   ID=AFUI_01437-T1;Parent=AFUI_01437
chr1    EVM exon    3967096 3968205 .   +   .   ID=evm.model.unplaced_39.52.exon1;Parent=AFUI_01437-T1
chr1    EVM exon    3968258 3969745 .   +   .   ID=evm.model.unplaced_39.52.exon2;Parent=AFUI_01437-T1
chr1    EVM CDS 3967096 3968205 .   +   0   ID=cds.evm.model.unplaced_39.52;Parent=AFUI_01437-T1
chr1    EVM CDS 3968258 3969745 .   +   0   ID=cds.evm.model.unplaced_39.52;Parent=AFUI_01437-T1
chr1    EVM start_codon 3967096 3967098 .   +   .   ID=AFUI_01437-T1:start;Parent=AFUI_01437-T1
chr1    EVM stop_codon  3969743 3969745 .   +   .   ID=AFUI_01437-T1:stop;Parent=AFUI_01437-T1

Proteins output by GAG

>AFUI_01436-T1 protein
**IP*KDAPREASCAAVSEGVARCTDQRGRREQQATPASMRQISQTTPRRSSPSFYIATIGF*H*YPTIGQPWLFSQHQYALGGAKQCSHVNSGRSPAPGLKPCSRPFQSSQILEPSIQPLNRTMFALSTNLLTSCTSSSSRTANPTSFKTLIAFIFSPK*PRASARAWTSGRLFAMPSNCLARSTKSSRWDTERICRFLRSRHSWKWKVTKREYRKLSKETKSLKPARSAKERRSSWRCSERRQLAVAGLWHPGPHPILYTLLHHVLLLPKHTILMRRRRRSHLRNRCPHAEKACNLAKNRRPQISTRKFVVIWAPKLMNRVHW*LPKFRRRQPRGSPPLGPHYPPTESLSISLLLKQFLPNSPVKVP*SRSR*RATSSSVSPTHLSRKSSWTCWPILHMAHNSAPIPMLTRPFSPAPQPFS*KT*PSVSQPTTRLASFVGALLVRGQKMLISSPSRSQCG*IRAPTPPL*QSSMSSRARIR*ETLLFPYLLAQQSLQSLASTLSMRSLGIVWTGTSVPWTRPTLQEALNLNLQAMETKTSSSR*MCASPKSVHSLR*MSPMSLYSRWKGRAQDSQKM*EALRRVTLSN
>AFUI_01437-T1 protein
MATASSLAFTNMEVPGLDDTMEMASPYQGHVDDFDIDLDVMEDQASNADKDMTAADDYPDTLHDEDINQDGPRDADMIDDVAEPTMVDADDQYQEASHNVEMQYGIEESHEEEMLEDEYVDDADTAAPEYQDVQVCDNTDKPGTEEAGSAEYPVAGDHAQEPAVGAHLEAKPKYHDGNGHELQGALDQRNDDADQPPDTHQPEIAVDTINSTNETDQVDSALKPLDAAKTRPDPGHPEGSDDYNEGSDEHASQISQPVEEHGAAGDEGTDEQAAQDLGLQGPEQLSEHEPDTTGQVALHPVKVYYQENEISLFPPHEGDSSETFFLEDESLAYEPLGKLFESCREVLYEHVGENETLVLDIESLNLQLMEDSTHISKVTLHQIIDVYLRLCHNDGIESPEPLYLTLSTKSTISAEVSNLLLAASEGKGLSEINLWDDYQEVDEQLGEVSEAHDEGAHDEEEEEEEEEEEEEEEEEEEEEEEEEEKEEAQGQEEEPISPNRPMTNQNVDARELQAASDEDESHHDIQDNGEAFKAQELVGNDTASLAKRSSSVAPATDVQEDELNEEGEVEDANDPEFDKGTASNEEAAHEESNDSEEQQTDTTVTMTHLPGADMNDELENKDGSASPVHNDQTDRDLGNEHSEIDKSGEENLERTGPVELEVPEGPGVSDDERQDSGSVKPVVGDSRTEQSLAGHGGPSIDDAVDNVEEHSQDLEESSTAGDDDYLHKDSESVPTLQVSERQERDTPDPANDTLGFVEDLFESPSKGSKNKDGNLDDTSEFGKNHDEFEDDTPAEILRQHDDELPLDDEEYIDLGFADGPDTVDESPVSGLRSLDNRASKRIRDPEDEFDIAENASPDLKRSRSS*

Genome tbl file

3966867 3964916 gene
            locus_tag   AFUI_01436
3966867 3966649 mRNA
3966588 3966145
3966091 3965895
3965828 3964916
            product hypothetical protein
            protein_id  gnl|ncbi|AFUI_01436-T1
            transcript_id   gnl|ncbi|AFUI_01436-T1_mrna
3966867 3966649 CDS
3966588 3966145
3966091 3965895
3965828 3964916
            codon_start 1
            product hypothetical protein
            protein_id  gnl|ncbi|AFUI_01436-T1
            transcript_id   gnl|ncbi|AFUI_01436-T1_mrna
3967096 3969745 gene
            locus_tag   AFUI_01437
3967096 3968205 mRNA
3968258 3969745
            product hypothetical protein
            protein_id  gnl|ncbi|AFUI_01437-T1
            transcript_id   gnl|ncbi|AFUI_01437-T1_mrna
3967096 3968205 CDS
3968258 3969745
            codon_start 1
            product hypothetical protein
            protein_id  gnl|ncbi|AFUI_01437-T1
            transcript_id   gnl|ncbi|AFUI_01437-T1_mrna

Converted to GBK via tbl2asn

     gene            complement(3964916..3966867)
                     /locus_tag="AFUI_01436"
     mRNA            complement(join(3964916..3965828,3965895..3966091,
                     3966145..3966588,3966649..3966867))
                     /locus_tag="AFUI_01436"
                     /product="hypothetical protein"
     CDS             complement(join(3964916..3965828,3965895..3966091,
                     3966145..3966588,3966649..3966867))
                     /locus_tag="AFUI_01436"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="ncbi:AFUI_01436-T1"
                     /translation="MIDSVEGRTQRSFLCCGKRGGSAVHGPTGTKRTASDSGFNEADI
                     PNNPQKKQPFLLHSNHRFLTLIPNNRATMVVLAASICTRGGKAVLSRQFREIARSRIE
                     ALLASFPKLADSGTQHTTVEQDNVRFVYQPLDELYIVLITNRQSNILQDIDSLHLFAQ
                     VTTSICKSLDEREIVRNAFELLSAFDEIVTLGYRENLSLSQIKTFLEMESHEERIQEI
                     IERNKELEASEERKRKAKQLEMQRKEAARSGRSMAPRAPSYPVYTPPSRPAAPETYDT
                     YEAEKKKSFAKPLPTRGKGMQLGKKSKTTDIYEKVRGDLGPEVDESSPLVTPQVPTPA
                     AERVPSARASLSADREPVHITIAETISAKLTREGALKSFEVKGDLQLRITDPSFTKIK
                     LDLLANPTHGAQFRTHPNVDKAVFTSSSAIQLKDLTKRFPANNSIGVLRWRVASSGSE
                     NADILPITFTVWVNKGSDSTTVTIEYELTGSDTLRDVVVSIPFGATEPTVSSFDAVYE
                     VSGDSLDWNIGTVDETNASGSFEFESAGDGDENEFFPMNVRFSKVSPFVEVDVTNVSL
                     LEMEGESTGFSKDVRSIAEGYVIE"
     gene            3967096..3969745
                     /locus_tag="AFUI_01437"
     mRNA            join(3967096..3968205,3968258..3969745)
                     /locus_tag="AFUI_01437"
                     /product="hypothetical protein"
     CDS             join(3967096..3968205,3968258..3969745)
                     /locus_tag="AFUI_01437"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="ncbi:AFUI_01437-T1"
                     /translation="MATASSLAFTNMEVPGLDDTMEMASPYQGHVDDFDIDLDVMEDQ
                     ASNADKDMTAADDYPDTLHDEDINQDGPRDADMIDDVAEPTMVDADDQYQEASHNVEM
                     QYGIEESHEEEMLEDEYVDDADTAAPEYQDVQVCDNTDKPGTEEAGSAEYPVAGDHAQ
                     EPAVGAHLEAKPKYHDGNGHELQGALDQRNDDADQPPDTHQPEIAVDTINSTNETDQV
                     DSALKPLDAAKTRPDPGHPEGSDDYNEGSDEHASQISQPVEEHGAAGDEGTDEQAAQD
                     LGLQGPEQLSEHEPDTTGQVALHPVKVYYQENEISLFPPHEGDSSETFFLEDESLAYE
                     PLGKLFESCREVLYEHVGENETLVLDIESLNLQLMEDSTHISKVTLHQIIDVYLRLCH
                     NDGIESPEPLYLTLSTKSTISAEVSNLLLAASEGKGLSEINLWDDYQEVDEQLGEVSE
                     AHDEGAHDEEEEEEEEEEEEEEEEEEEEEEEEEEKEEAQGQEEEPISPNRPMTNQNVD
                     ARELQAASDEDESHHDIQDNGEAFKAQELVGNDTASLAKRSSSVAPATDVQEDELNEE
                     GEVEDANDPEFDKGTASNEEAAHEESNDSEEQQTDTTVTMTHLPGADMNDELENKDGS
                     ASPVHNDQTDRDLGNEHSEIDKSGEENLERTGPVELEVPEGPGVSDDERQDSGSVKPV
                     VGDSRTEQSLAGHGGPSIDDAVDNVEEHSQDLEESSTAGDDDYLHKDSESVPTLQVSE
                     RQERDTPDPANDTLGFVEDLFESPSKGSKNKDGNLDDTSEFGKNHDEFEDDTPAEILR
                     QHDDELPLDDEEYIDLGFADGPDTVDESPVSGLRSLDNRASKRIRDPEDEFDIAENAS
                     PDLKRSRSS"

fbremer commented 7 years ago

Alright, so I discovered that on the "-" strand the phase from the wrong cds was being applied before translation. This resulted in the wrong open reading frame during translation in some cases. Should work now.

please reopen if still having this problem.

genomeannotation / GAG

Feature positions on the negative strand are not calculated correctly #166

Example 1

Example 2