Closed thobrock closed 7 years ago
I'm just another user and not a developer, but if there are errors in the .tbl file they should be found by the discrepancy report produced by tbl2asn when converting to sqn/gbk flatfile.
We will take a look on Monday. I think this only impacts the printing of the .pep file and not the .tbl but will track it down. scott
On Wed, Aug 3, 2016 at 2:22 AM, Jon Palmer notifications@github.com wrote:
I'm just another user and not a developer, but if there are errors in the .tbl file they should be found by the discrepancy report produced by tbl2asn when converting to sqn/gbk flatfile.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/genomeannotation/GAG/issues/166#issuecomment-237221944, or mute the thread https://github.com/notifications/unsubscribe-auth/AFgKekzHuOMat1t1p6Udj0WvpMMAbplMks5qcIfqgaJpZM4JbZKx .
Thanks for investigating this problem.
It looks like GAG was pulling the sequences in increasing order of coordinates regardless of the strand. We made it so it pulls sequences in decreasing order of coordinates for the '-' strand. Could you check to see if that fixed your problem? If not, could you provide a small subset of your data?
Thanks for the update. But it seems it introduced an additional error (without using your last commit, this error doesn't appear).
Command:
python gag.py --fasta Athaliana_167_TAIR9.fa --gff Athaliana_167_TAIR10.gene_exons.gff3 --fix_terminal_ns --fix_start_stop --remove_introns_shorter_than 10 --out outputFolder.GAG
Error message:
Traceback (most recent call last):
File "/data4/Programs/GAG/gag.py", line 45, in <module>
main()
File "/data4/Programs/GAG/gag.py", line 42, in main
controller.execute(args)
File "/data4/Programs/GAG/src/controller.py", line 50, in execute
self.stats_mgr.update_ref(seq.stats())
File "/data4/Programs/GAG/src/sequence.py", line 486, in stats
stats["Shortest intron"] = int(self.get_shortest_intron())
File "/data4/Programs/GAG/src/sequence.py", line 383, in get_shortest_intron
length = gene.get_shortest_intron()
File "/data4/Programs/GAG/src/gene.py", line 192, in get_shortest_intron
length = mrna.get_shortest_intron()
File "/data4/Programs/GAG/src/xrna.py", line 315, in get_shortest_intron
raise Exception("Intron with negative length on "+self.name)
AttributeError: XRNA instance has no attribute 'name'
(I used the files 'Athaliana_167_TAIR9.fa' and 'Athaliana_167_TAIR10.gene_exons.gff3' downloaded from Phytozome11 for testing.)
The latest commit doesn't work - the script crashed before producing the files. So I still used the version without the latest commit, because I also can produce the protein sequence with a different script.
But I also realized that the produced GFF file contains wrong positions of the detected Start- and Stop-codons (using -fix_start_stop). Maybe this error is also related to this issue, because it seems it only happens on the negative strand. Does this also influences the final tba file?
PS1: Thanks for creating this script. And I hope this problem can be fixed that I can use it for my files.
PS2: @tedsta Do you still need a subset or example files to investigate this problem?
I'll take a look at this today. Example files would certainly help. Thank you for the report
Alright, I fixed the error in get_shortest_intron which wasn't handling indices ordered for the '-' strand. @bioinf-ice, please run with your data and let me know if you are getting expected output.
Thank you for bringing this to our attention
@Woods26 Thanks for your response. But I can't see a new commit. Did you forget to push your edits to github?
Sorry, I forgot to push. Should be visible now: 0baef1e
P.S. looks like some of the internals are assuming order, so tomorrow I will explore an alternate method of solving #165
let me know what your testing reveals if you get a chance.
Sorry to say this, but it still doesn't solve the problem. Several protein sequences seem not correct and also the statistics for the CDS changed drastically (in genome.stats). I will try to provide a small test file soon.
Edit: Here is an full example using two transcipts from the Athaliana_167_TAIR10 from Phytozome11. This example produces incorrect protein sequences and also the GAG script does not detect the stop codon at the end and therefor it reports them incorrect in the gff file.
Input files: example.gff3.zip example_genome.fa.zip
Output files produced by GAG: GAG_output.zip
Corrrect protein sequences: genome.proteins.correct.fasta.zip
I used this command to run the GAG script:
python gag.py --fasta Athaliana_167_TAIR9.fa --gff Athaliana_167_TAIR10.gene_exons.gff3 --fix_terminal_ns --fix_start_stop --remove_introns_shorter_than 10 --out GAG
I hope this helps to investigate the problem further.
And thanks for your time for improving and correcting this tool. I really appreciate it.
PS: I hope zip files are OK. But otherwise github doesn't accept the original file format.
zips should be fine. Thank you for these resources
Does the most recent commit https://github.com/genomeannotation/GAG/commit/98da78e2a531ab4711e335e3e313a701c134370d works as expected? Sorry, but I hadn't time to test it by myself so far.
The previous version also had some errors regarding detecting the correct position of the start/stop codons and the correct length of introns (to remove the short introns). This errors were introduced in the processed GFF file and therefor still present in the final tbl file.
In the end, I used my own scripts to perform this steps to get a working tbl file for submission.
PS: I edited the title from 'Protein sequences are not correct' to 'Feature positions on the negative strand are not calculated correctly', because I think all this errors are related and the title is more suitable.
No, not working yet. Thank you, that seems like a good name change.
Hi all, The protein translation is still not right on the reverse strand, here is an example I just ran with the development branch. The positions in the tbl file are all correct, but the translation is wrong.
GAG genome.gff file
##gff-version 3
chr1 EVM gene 3964916 3966867 . - . ID=AFUI_01436
chr1 EVM mRNA 3964916 3966867 . - . ID=AFUI_01436-T1;Parent=AFUI_01436
chr1 EVM exon 3964916 3965828 . - . ID=evm.model.unplaced_39.51.exon4;Parent=AFUI_01436-T1
chr1 EVM exon 3965895 3966091 . - . ID=evm.model.unplaced_39.51.exon3;Parent=AFUI_01436-T1
chr1 EVM exon 3966145 3966588 . - . ID=evm.model.unplaced_39.51.exon2;Parent=AFUI_01436-T1
chr1 EVM exon 3966649 3966867 . - . ID=evm.model.unplaced_39.51.exon1;Parent=AFUI_01436-T1
chr1 EVM CDS 3964916 3965828 . - 1 ID=cds.evm.model.unplaced_39.51;Parent=AFUI_01436-T1
chr1 EVM CDS 3965895 3966091 . - 0 ID=cds.evm.model.unplaced_39.51;Parent=AFUI_01436-T1
chr1 EVM CDS 3966145 3966588 . - 0 ID=cds.evm.model.unplaced_39.51;Parent=AFUI_01436-T1
chr1 EVM CDS 3966649 3966867 . - 0 ID=cds.evm.model.unplaced_39.51;Parent=AFUI_01436-T1
chr1 EVM start_codon 3966865 3966867 . - . ID=AFUI_01436-T1:start;Parent=AFUI_01436-T1
chr1 EVM stop_codon 3964916 3964918 . - . ID=AFUI_01436-T1:stop;Parent=AFUI_01436-T1
chr1 EVM gene 3967096 3969745 . + . ID=AFUI_01437
chr1 EVM mRNA 3967096 3969745 . + . ID=AFUI_01437-T1;Parent=AFUI_01437
chr1 EVM exon 3967096 3968205 . + . ID=evm.model.unplaced_39.52.exon1;Parent=AFUI_01437-T1
chr1 EVM exon 3968258 3969745 . + . ID=evm.model.unplaced_39.52.exon2;Parent=AFUI_01437-T1
chr1 EVM CDS 3967096 3968205 . + 0 ID=cds.evm.model.unplaced_39.52;Parent=AFUI_01437-T1
chr1 EVM CDS 3968258 3969745 . + 0 ID=cds.evm.model.unplaced_39.52;Parent=AFUI_01437-T1
chr1 EVM start_codon 3967096 3967098 . + . ID=AFUI_01437-T1:start;Parent=AFUI_01437-T1
chr1 EVM stop_codon 3969743 3969745 . + . ID=AFUI_01437-T1:stop;Parent=AFUI_01437-T1
Proteins output by GAG
>AFUI_01436-T1 protein
**IP*KDAPREASCAAVSEGVARCTDQRGRREQQATPASMRQISQTTPRRSSPSFYIATIGF*H*YPTIGQPWLFSQHQYALGGAKQCSHVNSGRSPAPGLKPCSRPFQSSQILEPSIQPLNRTMFALSTNLLTSCTSSSSRTANPTSFKTLIAFIFSPK*PRASARAWTSGRLFAMPSNCLARSTKSSRWDTERICRFLRSRHSWKWKVTKREYRKLSKETKSLKPARSAKERRSSWRCSERRQLAVAGLWHPGPHPILYTLLHHVLLLPKHTILMRRRRRSHLRNRCPHAEKACNLAKNRRPQISTRKFVVIWAPKLMNRVHW*LPKFRRRQPRGSPPLGPHYPPTESLSISLLLKQFLPNSPVKVP*SRSR*RATSSSVSPTHLSRKSSWTCWPILHMAHNSAPIPMLTRPFSPAPQPFS*KT*PSVSQPTTRLASFVGALLVRGQKMLISSPSRSQCG*IRAPTPPL*QSSMSSRARIR*ETLLFPYLLAQQSLQSLASTLSMRSLGIVWTGTSVPWTRPTLQEALNLNLQAMETKTSSSR*MCASPKSVHSLR*MSPMSLYSRWKGRAQDSQKM*EALRRVTLSN
>AFUI_01437-T1 protein
MATASSLAFTNMEVPGLDDTMEMASPYQGHVDDFDIDLDVMEDQASNADKDMTAADDYPDTLHDEDINQDGPRDADMIDDVAEPTMVDADDQYQEASHNVEMQYGIEESHEEEMLEDEYVDDADTAAPEYQDVQVCDNTDKPGTEEAGSAEYPVAGDHAQEPAVGAHLEAKPKYHDGNGHELQGALDQRNDDADQPPDTHQPEIAVDTINSTNETDQVDSALKPLDAAKTRPDPGHPEGSDDYNEGSDEHASQISQPVEEHGAAGDEGTDEQAAQDLGLQGPEQLSEHEPDTTGQVALHPVKVYYQENEISLFPPHEGDSSETFFLEDESLAYEPLGKLFESCREVLYEHVGENETLVLDIESLNLQLMEDSTHISKVTLHQIIDVYLRLCHNDGIESPEPLYLTLSTKSTISAEVSNLLLAASEGKGLSEINLWDDYQEVDEQLGEVSEAHDEGAHDEEEEEEEEEEEEEEEEEEEEEEEEEEKEEAQGQEEEPISPNRPMTNQNVDARELQAASDEDESHHDIQDNGEAFKAQELVGNDTASLAKRSSSVAPATDVQEDELNEEGEVEDANDPEFDKGTASNEEAAHEESNDSEEQQTDTTVTMTHLPGADMNDELENKDGSASPVHNDQTDRDLGNEHSEIDKSGEENLERTGPVELEVPEGPGVSDDERQDSGSVKPVVGDSRTEQSLAGHGGPSIDDAVDNVEEHSQDLEESSTAGDDDYLHKDSESVPTLQVSERQERDTPDPANDTLGFVEDLFESPSKGSKNKDGNLDDTSEFGKNHDEFEDDTPAEILRQHDDELPLDDEEYIDLGFADGPDTVDESPVSGLRSLDNRASKRIRDPEDEFDIAENASPDLKRSRSS*
Genome tbl file
3966867 3964916 gene
locus_tag AFUI_01436
3966867 3966649 mRNA
3966588 3966145
3966091 3965895
3965828 3964916
product hypothetical protein
protein_id gnl|ncbi|AFUI_01436-T1
transcript_id gnl|ncbi|AFUI_01436-T1_mrna
3966867 3966649 CDS
3966588 3966145
3966091 3965895
3965828 3964916
codon_start 1
product hypothetical protein
protein_id gnl|ncbi|AFUI_01436-T1
transcript_id gnl|ncbi|AFUI_01436-T1_mrna
3967096 3969745 gene
locus_tag AFUI_01437
3967096 3968205 mRNA
3968258 3969745
product hypothetical protein
protein_id gnl|ncbi|AFUI_01437-T1
transcript_id gnl|ncbi|AFUI_01437-T1_mrna
3967096 3968205 CDS
3968258 3969745
codon_start 1
product hypothetical protein
protein_id gnl|ncbi|AFUI_01437-T1
transcript_id gnl|ncbi|AFUI_01437-T1_mrna
Converted to GBK via tbl2asn
gene complement(3964916..3966867)
/locus_tag="AFUI_01436"
mRNA complement(join(3964916..3965828,3965895..3966091,
3966145..3966588,3966649..3966867))
/locus_tag="AFUI_01436"
/product="hypothetical protein"
CDS complement(join(3964916..3965828,3965895..3966091,
3966145..3966588,3966649..3966867))
/locus_tag="AFUI_01436"
/codon_start=1
/product="hypothetical protein"
/protein_id="ncbi:AFUI_01436-T1"
/translation="MIDSVEGRTQRSFLCCGKRGGSAVHGPTGTKRTASDSGFNEADI
PNNPQKKQPFLLHSNHRFLTLIPNNRATMVVLAASICTRGGKAVLSRQFREIARSRIE
ALLASFPKLADSGTQHTTVEQDNVRFVYQPLDELYIVLITNRQSNILQDIDSLHLFAQ
VTTSICKSLDEREIVRNAFELLSAFDEIVTLGYRENLSLSQIKTFLEMESHEERIQEI
IERNKELEASEERKRKAKQLEMQRKEAARSGRSMAPRAPSYPVYTPPSRPAAPETYDT
YEAEKKKSFAKPLPTRGKGMQLGKKSKTTDIYEKVRGDLGPEVDESSPLVTPQVPTPA
AERVPSARASLSADREPVHITIAETISAKLTREGALKSFEVKGDLQLRITDPSFTKIK
LDLLANPTHGAQFRTHPNVDKAVFTSSSAIQLKDLTKRFPANNSIGVLRWRVASSGSE
NADILPITFTVWVNKGSDSTTVTIEYELTGSDTLRDVVVSIPFGATEPTVSSFDAVYE
VSGDSLDWNIGTVDETNASGSFEFESAGDGDENEFFPMNVRFSKVSPFVEVDVTNVSL
LEMEGESTGFSKDVRSIAEGYVIE"
gene 3967096..3969745
/locus_tag="AFUI_01437"
mRNA join(3967096..3968205,3968258..3969745)
/locus_tag="AFUI_01437"
/product="hypothetical protein"
CDS join(3967096..3968205,3968258..3969745)
/locus_tag="AFUI_01437"
/codon_start=1
/product="hypothetical protein"
/protein_id="ncbi:AFUI_01437-T1"
/translation="MATASSLAFTNMEVPGLDDTMEMASPYQGHVDDFDIDLDVMEDQ
ASNADKDMTAADDYPDTLHDEDINQDGPRDADMIDDVAEPTMVDADDQYQEASHNVEM
QYGIEESHEEEMLEDEYVDDADTAAPEYQDVQVCDNTDKPGTEEAGSAEYPVAGDHAQ
EPAVGAHLEAKPKYHDGNGHELQGALDQRNDDADQPPDTHQPEIAVDTINSTNETDQV
DSALKPLDAAKTRPDPGHPEGSDDYNEGSDEHASQISQPVEEHGAAGDEGTDEQAAQD
LGLQGPEQLSEHEPDTTGQVALHPVKVYYQENEISLFPPHEGDSSETFFLEDESLAYE
PLGKLFESCREVLYEHVGENETLVLDIESLNLQLMEDSTHISKVTLHQIIDVYLRLCH
NDGIESPEPLYLTLSTKSTISAEVSNLLLAASEGKGLSEINLWDDYQEVDEQLGEVSE
AHDEGAHDEEEEEEEEEEEEEEEEEEEEEEEEEEKEEAQGQEEEPISPNRPMTNQNVD
ARELQAASDEDESHHDIQDNGEAFKAQELVGNDTASLAKRSSSVAPATDVQEDELNEE
GEVEDANDPEFDKGTASNEEAAHEESNDSEEQQTDTTVTMTHLPGADMNDELENKDGS
ASPVHNDQTDRDLGNEHSEIDKSGEENLERTGPVELEVPEGPGVSDDERQDSGSVKPV
VGDSRTEQSLAGHGGPSIDDAVDNVEEHSQDLEESSTAGDDDYLHKDSESVPTLQVSE
RQERDTPDPANDTLGFVEDLFESPSKGSKNKDGNLDDTSEFGKNHDEFEDDTPAEILR
QHDDELPLDDEEYIDLGFADGPDTVDESPVSGLRSLDNRASKRIRDPEDEFDIAENAS
PDLKRSRSS"
Alright, so I discovered that on the "-" strand the phase from the wrong cds was being applied before translation. This resulted in the wrong open reading frame during translation in some cases. Should work now.
please reopen if still having this problem.
This issues seems to be related to this previously closed one: https://github.com/genomeannotation/GAG/issues/165
Several protein sequences reported in the genome.proteins.fasta are not correct and have a frameshift sometimes 2 and sometimes 3). I checked randomly some sequences and it seems the incorrect genes are only located on the negative strand.
I'm wondering, if this problem only occurs during the producing of the protein file or if there are also errors in the final tbl file.
Here are two examples, I hope this helps to investigate the problem:
Example 1
GFF:
Incorrect protein sequence:
Correct CDS (produced with a different tool):
Example 2
GFF:
Incorrect protein sequence:
Correct CDS (produced with a different tool):