Open splaisan opened 6 years ago
I have the same problem. I guess GAG reads column 9 rather than column 3 for CDS.
You could try agat_sp_gxf_to_gff3.pl
from AGAT to fix your gff file first.
I tried this, and GAG does not like the result:
python /opt/biotools/GAG/gag.py --fasta job-133-85882ac6-9f24-45f0-ae08-edbb6552e6b7-file.fasta --gff agat_job-133_Augustus.gff3 --out gag_out_agat-gff
Reading fasta...
Done.
Reading gff...
Traceback (most recent call last):
File "/opt/biotools/GAG/gag.py", line 50, in <module>
main()
File "/opt/biotools/GAG/gag.py", line 46, in main
controller.execute(args)
File "/opt/biotools/GAG/src/controller.py", line 74, in execute
self.read_gff(gffpath, out_dir)
File "/opt/biotools/GAG/src/controller.py", line 286, in read_gff
genes, comments, invalids, ignored = gffreader.read_file(reader)
File "/opt/biotools/GAG/src/gff_reader.py", line 336, in read_file
if len(line) == 0 or line.startswith('#'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
Not sure yet what could be the cause... AGAT has removed the translations and comment lines
original GFF starts like this
##gff-version 3
# This output was generated with AUGUSTUS (version 3.3.1).
# AUGUSTUS is a gene prediction tool written by M. Stanke (mario.stanke@uni-greifswald.de),
# O. Keller, S. König, L. Gerischer, L. Romoth and Katharina Hoff.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# No extrinsic information on sequences given.
# Initialising the parameters using config directory /opt/biotools/Augustus/config/ ...
# E_coli_K12 version. Using species specific transition matrix: /opt/biotools/Augustus/config/species/E_coli_K12/E_coli_K12_trans_shadow_bacterium.pbl
# Using species specific overlap length distribution: /opt/biotools/Augustus/config/species/E_coli_K12/E_coli_K12_ovlp_len.pbl
# admissible start codons and their probabilities: ATA(0), ATC(0), ATG(0.915), ATT(0), CTG(0.000562), GTG(0.0703), TTG(0.0141)
# Looks like job-133-85882ac6-9f24-45f0-ae08-edbb6552e6b7-file.fasta is in fasta format.
# We have hints for 0 sequences and for 0 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 4673810, name = 000000F|arrow) -----
#
# Predicted genes for sequence number 1 on both strands
# start gene g1
000000F|arrow AUGUSTUS gene 83 2383 0.97 + . ID=g1
000000F|arrow AUGUSTUS transcript 83 2383 0.97 + . ID=g1.t1;Parent=g1
000000F|arrow AUGUSTUS start_codon 83 85 . + 0 Parent=g1.t1
000000F|arrow AUGUSTUS CDS 83 2383 0.97 + 0 ID=g1.t1.cds;Parent=g1.t1
000000F|arrow AUGUSTUS stop_codon 2381 2383 . + 0 Parent=g1.t1
# protein sequence = [MYAQTNEYGFLETPYRKVTDGVVTDEIHYLSAIEEGNYVIAQANSNLDEEGHFVEDLVTCRSKGESSLFSRDQVDYMD
# VSTQQVVSVGASLIPFLEHDDANRALMGANMQRQAVPTLRADKPLVGTGMERAVAVDSGVTAVAKRGGVVQYVDASRIVIKVNEDEMYPGEAGIDIYN
# LTKYTRSNQNTCINQMPCVSLGEPVERGDVLADGPSTDLGELALGQNMRVAFMPWNGYNFEDSILVSERVVQEDRFTTIHIQELACVSRDTKLGPEEI
# TADIPNVGEAALSKLDESGIVYIGAEVTGGDILVGKVTPKGETQLTPEEKLLRAIFGEKASDVKDSSLRVPNGVSGTVIDVQVFTRDGVEKDKRALEI
# EEMQLKQAKKDLSEELQILEAGLFSRIRAVLVAGGVEAEKLDKLPRDRWLELGLTDEEKQNQLEQLAEQYDELKHEFEKKLEAKRRKITQGDDLAPGV
# LKIVKVYLAVKRRIQPGDKMAGRHGNKGVISKINPIEDMPYDENGTPVDIVLNPLGVPSRMNIGQILETHLGMAAKGIGDKINAMLKQQQEVAKLREF
# IQRAYDLGADVRQKVDLSTFSDEEVMRLAENLRKGMPIATPVFDGAKEAEIKELLKLGDLPTSGQIRLYDGRTGEQFERPVTVGYMYMLKLNHLVDDK
# MHARSTGSYSLVTQQPLGGKAQFGGQRFGEMEVWALEAYGAAYTLQEMLTVKSDDVNGRTKMYKNIVDGNHQMEPGMPESFNVLLKEIRSLGINIELE
# DE]
# end gene g1
###
# start gene g2
000000F|arrow AUGUSTUS gene 2460 6683 1 + . ID=g2
000000F|arrow AUGUSTUS transcript 2460 6683 1 + . ID=g2.t1;Parent=g2
000000F|arrow AUGUSTUS start_codon 2460 2462 . + 0 Parent=g2.t1
000000F|arrow AUGUSTUS CDS 2460 6683 1 + 0 ID=g2.t1.cds;Parent=g2.t1
000000F|arrow AUGUSTUS stop_codon 6681 6683 . + 0 Parent=g2.t1
# protein sequence = [MKDLLKFLKAQTKTEEFDAIKIALASPDMIRSWSFGEVKKPETINYRTFKPERDGLFCARIFGPVKDYECLCGKYKRL
# KHRGVICEKCGVEVTQTKVRRERMGHIELASPTAHIWFLKSLPSRIGLLLDMPLRDIERVLYFESYVVIEGGMTNLERQQILTEEQYLDALEEFGDEF
# DAKMGAEAIQALLKSMDLEQECEQLREELNETNSETKRKKLTKRIKLLEAFVQSGNKPEWMILTVLPVLPPDLRPLVPLDGGRFATSDLNDLYRRVIN
# RNNRLKRLLDLAAPDIIVRNEKRMLQEAVDALLDNGRRGRAITGSNKRPLKSLADMIKGKQGRFRQNLLGKRVDYSGRSVITVGPYLRLHQCGLPKKM
# ALELFKPFIYGKLELRGLATTIKAAKKMVEREEAVVWDILDEVIREHPVLLNRAPTLHRLGIQAFEPVLIEGKAIQLHPLVCAAYNADFDGDQMAVHV
# PLTLEAQLEARALMMSTNNILSPANGEPIIVPSQDVVLGLYYMTRDCVNAKGEGMVLTGPKEAERLYRSGLASLHARVKVRITEYEKDANGELVAKTS
# LKDTTVGRAILWMIVPKGLPYSIVNQALGKKAISKMLNTCYRILGLKPTVIFADQIMYTGFAYAARSGASVGIDDMVIPEKKHEIISEAEAEVAEIQE
# QFQSGLVTAGERYNKVIDIWAAANDRVSKAMMDNLQTETVINRDGQEEKQVSFNSIYMMADSGARGSAAQIRQLAGMRGLMAKPDGSIIETPITANFR
# EGLNVLQYFISTHGARKGLADTALKTANSGYLTRRLVDVAQDLVVTEDDCGTHEGIMMTPVIEGGDVKEPLRDRVLGRVTAEDVLKPGTADILVPRNT
# LLHEQWCDLLEENSVDAVKVRSVVSCDTDFGVCAHCYGRDLARGHIINKGEAIGVIAAQSIGEPGTQLTMRTFHIGGAASRAAAESSIQVKNKGSIKL
# SNVKSVVNSSGKLVITSRNTELKLIDEFGRTKESYKVPYGAVLAKGDGEQVAGGETVANWDPHTMPVITEVSGFVRFTDMIDGQTITRQTDELTGLSS
# LVVLDSAERTAGGKDLRPALKIVDAQGNDVLIPGTDMPAQYFLPGKAIVQLEDGVQISSGDTLARIPQESGGTKDITGGLPRVADLFEARRPKEPAIL
# AEISGIVSFGKETKGKRRLVITPVDGSDPYEEMIPKWRQLNVFEGERVERGDVISDGPEAPHDILRLRGVHAVTRYIVNEVQDVYRLQGVKINDKHIE
# VIVRQMLRKATIVNAGSSDFLEGEQVEYSRVKIANRELEANGKVGATYSRDLLGITKASLATESFISAASFQETTRVLTEAAVAGKRDELRGLKENVI
# VGRLIPAGTGYAYHQDRMRRRAAGEAPAAPQVTAEDASASLAELLNAGLGGSDNE]
# end gene g2
###
...
agat-fixed like this
##gff-version 3
# This output was generated with AUGUSTUS (version 3.3.1).
# AUGUSTUS is a gene prediction tool written by M. Stanke (mario.stanke@uni-greifswald.de),
# O. Keller, S. König, L. Gerischer, L. Romoth and Katharina Hoff.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# No extrinsic information on sequences given.
# Initialising the parameters using config directory /opt/biotools/Augustus/config/ ...
# E_coli_K12 version. Using species specific transition matrix: /opt/biotools/Augustus/config/species/E_coli_K12/E_coli_K12_trans_shadow_bacterium.pbl
# Using species specific overlap length distribution: /opt/biotools/Augustus/config/species/E_coli_K12/E_coli_K12_ovlp_len.pbl
# admissible start codons and their probabilities: ATA(0), ATC(0), ATG(0.915), ATT(0), CTG(0.000562), GTG(0.0703), TTG(0.0141)
# Looks like job-133-85882ac6-9f24-45f0-ae08-edbb6552e6b7-file.fasta is in fasta format.
# We have hints for 0 sequences and for 0 of the sequences in the input set.
#
# ----- prediction on sequence number 1 (length = 4673810, name = 000000F|arrow) -----
#
# Predicted genes for sequence number 1 on both strands
# start gene g1
000000F|arrow AUGUSTUS gene 83 2383 0.97 + . ID=g1
000000F|arrow AUGUSTUS transcript 83 2383 0.97 + . ID=g1.t1;Parent=g1
000000F|arrow AUGUSTUS exon 83 2383 0.97 + . ID=nbis_NEW-exon-3283;Parent=g1.t1
000000F|arrow AUGUSTUS CDS 83 2383 0.97 + 0 ID=g1.t1.cds;Parent=g1.t1
000000F|arrow AUGUSTUS start_codon 83 85 . + 0 ID=start_codon-1;Parent=g1.t1
000000F|arrow AUGUSTUS stop_codon 2381 2383 . + 0 ID=stop_codon-1;Parent=g1.t1
000000F|arrow AUGUSTUS gene 2460 6683 1 + . ID=g2
000000F|arrow AUGUSTUS transcript 2460 6683 1 + . ID=g2.t1;Parent=g2
000000F|arrow AUGUSTUS exon 2460 6683 1 + . ID=nbis_NEW-exon-924;Parent=g2.t1
000000F|arrow AUGUSTUS CDS 2460 6683 1 + 0 ID=g2.t1.cds;Parent=g2.t1
000000F|arrow AUGUSTUS start_codon 2460 2462 . + 0 ID=start_codon-2;Parent=g2.t1
000000F|arrow AUGUSTUS stop_codon 6681 6683 . + 0 ID=stop_codon-2;Parent=g2.t1
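One quick way to compare the original and AGAT-fixed files is to tally the feature types in column 3. The sketch below is only an illustration: the function name count_feature_types is made up for this example, and the sample lines are copied from the g1 records of the AGAT-fixed excerpt above, assuming tab-separated fields.

```python
from collections import Counter


def count_feature_types(lines):
    """Tally GFF3 column 3 (feature type), skipping comments and blank lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed 9-column feature line
        counts[fields[2]] += 1
    return counts


# Sample records taken from the AGAT-fixed excerpt (gene g1).
sample = [
    "##gff-version 3\n",
    "000000F|arrow\tAUGUSTUS\tgene\t83\t2383\t0.97\t+\t.\tID=g1\n",
    "000000F|arrow\tAUGUSTUS\ttranscript\t83\t2383\t0.97\t+\t.\tID=g1.t1;Parent=g1\n",
    "000000F|arrow\tAUGUSTUS\texon\t83\t2383\t0.97\t+\t.\tID=nbis_NEW-exon-3283;Parent=g1.t1\n",
    "000000F|arrow\tAUGUSTUS\tCDS\t83\t2383\t0.97\t+\t0\tID=g1.t1.cds;Parent=g1.t1\n",
]

feature_counts = count_feature_types(sample)
for feature_type, n in sorted(feature_counts.items()):
    print(feature_type, n)
```

Running the same function over open("your_file.gff3") for both files would show which feature types AGAT added (here, exon) and whether transcript and CDS records are present in column 3.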
Then it could be due to the empty comment line
#
Could you try to throw away all lines starting with # prior to using GAG?
Otherwise you could use EMBLmyGFF3 to submit via ENA instead of NCBI (the data will end up at the same place at the end), I know EMBLmyGFF3 works fine with Augustus annotation.
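Dropping the comment lines can be sketched as follows. This is a minimal example, not part of AGAT or GAG: the function name strip_comment_lines and the keep_version_line parameter are invented here, and keeping the ##gff-version line is an assumption about what the downstream tool wants.

```python
def strip_comment_lines(lines, keep_version_line=True):
    """Yield GFF lines, dropping every line that starts with '#'.

    The '##gff-version' directive is kept when keep_version_line is True.
    """
    for line in lines:
        if line.startswith("##gff-version"):
            if keep_version_line:
                yield line
        elif line.startswith("#"):
            continue  # comment, empty '#' line, or '###' separator
        else:
            yield line


example = [
    "##gff-version 3\n",
    "# This output was generated with AUGUSTUS (version 3.3.1).\n",
    "#\n",
    "# start gene g1\n",
    "000000F|arrow\tAUGUSTUS\tgene\t83\t2383\t0.97\t+\t.\tID=g1\n",
    "###\n",
]

cleaned = list(strip_comment_lines(example))
print("".join(cleaned), end="")
```

To apply it to a file, iterate over open("input.gff3") and write the yielded lines to a new file; both file names are placeholders.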
Merci Jacques,
I kept only the ##gff-version 3 line and removed all other ^# lines before applying AGAT, and I still get the error with GAG. What could cause this string type error?
EMBLmyGFF3 is python2-only and I cannot install it right now using conda
python /opt/biotools/GAG/gag.py --fasta job-133-85882ac6-9f24-45f0-ae08-edbb6552e6b7-file.fasta --gff agat.gff3 --out gag_out_agat-gff
Reading fasta...
Done.
Reading gff...
Traceback (most recent call last):
File "/opt/biotools/GAG/gag.py", line 50, in <module>
main()
File "/opt/biotools/GAG/gag.py", line 46, in main
controller.execute(args)
File "/opt/biotools/GAG/src/controller.py", line 74, in execute
self.read_gff(gffpath, out_dir)
File "/opt/biotools/GAG/src/controller.py", line 286, in read_gff
genes, comments, invalids, ignored = gffreader.read_file(reader)
File "/opt/biotools/GAG/src/gff_reader.py", line 336, in read_file
if len(line) == 0 or line.startswith('#'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
##gff-version 3
000003F|arrow AUGUSTUS gene 44 1345 1 + . ID=g4163
000003F|arrow AUGUSTUS transcript 44 1345 1 + . ID=g4163.t1;Parent=g4163
000003F|arrow AUGUSTUS exon 44 1345 1 + . ID=nbis_NEW-exon-3212;Parent=g4163.t1
000003F|arrow AUGUSTUS CDS 44 1345 1 + 0 ID=g4163.t1.cds;Parent=g4163.t1
000003F|arrow AUGUSTUS start_codon 44 46 . + 0 ID=start_codon-4163;Parent=g4163.t1
000003F|arrow AUGUSTUS stop_codon 1343 1345 . + 0 ID=stop_codon-4163;Parent=g4163.t1
000003F|arrow AUGUSTUS gene 2698 3009 0.76 + . ID=g4164
000003F|arrow AUGUSTUS transcript 2698 3009 0.76 + . ID=g4164.t1;Parent=g4164
000003F|arrow AUGUSTUS exon 2698 3009 0.76 + . ID=nbis_NEW-exon-711;Parent=g4164.t1
000003F|arrow AUGUSTUS CDS 2698 3009 0.76 + 0 ID=g4164.t1.cds;Parent=g4164.t1
000003F|arrow AUGUSTUS start_codon 2698 2700 . + 0 ID=start_codon-4164;Parent=g4164.t1
000003F|arrow AUGUSTUS stop_codon 3007 3009 . + 0 ID=stop_codon-4164;Parent=g4164.t1
000003F|arrow AUGUSTUS gene 3206 3394 0.73 + . ID=g4165
000003F|arrow AUGUSTUS transcript 3206 3394 0.73 + . ID=g4165.t1;Parent=g4165
000003F|arrow AUGUSTUS exon 3206 3394 0.73 + . ID=nbis_NEW-exon-618;Parent=g4165.t1
000003F|arrow AUGUSTUS CDS 3206 3394 0.73 + 0 ID=g4165.t1.cds;Parent=g4165.t1
000003F|arrow AUGUSTUS start_codon 3206 3208 . + 0 ID=start_codon-4165;Parent=g4165.t1
000003F|arrow AUGUSTUS stop_codon 3392 3394 . + 0 ID=stop_codon-4165;Parent=g4165.t1
000003F|arrow AUGUSTUS gene 3358 3669 0.93 + . ID=g4166
000003F|arrow AUGUSTUS transcript 3358 3669 0.93 + . ID=g4166.t1;Parent=g4166
000003F|arrow AUGUSTUS exon 3358 3669 0.93 + . ID=nbis_NEW-exon-1365;Parent=g4166.t1
000003F|arrow AUGUSTUS CDS 3358 3669 0.93 + 0 ID=g4166.t1.cds;Parent=g4166.t1
000003F|arrow AUGUSTUS start_codon 3358 3360 . + 0 ID=start_codon-4166;Parent=g4166.t1
000003F|arrow AUGUSTUS stop_codon 3667 3669 . + 0 ID=stop_codon-4166;Parent=g4166.t1
000003F|arrow AUGUSTUS gene 3702 4157 1 - . ID=g4167
000003F|arrow AUGUSTUS transcript 3702 4157 1 - . ID=g4167.t1;Parent=g4167
000003F|arrow AUGUSTUS exon 3702 4157 1 - . ID=nbis_NEW-exon-2264;Parent=g4167.t1
000003F|arrow AUGUSTUS CDS 3702 4157 1 - 0 ID=g4167.t1.cds;Parent=g4167.t1
000003F|arrow AUGUSTUS start_codon 4155 4157 . - 0 ID=start_codon-4167;Parent=g4167.t1
000003F|arrow AUGUSTUS stop_codon 3702 3704 . - 0 ID=stop_codon-4167;Parent=g4167.t1
000003F|arrow AUGUSTUS gene 4301 4492 0.98 - . ID=g4168
000003F|arrow AUGUSTUS transcript 4301 4492 0.98 - . ID=g4168.t1;Parent=g4168
000003F|arrow AUGUSTUS exon 4301 4492 0.98 - . ID=nbis_NEW-exon-1181;Parent=g4168.t1
000003F|arrow AUGUSTUS CDS 4301 4492 0.98 - 0 ID=g4168.t1.cds;Parent=g4168.t1
000003F|arrow AUGUSTUS start_codon 4490 4492 . - 0 ID=start_codon-4168;Parent=g4168.t1
000003F|arrow AUGUSTUS stop_codon 4301 4303 . - 0 ID=stop_codon-4168;Parent=g4168.t1
...
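The TypeError in the traceback above is what Python 3 raises when bytes.startswith is given a str argument. That would happen if the GFF is read as bytes (for example from a file opened in binary mode) and the code was written for Python 2, where the two types were interchangeable. Whether GAG actually opens the file this way is an assumption and has not been checked against its source. If so, the check fails on the first non-empty line whatever the file contains, which would explain why removing the comment lines made no difference. The helper is_skippable below is hypothetical and only mirrors the condition shown in the traceback.

```python
import io

# Iterating over a binary stream yields bytes objects under Python 3.
binary_stream = io.BytesIO(
    b"##gff-version 3\n"
    b"000003F|arrow\tAUGUSTUS\tgene\t44\t1345\t1\t+\t.\tID=g4163\n"
)
first_line = next(iter(binary_stream))

# Reproduce the failure: a bytes line checked against a str prefix.
try:
    first_line.startswith("#")
    error_message = None
except TypeError as err:
    error_message = str(err)
print(error_message)


def is_skippable(line):
    """Same test as in the traceback, but tolerant of bytes input."""
    if isinstance(line, bytes):
        line = line.decode("utf-8")
    return len(line) == 0 or line.startswith("#")


print(is_skippable(first_line))
```

If this is indeed the cause, the alternatives are to decode the lines as above (or open the file in text mode), or to run GAG under a Python 2 interpreter in a separate environment.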
EMBLmyGFF3 v2 is in python3
this is not what conda tells me :-)
(agat) u0002316@gbw-s-pacbio01:~$ python --version
Python 3.6.10 :: Anaconda, Inc.
(agat) u0002316@gbw-s-pacbio01:~$ conda install -c bioconda emblmygff3
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: |
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:
Specifications:
- emblmygff3 -> python[version='<3']
Your python: python=3.6
If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.
I had just pushed it into Bioconda, it was maybe not yet on their server. I checked now and it's there. Let me know if you still don't see it.
I ran GAG on an Augustus output for E. coli and only find genes in the GAG output. Neither the transcripts nor the CDS (both present in the input) are transferred to genome.mrna.fasta and genome.proteins.fasta.
The features are also absent from the other GAG GFF outputs (.ignored, .invalid)!
-rw-r--r-- 1 u0002316 domain users 1.6M Aug 16 12:14 genome.comments.gff
-rw-r--r-- 1 u0002316 domain users 4.5M Aug 16 12:14 genome.fasta
-rw-r--r-- 1 u0002316 domain users 238K Aug 16 12:14 genome.gff
-rw-r--r-- 1 u0002316 domain users 332K Aug 16 12:14 genome.ignored.gff
-rw-r--r-- 1 u0002316 domain users 587K Aug 16 12:14 genome.invalid.gff
-rw-r--r-- 1 u0002316 domain users 0 Aug 16 12:14 genome.mrna.fasta
-rw-r--r-- 1 u0002316 domain users 0 Aug 16 12:14 genome.proteins.fasta
-rw-r--r-- 1 u0002316 domain users 0 Aug 16 12:14 genome.removed.gff
-rw-r--r-- 1 u0002316 domain users 1.8K Aug 16 12:14 genome.stats
-rw-r--r-- 1 u0002316 domain users 161K Aug 16 12:14 genome.tbl
my input looks like below, what is wrong with it? Thanks