Closed SC-Duan closed 3 years ago
Hi, I haven't encountered a problem with those scripts yet. Please make sure that you used the current versions of TSEBRA and BRAKER for your runs. Reading the error messages, I would guess that somehow the coordinates of one or more transcripts exceed the end coordinate of their sequence in your genome file. Another reason could be that something went wrong with the coordinates of transcript 'g9042.t1'. Can you post the lines from the gtf file of 'g9042.t1'? Best, Lars
Hi, the all versions are updated. I check the aa file and I think it is ok, but I do not known why it tells "Use of uninitialized value in concatenation (.) or string at ~/Augustus/scripts/gtf2aa.pl line 208.".
For the error with "Coordinate of g9042.t1 out of sequence range. 252844 >= 252844.", the total length of the chromosome (which g9042.t1 in) is 252844. and I check the gffGetmRNA.pl and found the statement:
die ("Coordinate of $txid out of sequence range. $exon->[4] >= " . length($sequence{$chr}) . "\n") if ($exon->[4] >= length($sequence{$chr}));
So if the gene locates the end of chromosome, it get the error. Maybe it should be corrected as: if ($exon->[4] > length($sequence{$chr}));
Again, I found that some genes overlap with repeats, should these genes be discarded?
Hi, I think the error in gtf2aa.pl is due to the same special case as in gffGetmRNA.pl.
Did you mask your genome sequence for repeats and did you use the option --softmasking
for your BRAKER runs?
Yes, the genome is masked and I use the option --softmasking.
There was an issue in the BRAKER repository addressing predicted genes overlapping with repeats: https://github.com/Gaius-Augustus/BRAKER/issues/362 You should probably remove only the genes where almost all of the CDS overlap with repeats. Or you could rerun BRAKER and increase the repeat penalty.
OK. Thank you very much! I will try it.
Hello!
I would like to explain the process and problems that we had so far to obtain mRNA, CDS and proteins after running TSEBRA . This software is great and easy to use. However, I would suggest to add or integrate in it the automatic extraction of mRNA, CDS and proteins (as BRAKER does).
Briefly What works for me is:
#!/bin/bash
gtf_file=$1 # Output of TSEBRA
ttable=gtf_translationtable.txt
out_file="${gtf_file}.renamed.gtf"
$HOME/TSEBRA/bin/rename_gtf.py --gtf $gtf_file --translation_tab $ttable --out $out_file
gtf2gff.pl
(This step is optional since gffread can work with both format of files).gffread
ignoring the Warnings that appear (as suggested by @LarsGab in #457 BRAKER issue
Warning: invalid GTF record, transcript_id not found: scf1 AUGUSTUS transcript 347180 347404 0.56 + . g56765.t1
#!/bin/bash
# Input files
genome=$HOME/mygenome.fasta
gtf_file=Tsebra.braker1+2_combined.gtf.renamed.gtf
# Output files
mrna_file="tsebra_gtf_gffread.mRNA.fasta"
cds_file="tsebra_gtf_gffread.CDS.fasta"
prot_file="tsebra_gtf_gffread.prot.fasta"
# Commands
gffread -w $mrna_file -g $genome $gtf_file
gffread -x $cds_file -g $genome $gtf_file
gffread -y $prot_file -g $genome $gtf_file
echo "# DONE #"
NOTE: I also tried to use Augustus scripts (gffGetmRNA.pl
and gtf2aa.pl
) to obtain the mRNA and proteins. However these scripts didn't worked for me.
gffGetmRNA.pl
gave this kind of errors:
Use of uninitialized value in concatenation (.) or string at /users/Programs/Augustus/scripts/gtf2aa.pl line 208.
gtf2aa.pl
retrieved correctly the great majority of proteins except for 80 cases out of ~60,000 that corresponded to transcripts that have a reading frame "1" or "2". This looks like some bug.Thanks,
Hi, I run TSEBRA and got the gtf file without UTRs. Then I want to get aa and CDS files according to gtf file using gtf2aa.pl and gffGetmRNA.pl in Augustus. I got lots of warnings for gtf2aa.pl: Use of uninitialized value in concatenation (.) or string at ~/Augustus/scripts/gtf2aa.pl line 208. Use of uninitialized value in concatenation (.) or string at ~/Augustus/scripts/gtf2aa.pl line 208. Use of uninitialized value in concatenation (.) or string at ~/Augustus/scripts/gtf2aa.pl line 208. ..........
and gffGetmRNA.pl stops with an error: Coordinate of g9042.t1 out of sequence range. 252844 >= 252844.
My gtf file: 1 AUGUSTUS gene 1 686 . + . g1 1 AUGUSTUS transcript 1 686 . + . g1.t1 1 AUGUSTUS CDS 1 686 0.7 + 2 transcript_id "g1.t1"; gene_id "g1"; 1 AUGUSTUS exon 1 686 . + . transcript_id "g1.t1"; gene_id "g1"; 1 AUGUSTUS stop_codon 684 686 . + 0 transcript_id "g1.t1"; gene_id "g1";
Can you give me some help? Thank you!