How to get aa and CDS sequences

SC-Duan commented 3 years ago

Hi, I run TSEBRA and got the gtf file without UTRs. Then I want to get aa and CDS files according to gtf file using gtf2aa.pl and gffGetmRNA.pl in Augustus. I got lots of warnings for gtf2aa.pl: Use of uninitialized value in concatenation (.) or string at ~/Augustus/scripts/gtf2aa.pl line 208. Use of uninitialized value in concatenation (.) or string at ~/Augustus/scripts/gtf2aa.pl line 208. Use of uninitialized value in concatenation (.) or string at ~/Augustus/scripts/gtf2aa.pl line 208. ..........

and gffGetmRNA.pl stops with an error: Coordinate of g9042.t1 out of sequence range. 252844 >= 252844.

My gtf file: 1 AUGUSTUS gene 1 686 . + . g1 1 AUGUSTUS transcript 1 686 . + . g1.t1 1 AUGUSTUS CDS 1 686 0.7 + 2 transcript_id "g1.t1"; gene_id "g1"; 1 AUGUSTUS exon 1 686 . + . transcript_id "g1.t1"; gene_id "g1"; 1 AUGUSTUS stop_codon 684 686 . + 0 transcript_id "g1.t1"; gene_id "g1";

Can you give me some help? Thank you!

LarsGab commented 3 years ago

Hi, I haven't encountered a problem with those scripts yet. Please make sure that you used the current versions of TSEBRA and BRAKER for your runs. Reading the error messages, I would guess that somehow the coordinates of one or more transcripts exceed the end coordinate of their sequence in your genome file. Another reason could be that something went wrong with the coordinates of transcript 'g9042.t1'. Can you post the lines from the gtf file of 'g9042.t1'? Best, Lars

SC-Duan commented 3 years ago

Hi, the all versions are updated. I check the aa file and I think it is ok, but I do not known why it tells "Use of uninitialized value in concatenation (.) or string at ~/Augustus/scripts/gtf2aa.pl line 208.".

For the error with "Coordinate of g9042.t1 out of sequence range. 252844 >= 252844.", the total length of the chromosome (which g9042.t1 in) is 252844. and I check the gffGetmRNA.pl and found the statement:

die ("Coordinate of $txid out of sequence range. $exon->[4] >= " . length($sequence{$chr}) . "\n") if ($exon->[4] >= length($sequence{$chr}));

So if the gene locates the end of chromosome, it get the error. Maybe it should be corrected as: if ($exon->[4] > length($sequence{$chr}));

Again, I found that some genes overlap with repeats, should these genes be discarded?

LarsGab commented 3 years ago

Hi, I think the error in gtf2aa.pl is due to the same special case as in gffGetmRNA.pl.

Did you mask your genome sequence for repeats and did you use the option --softmasking for your BRAKER runs?

SC-Duan commented 3 years ago

Yes, the genome is masked and I use the option --softmasking.

LarsGab commented 3 years ago

There was an issue in the BRAKER repository addressing predicted genes overlapping with repeats: https://github.com/Gaius-Augustus/BRAKER/issues/362 You should probably remove only the genes where almost all of the CDS overlap with repeats. Or you could rerun BRAKER and increase the repeat penalty.

SC-Duan commented 3 years ago

OK. Thank you very much! I will try it.

V-JJ commented 2 years ago

Hello!

I would like to explain the process and problems that we had so far to obtain mRNA, CDS and proteins after running TSEBRA . This software is great and easy to use. However, I would suggest to add or integrate in it the automatic extraction of mRNA, CDS and proteins (as BRAKER does).

Briefly What works for me is:

Run Braker1, Braker2
Merge the results with TSEBRA
Rename the output TSEBRA gtf file (as suggested here #9 )

#!/bin/bash
gtf_file=$1 # Output of TSEBRA
ttable=gtf_translationtable.txt
out_file="${gtf_file}.renamed.gtf"

$HOME/TSEBRA/bin/rename_gtf.py --gtf $gtf_file --translation_tab $ttable --out $out_file

Convert gtf to gff with Augustus script gtf2gff.pl (This step is optional since gffread can work with both format of files).
Retrieve the corresponding mRNA, CDS, and protein sequences with gffread ignoring the Warnings that appear (as suggested by @LarsGab in #457 BRAKER issue Warning: invalid GTF record, transcript_id not found: scf1 AUGUSTUS transcript 347180 347404 0.56 + . g56765.t1

#!/bin/bash
# Input files
genome=$HOME/mygenome.fasta
gtf_file=Tsebra.braker1+2_combined.gtf.renamed.gtf

# Output files
mrna_file="tsebra_gtf_gffread.mRNA.fasta"
cds_file="tsebra_gtf_gffread.CDS.fasta"
prot_file="tsebra_gtf_gffread.prot.fasta"

# Commands
gffread -w $mrna_file -g $genome $gtf_file
gffread -x $cds_file -g $genome $gtf_file
gffread -y $prot_file -g $genome $gtf_file 

echo "# DONE #"

NOTE: I also tried to use Augustus scripts (gffGetmRNA.pl and gtf2aa.pl) to obtain the mRNA and proteins. However these scripts didn't worked for me.

(mRNA) This script: gffGetmRNA.pl gave this kind of errors: Use of uninitialized value in concatenation (.) or string at /users/Programs/Augustus/scripts/gtf2aa.pl line 208.
(Protein) This script: gtf2aa.pl retrieved correctly the great majority of proteins except for 80 cases out of ~60,000 that corresponded to transcripts that have a reading frame "1" or "2". This looks like some bug.

Thanks,

Gaius-Augustus / TSEBRA

How to get aa and CDS sequences #7