Gaius-Augustus / TSEBRA

TSEBRA: Transcript Selector for BRAKER

How to convert gtf file from TSEBRA to gff3 file as EVM input? #22

Closed (yweii closed this 12 months ago)

yweii commented 2 years ago

Thanks for this great tool; I look forward to your reply. I ran TSEBRA (braker1: RNA-seq; braker2: proteins) and, as expected, got braker1+2_combined.gtf. I then tried several ways to convert it: (1) rename_gtf.py / gtf2gff.pl / add_name_to_gff3.pl / augustus_GFF3_to_EVM_GFF3.pl, which fails with errors such as "Error, feature: Chr10-g_40435 is described multiple times with different data values"; (2) augustus_GTF_to_EVM_GFF3.pl from EVM.

Neither result works. Can you recommend a method?

LarsGab commented 2 years ago

Hi,

thanks for using TSEBRA. I tried using augustus_GTF_to_EVM_GFF3.pl from EVM and in my case, it seems to convert the TSEBRA output correctly. Could you give me some more information about the issue with this script, so I can reproduce your problem?

Best, Lars

zaezaeaguinaldo commented 1 year ago

Hi Lars,

I would like to ask whether it is normal that the other gene-structure features (introns, start and stop codons) are removed from TSEBRA's GTF output when it is converted to EVM GFF3 format. That seems to be the case for me. Attached are the TSEBRA output files (both the GTF and the converted GFF) for reference. I hope you can help me with this matter.

GTF format image

EVM converted GFF format (augustus_GTF_to_EVM_GFF3.pl) image

Thanks, Zae
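One way to check which feature types disappear during conversion is to count the values in column 3 of both files and compare them. The sketch below is a minimal illustration that assumes both files are tab-delimited with the feature type in the third column; the function names are hypothetical and not part of TSEBRA or EVM.

```python
from collections import Counter


def feature_counts(lines):
    """Count feature types (column 3) in tab-delimited GTF/GFF3 lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and comment/directive lines such as "##gff-version 3"
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts


def missing_feature_types(gtf_lines, gff3_lines):
    """Return feature types present in the GTF but absent from the GFF3."""
    before = feature_counts(gtf_lines)
    after = feature_counts(gff3_lines)
    return sorted(set(before) - set(after))
```

Pass the lines of the TSEBRA GTF and of the converted GFF3 (for example, from `open(path)`) to `missing_feature_types`. A type that is only renamed during conversion, such as `transcript` becoming `mRNA`, will also show up in the result, so the list needs to be read with that in mind.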

bijendrabio commented 1 year ago

@LarsGab it doesn't seem to work in my case. I converted the TSEBRA-generated GTF file to GFF3 using augustus_GTF_to_EVM_GFF3.pl, but validating it with gff3_gene_prediction_file_validator.pl gives the error below. Any suggestions?

Regards, B

Error, feature: Chr1-g_638044 is described multiple times with different data values:
$VAR1 = {
          'parent_ID' => undef,
          'rend' => '592671708',
          'contig' => 'Chr1',
          'feature_ID' => 'Chr1-g_638044',
          'feat_type' => 'gene',
          'orient' => '-',
          'lend' => '592671373'

INPUT example:

Chr1    Augustus    gene    592671373   592671693   .   -   .   ID=Chr1-g_638044;Name=Augustus%20prediction
Chr1    Augustus    mRNA    592671373   592671693   .   -   .   ID=Chr1-anno2.g154521.t1;Parent=Chr1-g_638044;Name=Augustus%20prediction
Chr1    Augustus    exon    592671373   592671693   .   -   .   ID=Chr1-anno2.g154521.t1.exon1;Parent=Chr1-anno2.g154521.t1
Chr1    Augustus    CDS     592671373   592671693   .   -   .   ID=cds.Chr1-anno2.g154521.t1;Parent=Chr1-anno2.g154521.t1
Chr1    Augustus    exon    592671373   592671375   .   -   .   ID=Chr1-anno2.g154521.t1.exon2;Parent=Chr1-anno2.g154521.t1
Chr1    Augustus    CDS     592671373   592671375   .   -   .   ID=cds.Chr1-anno2.g154521.t1;Parent=Chr1-anno2.g154521.t1
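The validator error says that one feature ID is described more than once with different values. A quick way to list every such ID in the converted file is to group gene and mRNA lines by their `ID` attribute and report the IDs that map to more than one distinct description. This is only a diagnostic sketch: it assumes a tab-delimited GFF3 file, the function name is hypothetical, and it does not reproduce the validator's own logic.

```python
from collections import defaultdict


def find_conflicting_ids(gff3_lines, feature_types=("gene", "mRNA")):
    """Map each feature ID to its descriptions if it has more than one.

    A description is (seqid, type, start, end, strand). CDS lines are left
    out by default because the segments of one multi-exon CDS may
    legitimately share an ID in GFF3.
    """
    seen = defaultdict(set)
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        for field in attrs.split(";"):
            key, _, value = field.strip().partition("=")
            if key == "ID":
                seen[value].add((seqid, ftype, int(start), int(end), strand))
    # Keep only the IDs that were described in more than one way
    return {fid: descs for fid, descs in seen.items() if len(descs) > 1}
```

Calling `find_conflicting_ids(open(path))` on the converted file returns a dictionary whose keys are the offending IDs, which makes it easier to see whether the conflict comes from duplicated gene IDs in the input GTF or from the conversion step.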

LarsGab commented 1 year ago

Hi,

I'm not sure what EVM's requirements are, and it might be a problem on EVM's end. If it is a problem with the naming convention of the transcript/gene IDs, you can try renaming all IDs first with rename_gtf.py. TSEBRA can also report multiple transcript isoforms per gene, which might also cause problems for EVM, if I remember correctly.
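If multiple isoforms do turn out to be the problem, one possible workaround is to reduce the GTF to a single transcript per gene before conversion. The sketch below keeps the isoform with the longest total CDS. It is not part of TSEBRA, the function name is hypothetical, and it assumes the AUGUSTUS-style GTF layout: `gene` and `transcript` lines carry a bare ID in column 9, while all other features carry `transcript_id "..."; gene_id "...";`.

```python
import re
from collections import defaultdict

TX_RE = re.compile(r'transcript_id "([^"]+)"')
GENE_RE = re.compile(r'gene_id "([^"]+)"')


def keep_longest_isoform(gtf_lines):
    """Return GTF lines restricted to the longest-CDS transcript per gene."""
    rows = [l.rstrip("\n").split("\t") for l in gtf_lines
            if l.strip() and not l.startswith("#")]
    rows = [r for r in rows if len(r) == 9]

    # Pass 1: total CDS length per transcript, and the gene each belongs to
    cds_len = defaultdict(int)
    gene_of = {}
    for r in rows:
        tx, gene = TX_RE.search(r[8]), GENE_RE.search(r[8])
        if r[2] == "CDS" and tx and gene:
            cds_len[tx.group(1)] += int(r[4]) - int(r[3]) + 1
            gene_of[tx.group(1)] = gene.group(1)

    # Pick the transcript with the longest CDS; ties keep the first one seen
    best = {}
    for tx, length in cds_len.items():
        gene = gene_of[tx]
        if gene not in best or length > cds_len[best[gene]]:
            best[gene] = tx
    keep = set(best.values())

    # Pass 2: keep gene lines, plus every line of a selected transcript
    out = []
    for r in rows:
        tx = TX_RE.search(r[8])
        if tx:
            tx_id = tx.group(1)
        elif r[2] in ("transcript", "mRNA"):
            tx_id = r[8].strip().rstrip(";")  # assumed bare transcript ID
        else:
            tx_id = None  # gene lines are always kept
        if tx_id is None or tx_id in keep:
            out.append("\t".join(r))
    return out
```

Longest CDS is only one possible selection rule, and whether dropping isoforms is appropriate depends on what the EVM run is meant to produce. The filtered lines can be written to a new GTF and then passed to the conversion script.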

Best, Lars