Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes
Other
347 stars 79 forks source link

using compute_accuracies.sh #847

Closed JBerthelier closed 1 week ago

JBerthelier commented 1 month ago

Dear Braker3 authors,

I would like to use the compute_accuracies.sh tool provided in Braker suite to check my annotation accuraty agaisnt a reference one. I checked Braker3 and Galba publications, I googled it, but found no details about the tool usage.

Could you provide quick explaination of the input data ?

My guess is:

I think a regular usage is: bash compute_accuracies.sh annot.gtf pseudogenes.gff prediction.gtf gene cds intron

annot.gtf -> the reference annotation, but does it can contain pseudogenes ? pseudogenes.gff -> pseudogene annotation prediction.gtf -> the braker3.gtf gene cds intron -> the feature I wish

I am mostly concerned regarding the pseudogene annotation, how did/do you generate them ?

Thanks for your time,

Jeremy

KatharinaHoff commented 1 month ago

Dear Jeremy,

the processing of reference annotations is documented here: https://github.com/gatech-genemark/EukSpecies-BRAKER2

Go to the species subfolders. There is detailed documentation for each species, including how we generated the file pseudo.gff3.

Best,

Katharina

On Wed, Jul 17, 2024 at 11:29 AM Jérémy Berthelier @.***> wrote:

Dear Braker3 authors,

I would like to use the compute_accuracies.sh tool provided in Braker suite to check my annotation accuraty agaisnt a reference one. I checked Braker3 and Galba publications, I googled it, but found no details about the tool usage.

Could you provide quick explaination of the input data ?

My guess is:

I regular usage is: bash compute_accuracies.sh annot.gtf pseudogenes.gff prediction.gtf gene cds intron

annot.gtf -> the reference annotation, but does it can contain pseudogenes ? pseudogenes.gff -> pseudo gene annotation prediction.gtf -> the braker3.gtf gene cds intron -> the feature I wish

I am mostly concerned regarding the pseudogene annotation, how did/do you generate them ?

Thanks for your time,

Jeremy

— Reply to this email directly, view it on GitHub https://github.com/Gaius-Augustus/BRAKER/issues/847, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJMC6JHBF3QARVWYJA4UAF3ZMY2RBAVCNFSM6AAAAABLAHLSGCVHI2DSMVQWIX3LMV43ASLTON2WKOZSGQYTGMJRHE4DKNI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

JBerthelier commented 1 month ago

‌Dear Katharina, 

Thank you for your quick reply. I will check it!

Best

 

Jeremy

  De : "Katharina Hoff" @.> A : "Gaius-Augustus/BRAKER" @.>,"Jérémy Berthelier" @.>,"Author" @.> Envoyé: mercredi 17 Juillet 2024 17:08 Objet : Re: [Gaius-Augustus/BRAKER] using compute_accuracies.sh (Issue #847)  

  Dear Jeremy,

the processing of reference annotations is documented here: https://github.com/gatech-genemark/EukSpecies-BRAKER2

Go to the species subfolders. There is detailed documentation for each species, including how we generated the file pseudo.gff3.

Best,

Katharina

On Wed, Jul 17, 2024 at 11:29 AM Jérémy Berthelier @.***> wrote:

Dear Braker3 authors,

I would like to use the compute_accuracies.sh tool provided in Braker suite to check my annotation accuraty agaisnt a reference one. I checked Braker3 and Galba publications, I googled it, but found no details about the tool usage.

Could you provide quick explaination of the input data ?

My guess is:

I regular usage is: bash compute_accuracies.sh annot.gtf pseudogenes.gff prediction.gtf gene cds intron

annot.gtf -> the reference annotation, but does it can contain pseudogenes ? pseudogenes.gff -> pseudo gene annotation prediction.gtf -> the braker3.gtf gene cds intron -> the feature I wish

I am mostly concerned regarding the pseudogene annotation, how did/do you generate them ?

Thanks for your time,

Jeremy

— Reply to this email directly, view it on GitHub https://github.com/Gaius-Augustus/BRAKER/issues/847, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJMC6JHBF3QARVWYJA4UAF3ZMY2RBAVCNFSM6AAAAABLAHLSGCVHI2DSMVQWIX3LMV43ASLTON2WKOZSGQYTGMJRHE4DKNI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

JBerthelier commented 1 week ago

Dear Braker3 authors,

Thanks for your previous help regading compute_accuracies.sh!

I ran Braker3 with same dataset as specified in your braker3 paper for Arabidopsis thaliana (the SRA RNAseq and the close relative protein DB).

I am next checking that my output is similar to yours by comparing with the reference annotation of A. thaliana with compute_accuracies.sh.

To do so, I produced the parsed annot.gtf for A. thaliana from Araport11 following your tutorial: https://github.com/gatech-genemark/EukSpecies-BRAKER2/blob/master/Arabidopsis_thaliana/README.md

I got the exact same file than the one you are providing: https://github.com/gatech-genemark/EukSpecies-BRAKER2/blob/master/Arabidopsis_thaliana/annot/annot.gtf.gz

Using compute_accuracies.sh, the results of Sp and Sn for gene and transcript level, were similar to yours reported in the Braker3 paper! however, I did not managed to get the scores for the exon level. The tool compute_accuracies.sh is reporting me an error message.

My command is:

bash compute_accuracies.sh annot.gtf pseudo.gff3 braker.fixed.gtf gene trans exon

My output of compute_accuracies.sh is the following:

gene_Sn 83.54 gene_Sp 84.81 trans_Sn 59.35 trans_Sp 79.03 error, no intervals in file: annot.gtf exon_Sn exon_Sp

I noticed there is not exon information in the annot.gtf, I also got the same error message with the file you provided here https://github.com/gatech-genemark/EukSpecies-BRAKER2/blob/master/Arabidopsis_thaliana/annot/annot.gtf.gz

Please could you help me to know how you did got the Sp and Sn for exon level? Did you applied any extra step in the annot.gtf file to get the exon information?

Best regards,


The 10 first lines of my files looks like that :

head annot.gtf

Chr1 Araport11 CDS 3760 3913 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; cds_type "Initial"; count "1_6"; Chr1 Araport11 CDS 3996 4276 . + 2 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; cds_type "Internal"; count "2_6"; Chr1 Araport11 CDS 4486 4605 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; cds_type "Internal"; count "3_6"; Chr1 Araport11 CDS 4706 5095 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; cds_type "Internal"; count "4_6"; Chr1 Araport11 CDS 5174 5326 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; cds_type "Internal"; count "5_6"; Chr1 Araport11 CDS 5439 5630 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; cds_type "Terminal"; count "6_6"; Chr1 Araport11 intron 3914 3995 . + 1 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; count "1_5"; site_seq "GT_AG"; Chr1 Araport11 intron 4277 4485 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; count "2_5"; site_seq "GT_AG"; Chr1 Araport11 intron 4606 4705 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; count "3_5"; site_seq "GT_AG"; Chr1 Araport11 intron 5096 5173 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; count "4_5"; site_seq "GT_AG";

head pseudo.gff

gff-version 3

sequence-region Chr1 1 30427671

sequence-region Chr2 1 19698289

sequence-region Chr3 1 23459830

sequence-region Chr4 1 18585056

sequence-region Chr5 1 26975502

Chr1 Araport11 pseudogene 402693 402961 . - . ID=AT1G02136;Note=pseudogene;Name=AT1G02136;curator_summary=pseudogene of phagocytosis and cell motility protein;locus_type=pseudogene;Dbxref=locus:4515102491,locus:4515102491;computational_description=pseudogene of ELMO/CED-12 family protein Chr1 Araport11 pseudogenic_transcript 402693 402961 . - . ID=AT1G02136.1;Parent=AT1G02136;Name=AT1G02136.1;index=1;Note=pseudogene of ELMO/CED-12 family protein;curator_summary=pseudogene of phagocytosis and cell motility protein;Dbxref=gene:4515100475,gene:4515100475

Chr1 Araport11 transposable_element_gene 433031 433819 . - . ID=AT1G02228;Note=transposable_element_gene;Name=AT1G02228;curator_summary=pseudogene of no apical meristem (NAM) family protein;Dbxref=locus:4010713408;locus_type=transposable_element_gene

head braker.fixed.gtf

Chr1 AUGUSTUS start_codon 3760 3762 . + 0 transcript_id "Chr1+_g1.t1"; gene_id "Chr1+_g1"; Chr1 AUGUSTUS CDS 3760 3913 1 + 0 transcript_id "Chr1+_g1.t1"; gene_id "Chr1+_g1"; Chr1 AUGUSTUS exon 3760 3913 . + . transcript_id "Chr1+_g1.t1"; gene_id "Chr1+_g1"; Chr1 AUGUSTUS intron 3914 3995 1 + . transcript_id "Chr1+_g1.t1"; gene_id "Chr1+_g1"; Chr1 AUGUSTUS CDS 3996 4276 1 + 2 transcript_id "Chr1+_g1.t1"; gene_id "Chr1+_g1"; Chr1 AUGUSTUS exon 3996 4276 . + . transcript_id "Chr1+_g1.t1"; gene_id "Chr1+_g1"; Chr1 AUGUSTUS intron 4277 4485 1 + . transcript_id "Chr1+_g1.t1"; gene_id "Chr1+_g1"; Chr1 AUGUSTUS CDS 4486 4605 1 + 0 transcript_id "Chr1+_g1.t1"; gene_id "Chr1+_g1"; Chr1 AUGUSTUS exon 4486 4605 . + . transcript_id "Chr1+_g1.t1"; gene_id "Chr1+_g1"; Chr1 AUGUSTUS intron 4606 4705 1 + . transcript_id "Chr1+_g1.t1"; gene_id "Chr1+_g1";

JBerthelier commented 1 week ago

Dear Braker3 authors,

please ignore my previous message I found in MandM that the command is

compute_accuracies.sh ref_annot.gtf pseudo.gff3 gene_set.gtf gene trans cds (not exon)

Best,

Jeremy