Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes
Other
367 stars 81 forks source link

Exon models look good, but gene models are far too large #859

Open jtoy7 opened 2 months ago

jtoy7 commented 2 months ago

Hello,

I apologize if a similar issue has been posted elsewhere, but I did not find one when I searched.

I am trying to annotate a new reference genome (a marine fish; assembly is near-chromosome-level) by supplying BRAKER3 with RNAseq data from a previous experiment and protein orthology data from OrthoDB. Looking at summary statistics of my annotation (after clipping overlapping and redundant features), it looks like BRAKER is doing an okay job of predicting exons, because they total to about 6% of the genome size with a reasonable total gene number of 21,378. However, when looking at total gene length (introns + exons), genes end up taking up about 45% of the genome length, which is obviously not correct.

I'm guessing that multiple genes are being modeled as single genes, which then incorrectly incorporates the space between them as large "introns". More could be learned by manually comparing these gene models to high quality annotations from other species, but this is obviously not feasible to do for 20,000 genes.

I am wondering if you have any recommendations for changes in the running parameters in my BRAKER command that may help mitigate this issue. I'm not sure if this can be incorporated, but I do have a reference transcriptome that was assembled using TopHat2 and the same RNAseq data used in the BRAKER run. My command is below. Thanks!

braker.pl --species=EJA_1 --genome=$REF \
       --rnaseq_sets_ids=amb_1_trimmo,amb_2_trimmo,amb_3_trimmo,amb_4_trimmo,low_muscle_1_trimmo,low_muscle_2_trimmo,low_muscle_3_trimmo,low_muscle_4_trimmo,low_stat_1_trimmo,low_stat_2_trimmo,low_stat_3_trimmo,low_stat_4_trimmo,low_var_1_trimmo,low_var_2_trimmo,low_var_3_trimmo,low_var_4_trimmo,mod_muscle_1_trimmo,mod_muscle_2_trimmo,mod_muscle_3_trimmo,mod_muscle_4_trimmo,mod_stat_1_trimmo,mod_stat_2_trimmo,mod_stat_3_trimmo,mod_stat_4_trimmo,mod_var_1_trimmo,mod_var_2_trimmo,mod_var_3_trimmo,mod_var_4_trimmo \
       --rnaseq_sets_dirs=$RNADIR \
       --prot_seq=$PROT \
       --threads=42 \
       --workingdir=/hb/groups/bernardi_lab/jason/EJA_braker3/EJA_1 \
       --gff3 \
       --makehub --email=xxxx@xxx.edu \
       > EJA_1_log.out 2> EJA_1_log.err
KatharinaHoff commented 2 months ago

I regrettably have no easy parameter modification recommendation. Are you sure you have fusion genes?

I recommend that you visualize both the gene set and the evidence tracks in a Browser. MakeHub is an easy option to generate the tracks. Look at whether the evidence supports your impression that the long introns are indeed wrong. Also look whether there are other really valid options to split the genes. Some species have very long introns. We have heard this e.g. from some trees. (I did understand you work on fish.)

Note that the mapping tools that are used (here Hisat2) probably have an internal limit for splice junction sizes. Not seeing a junction that long does not mean it really does not exist in the RNA-Seq data. You may have to remap with other tools/other parameters and load the information as additional track to confirm. But something like "single exon genes are now false multi-exon genes" should be recognizable from the ProtHint evidence. You can at least rather easily spot the genes, and then do a BLAST or DIAMOND of the exon amino acid sequence to confirm.

The StringTie assembly command is inside of GeneMark-ETP. It is currently not implemented in a way that is easy to modify from the outside (no command line option).

On Thu, Sep 19, 2024 at 12:50 AM Jason Toy @.***> wrote:

Hello,

I apologize if a similar issue has been posted elsewhere, but I did not find one when I searched.

I am trying to annotate a new reference genome (a marine fish; assembly is near-chromosome-level) by supplying BRAKER3 with RNAseq data from a previous experiment and protein orthology data from OrthoDB. Looking at summary statistics of my annotation (after clipping overlapping and redundant features), it looks like BRAKER is doing an okay job of predicting exons, because they total to about 6% of the genome size with a reasonable total gene number of 21,378. However, when looking at total gene length (introns + exons), genes end up taking up about 45% of the genome length, which is obviously not correct.

I'm guessing that multiple genes are being modeled as single genes, which then incorrectly incorporates the space between them as large "introns". More could be learned by manually comparing these gene models to high quality annotations from other species, but this is obviously not feasible to do for 20,000 genes.

I am wondering if you have any recommendations for changes in the running parameters in my BRAKER command that may help mitigate this issue. I'm not sure if this can be incorporated, but I do have a reference transcriptome that was assembled using TopHat2 and the same RNAseq data used in the BRAKER run. My command is below. Thanks!

braker.pl --species=EJA_1 --genome=$REF \ --rnaseq_sets_ids=amb_1_trimmo,amb_2_trimmo,amb_3_trimmo,amb_4_trimmo,low_muscle_1_trimmo,low_muscle_2_trimmo,low_muscle_3_trimmo,low_muscle_4_trimmo,low_stat_1_trimmo,low_stat_2_trimmo,low_stat_3_trimmo,low_stat_4_trimmo,low_var_1_trimmo,low_var_2_trimmo,low_var_3_trimmo,low_var_4_trimmo,mod_muscle_1_trimmo,mod_muscle_2_trimmo,mod_muscle_3_trimmo,mod_muscle_4_trimmo,mod_stat_1_trimmo,mod_stat_2_trimmo,mod_stat_3_trimmo,mod_stat_4_trimmo,mod_var_1_trimmo,mod_var_2_trimmo,mod_var_3_trimmo,mod_var_4_trimmo \ --rnaseq_sets_dirs=$RNADIR \ --prot_seq=$PROT \ --threads=42 \ --workingdir=/hb/groups/bernardi_lab/jason/EJA_braker3/EJA_1 \ --gff3 \ --makehub @.*** \

EJA_1_log.out 2> EJA_1_log.err

— Reply to this email directly, view it on GitHub https://github.com/Gaius-Augustus/BRAKER/issues/859, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJMC6JHXRAPCYR6LA4GKUSDZXH7UNAVCNFSM6AAAAABOOVWDEGVHI2DSMVQWIX3LMV43ASLTON2WKOZSGUZTIOJQHA4DKMA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

jtoy7 commented 2 months ago

Hi Katharina. Thanks so much for the quick response. I will look into the avenues of clarification that you have outlined. With regard to the assembly option of StringTie, I assume this would not be something I could easily edit within the source code for one-time use? (e.g., finding the StringTie command and adding -G transcriptome.gtf)

KatharinaHoff commented 2 months ago

Yes, you can alter things directly in the code. I on occasion do this, myself.