Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes
Other
363 stars 81 forks source link

Diagnosing low gene counts? #612

Closed skagawa2 closed 1 year ago

skagawa2 commented 1 year ago

Hello!

I wanted to ask if there was any way to figure out why we might be getting low gene numbers from BRAKER (both BRAKER3 and BRAKER1 + BRAKER2 + TSEBRA had similarly low gene counts than what we expected). Are there any files that we could potentially look at for maybe the confidence of RNA-seq alignments or hints that could tell us to use a different database for the protein evidence?

We have an assembly of Tenebrio molitor that we're trying to annotate and BRAKER3 seems to output only 13.5k genes, much less than we expected (other published assemblies had 20k). However, the BUSCO protein score was higher in the BRAKER3 run than in the BRAKER1 + BRAKER2 + TSEBRA run despite having fewer genes (13.5k vs 15.4k in TSBERA), so it has been difficult to decide which protein set to use and we were wondering if we could diagnose this problem somehow or improve this protein set in any way.

Thank you in advance!

KatharinaHoff commented 1 year ago

I recommend inspecting the predictions with evidence in a Genome Browser (e.g. UCSC Genome Browser or JBrowse). Please also display the gene sets prior merging with TSEBRA.

TSEBRA tends to discard all transcripts that have no evidence. So if the evidence situation is comparably poor, you lose too many transcripts. In the Supplementary of the GALBA preprint, I show a command line that enforces keeping one gene set regardless the evidence ( https://www.biorxiv.org/content/10.1101/2023.04.10.536199v1)

Note, that BRAKER3 also keeps the unmerged gene sets in subfolders. You find the augustus.hints.gtf file in a subfolder Augustus.

Best,

Katharina

@LarsGab : another reason why we need to implement catching this, automatically, in BRAKER...

On Tue, Apr 11, 2023 at 11:16 PM Shaw Kagawa @.***> wrote:

Hello!

I wanted to ask if there was any way to figure out why we might be getting low gene numbers from BRAKER (both BRAKER3 and BRAKER1 + BRAKER2 + TSEBRA had similarly low gene counts than what we expected). Are there any files that we could potentially look at for maybe the confidence of RNA-seq alignments or hints that could tell us to use a different database for the protein evidence?

We have an assembly of Tenebrio molitor that we're trying to annotate and BRAKER3 seems to output only 13.5k genes, much less than we expected (other published assemblies had 20k). However, the BUSCO protein score was higher in the BRAKER3 run than in the BRAKER1 + BRAKER2 + TSEBRA run despite having fewer genes (13.5k vs 15.4k in TSBERA), so it has been difficult to decide which protein set to use and we were wondering if we could diagnose this problem somehow or improve this protein set in any way.

Thank you in advance!

— Reply to this email directly, view it on GitHub https://github.com/Gaius-Augustus/BRAKER/issues/612, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJMC6JHVWVOZ6XGJBJ3FKPDXAXC4RANCNFSM6AAAAAAW2ZRP5A . You are receiving this because you are subscribed to this thread.Message ID: @.***>

skagawa2 commented 1 year ago

For looking at the evidence using a genome browser, would you recommend showing tracks for 1) input reference proteins (in our case, Arthropoda ODBv10 aligned to the genome), 2) RNA-seq reads (aligned using hisat separately? is the BAM file that was used in BRAKER1 somewhere within the output?), 3) BRAKER1 result, 4) BRAKER2 result, and the 5) TSEBRA result or BRAKER3 result? I assume I would look for any obvious signs of lack of RNA-seq alignments to look for low-quality evidence and the presence of many not-evidence-based gene predictions, but is there anything else I should look for?

As for keeping one gene set, the preprint command tsebra.py -g braker.gtf --keep_gtf galba.gtf -e braker_hintsfile.gff,galba_hintsfile.gff -c default.cfg -o tsebra.gtf has only braker.gtf in the -g parameter, is that correct? Should it not be braker.gtf,galba.gtf? If so, is there one gene set that should be used in -g that is better than the other?

I tried (for my annotations) using -g braker1,braker2 -k braker1 and -g braker1,braker2 -k braker2, which both had more reasonable gene counts (~18k for both, BRAKER1 being slightly higher, up from ~15k without -k). My only thought behind why BRAKER1 would be higher (and the one we could end up using on the basis of higher gene count) is due to false positive predictions in non-coding regions, but should there be any other motivation to use one or the other in -g and in -k?

Thank you for your help so far!

skagawa2 commented 1 year ago

Sorry, maybe this problem is not what I titled the issue. We're trying to figure out why we get various gene counts. Here are the different runs we did (starting with the same base assembly using hifiasm):

  1. RepeatMasker v2.0.3 (library: RepeatModeler v4.1.2-p1) then BRAKER1 (v2.1.6) + BRAKER2 (v2.1.6) + TSEBRA (v1.0.3) then remove contigs in gtf that were removed by purge_haplotigs -> 16,820 genes, BUSCO protein score: C:92.0%[S:87.7%,D:4.3%],F:1.2%,M:6.8%,n:1367
  2. purge_haplotigs then RepeatMasker v2.0.4 (library: RepeatModeler v4.1.4 merged with RepBase Insecta) then BRAKER1 (v3.0.2) + BRAKER2 (v3.0.2) + TSEBRA (from BRAKER3 Singularity image, v1.1.0?) -> 15,468 genes, BUSCO protein score: C:92.6%[S:71.8%,D:20.8%],F:1.6%,M:5.8%,n:1367
  3. purge_haplotigs then RepeatMasker v2.0.4 (library: RepeatModeler v4.1.4 merged with RepBase Insecta) then BRAKER3 (v3.0.2) -> 13,511 genes, BUSCO protein score: C:97.0%[S:77.8%,D:19.2%],F:0.1%,M:2.9%,n:1367

_I understand that the first method might be unconventional, but this was done under the assumption that removing the contigs that were purged by purgehaplotigs in a later step would be the same (in effect) as removing the contigs first then annotating, although I may be wrong in these assumptions.

I think the main question becomes what the main differences between BRAKER v2.1.6 and v3.0.2 are unless the culprit is RepeatMasker and RepeatModeler. We ruled out the possibility of the evidence being sparse being the reason for the low gene count as we were using the same RNA-seq reads and the same protein evidence (Arthropoda ODBv10) in all three of these situations.

Please let me know if you have any additional questions about our weird methodology or if you have any insight on what could have caused this difference (or if this is within the normal range of variation).

I apologize for the confusion.

KatharinaHoff commented 1 year ago

There are several differences between the braker versions that you used. Among others, we changed what the default output is... in braker3, it's a tsebra gene set from augustus and genemark, before, we combined the gene sets in a different way. But that's only one of many differences.

The numbers that you report are not unexpected.

One thing is that the default BUSCO does not always handle alternative transcripts, correctly. So the duplicate estimate might not be correct.

The last run that you show has the highest BUSCOs. If you think the total gene count is too low you can re-run tsebra on the output, enforcing e.g. the augustus.hints gene set from the subfolder Augustus. Depending on the evidence situation, TSEBRA might discard a bit too many genes (known issue).

You could run OMArk as an additional measure to estimate prediction quality.

skagawa2 commented 1 year ago

That makes a lot of sense! Thank you for your patience with my explanations as well. When re-running TSEBRA on the output, is this the correct command? I actually don't see an Augustus subfolder in my BRAKER3 run's braker folder, but I do have an augustus.hints.gtf:

singularity exec braker3.sif \
    tsebra.py -g braker/braker.gtf -k braker/augustus.hints.gtf -e braker/hintsfile.gff -c default.cfg -o tsebra_rerun.gtf

I will also try out OMArk, I hadn't heard of this tool before.

skagawa2 commented 1 year ago

We've decided to go with the result from the above command, roughly 18k genes in the final output. Thank you for your help!

chipotlau commented 1 year ago

This thread was very helpful! I have a quick question — after re-running TSEBRA on the output, how did you get protein/CDS from the gtf output file? Do you know if there is a script for this?

KatharinaHoff commented 10 months ago

https://github.com/Gaius-Augustus/Augustus/blob/master/scripts/getAnnoFastaFromJoingenes.py