Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Reduced BUSCO score with protein and rna input #718

Closed Kat0610 closed 11 months ago

Kat0610 commented 11 months ago

Hello,

I ran BRAKER3 with different inputs for the same genome, hoping to find the most complete annotation. However, the annotation generated with paired-end RNA reads alone is more complete than the one generated with both protein data (OrthoDB) and the same paired-end RNA reads. Is that an expected result, or is there a problem with my commands or run?

Any help would be appreciated!

I mapped the RNA reads with hisat2 before starting BRAKER, and assessed completeness with BUSCO.

-- Command: paired-end reads --

perl /vol/data/tools/braker_tools/BRAKER/scripts/braker.pl \
    --species=231121_PE \
    --genome=genome_v0.1_polished_clean.fasta \
    --bam=hisat2_paired_sorted.bam \
    --threads=8 \
    --gff3 \
    --workingdir=/vol/data/katwolff/20231128_braker3pe/

-- Command: paired-end reads and OrthoDB proteins --

perl /vol/data/tools/braker_tools/BRAKER/scripts/braker.pl \
    --species=231122_all \
    --genome=genome_v0.1_polished_clean.fasta \
    --prot_seq=/vol/data/katwolff/20231115_braker3_OrthoDB/Viridiplantae.fa \
    --bam=urtica_hisat2_all_sorted.bam \
    --threads=8 \
    --gff3 \
    --workingdir=/vol/data/katwolff/20231122_braker3_all/

-- BUSCO: paired-end reads --

BUSCO version is: 5.4.4
The lineage dataset is: embryophyta_odb10 (Creation date: 2020-09-10, number of genomes: 50, number of BUSCOs: 1614)
Summarized benchmarking in BUSCO notation for file /vol/data/katwolff/20231115_braker3_both/braker.aa
BUSCO was run in mode: proteins

Results:

C:94.4%[S:22.4%,D:72.0%],F:2.8%,M:2.8%,n:1614
1523 Complete BUSCOs (C)
361 Complete and single-copy BUSCOs (S)
1162 Complete and duplicated BUSCOs (D)
45 Fragmented BUSCOs (F)
46 Missing BUSCOs (M)
1614 Total BUSCO groups searched

Dependencies and versions: hmmsearch: 3.1 busco: 5.4.4

-- BUSCO: paired-end RNA and protein hints --

BUSCO version is: 5.4.4
The lineage dataset is: embryophyta_odb10 (Creation date: 2020-09-10, number of genomes: 50, number of BUSCOs: 1614)
Summarized benchmarking in BUSCO notation for file /vol/data/katwolff/20231122_braker3_all/braker.aa
BUSCO was run in mode: proteins

Results:

C:90.3%[S:26.5%,D:63.8%],F:1.9%,M:7.8%,n:1614
1457 Complete BUSCOs (C)
428 Complete and single-copy BUSCOs (S)
1029 Complete and duplicated BUSCOs (D)
30 Fragmented BUSCOs (F)
127 Missing BUSCOs (M)
1614 Total BUSCO groups searched

Dependencies and versions: hmmsearch: 3.1 busco: 5.4.4
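
To compare the two runs side by side, the one-line BUSCO summaries above can be parsed and subtracted. This is only a minimal Python sketch: `parse_busco` is a hypothetical helper written for this comparison, not part of BUSCO or BRAKER, and the regular expression assumes the summary format shown in the two result blocks above.

```python
import re

# Pattern for the one-line BUSCO summary, e.g.
# "C:94.4%[S:22.4%,D:72.0%],F:2.8%,M:2.8%,n:1614"
# (assumed format, taken from the output quoted in this issue)
BUSCO_LINE = re.compile(
    r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
)

def parse_busco(summary: str) -> dict:
    """Hypothetical helper: turn a BUSCO summary line into a dict of percentages."""
    match = BUSCO_LINE.fullmatch(summary.strip())
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {summary!r}")
    c, s, d, f, m = (float(x) for x in match.groups()[:5])
    return {"C": c, "S": s, "D": d, "F": f, "M": m, "n": int(match.group(6))}

# Summary lines copied from the two runs in this issue
rna_only = parse_busco("C:94.4%[S:22.4%,D:72.0%],F:2.8%,M:2.8%,n:1614")
rna_prot = parse_busco("C:90.3%[S:26.5%,D:63.8%],F:1.9%,M:7.8%,n:1614")

# Change in percentage points when protein hints are added to the RNA-seq run
delta = {key: round(rna_prot[key] - rna_only[key], 1) for key in "CSDFM"}
print(delta)
```

With the numbers above, the complete fraction drops by 4.1 percentage points and the missing fraction rises by 5.0 points (46 to 127 missing BUSCOs) when protein hints are added.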

KatharinaHoff commented 11 months ago

This is a result of how TSEBRA is executed in BRAKER3. We are aware of the issue. There are solutions, but if we increase the number of BUSCOs in the final gene set, we also increase the number of false-positive genes.

We are working on a solution. I will link to a different open issue: https://github.com/Gaius-Augustus/BRAKER/issues/634