Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

--busco_lineage helps with busco scores but not with protein number #876

Open aureliendejode opened 5 days ago

aureliendejode commented 5 days ago

Hello, I used BRAKER3 with default parameters to annotate 3 anemone genomes. The BUSCO scores of the predicted proteins were lower than those of the genome assembly, so I ran it again with the --busco_lineage option, which solved that issue. However, there is still a big difference in the number of proteins among the braker.aa, genemark.aa and augustus.hints.aa files. Is this something that needs to be fixed? (I started running OMArk on braker.aa and the results look fine to me.) If so, it seems to me this might come from TSEBRA; is there perhaps a way to run TSEBRA differently?

Here are the stats for the 2 braker runs:

#### Before using --busco_lineage

# BUSCO version is: 5.6.1 
# The lineage dataset is: eukaryota_odb10 (Creation date: 2024-01-08, number of genomes: 70, number of BUSCOs: 255)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker/braker.aa
# BUSCO was run in mode: proteins

    ***** Results: *****

    C:92.5%[S:83.1%,D:9.4%],F:1.6%,M:5.9%,n:255    
    236 Complete BUSCOs (C)            
    212 Complete and single-copy BUSCOs (S)    
    24  Complete and duplicated BUSCOs (D)     
    4   Fragmented BUSCOs (F)              
    15  Missing BUSCOs (M)             
    255 Total BUSCO groups searched     

-rw-r--r-- 1 adejode bmtitus 18M 14 oct.  16:33 Augustus/augustus.hints.aa
-rw-r--r-- 1 adejode bmtitus 10M 14 oct.  16:35 braker.aa
-rw-r--r-- 1 adejode bmtitus 19M 15 oct.  14:19 GeneMark-ETP/genemark.aa

grep -c ">" braker.aa Augustus/augustus.hints.aa GeneMark-ETP/genemark.aa 
braker.aa:19761
Augustus/augustus.hints.aa:38767
GeneMark-ETP/genemark.aa:36201

# BUSCO version is: 5.6.1 
# The lineage dataset is: metazoa_odb10 (Creation date: 2024-01-08, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker/braker.aa
# BUSCO was run in mode: proteins

    ***** Results: *****

    C:91.7%[S:83.6%,D:8.1%],F:1.3%,M:7.0%,n:954    
    875 Complete BUSCOs (C)            
    798 Complete and single-copy BUSCOs (S)    
    77  Complete and duplicated BUSCOs (D)     
    12  Fragmented BUSCOs (F)              
    67  Missing BUSCOs (M)             
    954 Total BUSCO groups searched        

#### After using --busco_lineage

# BUSCO version is: 5.6.1 
# The lineage dataset is: eukaryota_odb10 (Creation date: 2024-01-08, number of genomes: 70, number of BUSCOs: 255)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker_busco_lineage/braker/braker.aa
# BUSCO was run in mode: proteins

    ***** Results: *****

    C:97.3%[S:72.2%,D:25.1%],F:0.8%,M:1.9%,n:255       
    248 Complete BUSCOs (C)            
    184 Complete and single-copy BUSCOs (S)    
    64  Complete and duplicated BUSCOs (D)     
    2   Fragmented BUSCOs (F)              
    5   Missing BUSCOs (M)             
    255 Total BUSCO groups searched        

Dependencies and versions:
    hmmsearch: 3.1
    busco: 5.6.1

# BUSCO version is: 5.6.1 
# The lineage dataset is: metazoa_odb10 (Creation date: 2024-01-08, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker_busco_lineage/braker/braker.aa
# BUSCO was run in mode: proteins

    ***** Results: *****

    C:97.5%[S:69.5%,D:28.0%],F:0.6%,M:1.9%,n:954       
    930 Complete BUSCOs (C)            
    663 Complete and single-copy BUSCOs (S)    
    267 Complete and duplicated BUSCOs (D)     
    6   Fragmented BUSCOs (F)              
    18  Missing BUSCOs (M)             
    954 Total BUSCO groups searched        
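To compare the two runs without reading the summaries by eye, the one-line BUSCO notation can be parsed and subtracted. This is only a minimal sketch: `parse_busco` and `busco_delta` are hypothetical helper names, not part of BUSCO or BRAKER, and the regular expression assumes the summary line has exactly the layout shown in the outputs above.

```python
import re

# Assumption: the summary line looks like
# "C:97.5%[S:69.5%,D:28.0%],F:0.6%,M:1.9%,n:954" as in the outputs above.
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Turn one BUSCO summary line into a dict of percentages plus n."""
    match = BUSCO_LINE.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {summary!r}")
    values = {key: float(match.group(key)) for key in "CSDFM"}
    values["n"] = int(match.group("n"))
    return values


def busco_delta(before, after):
    """Percentage-point change per category (after minus before)."""
    if before["n"] != after["n"]:
        raise ValueError("summaries come from different lineage datasets")
    return {key: round(after[key] - before[key], 1) for key in "CSDFM"}


if __name__ == "__main__":
    # metazoa_odb10 summaries copied from the two runs above
    before = parse_busco("C:91.7%[S:83.6%,D:8.1%],F:1.3%,M:7.0%,n:954")
    after = parse_busco("C:97.5%[S:69.5%,D:28.0%],F:0.6%,M:1.9%,n:954")
    print(busco_delta(before, after))
```

For the metazoa_odb10 summaries, completeness rises by about 5.8 percentage points while the duplicated share rises by about 19.9 points. One possible contributor to duplicated BUSCOs in protein mode is several isoforms per gene in the protein file; this sketch does not check that.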

-rw-r--r-- 1 adejode bmtitus 18M 16 oct.  10:04 Augustus/augustus.hints.aa
-rw-r--r-- 1 adejode bmtitus 11M 16 oct.  10:07 braker.aa
-rw-r--r-- 1 adejode bmtitus 19M 16 oct.  10:52 GeneMark-ETP/genemark.aa

grep -c ">" braker.aa Augustus/augustus.hints.aa GeneMark-ETP/genemark.aa 
braker.aa:20454
Augustus/augustus.hints.aa:38756
GeneMark-ETP/genemark.aa:36206
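Note that `grep -c ">"` counts protein records, that is, one per transcript, not genes. A rough sketch for reporting both numbers per file follows. It assumes AUGUSTUS/BRAKER-style headers such as `>g12.t1`, where the `.t<N>` suffix marks the transcript; `count_proteins_and_genes` is a hypothetical helper, and the GeneMark-ETP file may use a different ID convention that needs its own pattern.

```python
import re
from pathlib import Path

# Assumption: headers look like ">g12.t1"; stripping ".t<N>" gives the gene ID.
# Files with another naming scheme need a different pattern.
TRANSCRIPT_SUFFIX = re.compile(r"\.t\d+$")


def count_proteins_and_genes(fasta_path):
    """Return (number of protein records, number of distinct gene IDs)."""
    proteins = 0
    genes = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                proteins += 1
                fields = line[1:].split()
                seq_id = fields[0] if fields else ""
                genes.add(TRANSCRIPT_SUFFIX.sub("", seq_id))
    return proteins, len(genes)


if __name__ == "__main__":
    # Paths as used in the grep command above, relative to the braker directory
    for name in ("braker.aa", "Augustus/augustus.hints.aa",
                 "GeneMark-ETP/genemark.aa"):
        if Path(name).exists():
            n_prot, n_gene = count_proteins_and_genes(name)
            print(f"{name}: {n_prot} proteins, {n_gene} genes")
```

If the IDs do not follow the assumed pattern, the gene count simply equals the protein count, which is a hint that the pattern needs adjusting.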

KatharinaHoff commented 5 days ago

Why do you think the number of genes is too low? How many genes do you see in close relatives, and how were these annotated?

aureliendejode commented 5 days ago

There are not many genomes of close relatives, but as an example, Nematostella vectensis has ~19 000 protein-coding genes and ~38 000 genes, and its annotation was produced with the NCBI Eukaryotic Genome Annotation Pipeline.

I am actually not sure the number of genes is too low. I was just wondering whether the difference in the number of sequences (among braker.aa, genemark.aa and augustus.hints.aa) is something to be concerned about, especially since the BRAKER output is considerably smaller (GTF files) and contains far fewer proteins than the AUGUSTUS and GeneMark ones.

-rw-r--r-- 1 adejode bmtitus 89M 14 oct.  11:31 GeneMark-ETP/genemark.gtf
-rw-r--r-- 1 adejode bmtitus 67M 14 oct.  16:33 Augustus/augustus.hints.gtf
-rw-r--r-- 1 adejode bmtitus 46M 14 oct.  16:34 braker.gtf
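File size is only a loose proxy for content, so counting distinct genes and transcripts in each GTF may give a more direct comparison. The sketch below assumes that feature lines carry `gene_id "..."` and `transcript_id "..."` attributes in the ninth column; lines without them are skipped for counting. `count_gtf_features` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Assumption: attributes are written as gene_id "g1"; transcript_id "g1.t1";
GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_gtf_features(gtf_path):
    """Return (distinct gene IDs, distinct transcript IDs) found in a GTF."""
    genes, transcripts = set(), set()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # comment or header line
            columns = line.rstrip("\n").split("\t")
            if len(columns) < 9:
                continue  # not a complete GTF record
            gene = GENE_ID.search(columns[8])
            transcript = TRANSCRIPT_ID.search(columns[8])
            if gene:
                genes.add(gene.group(1))
            if transcript:
                transcripts.add(transcript.group(1))
    return len(genes), len(transcripts)


if __name__ == "__main__":
    for name in ("GeneMark-ETP/genemark.gtf", "Augustus/augustus.hints.gtf",
                 "braker.gtf"):
        if Path(name).exists():
            n_gene, n_tx = count_gtf_features(name)
            print(f"{name}: {n_gene} genes, {n_tx} transcripts")
```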

KatharinaHoff commented 5 days ago

The difference between AUGUSTUS, GeneMark-ETP and BRAKER alone is not a strong indication that anything is wrong. A lack of evidence, on the other hand, would lead to overly strict filtering, but we have no numbers to estimate this. I would not worry about it if nothing important is missing. You say the OMArk scores are good and one remote relative has similar numbers, so it may be OK.

aureliendejode commented 4 days ago

Great, thanks for your insights on this!