Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

BRAKER run finishes but predictions don't seem correct #462

Closed nhv221 closed 1 year ago

nhv221 commented 2 years ago

Hi,

I am running BRAKER v2.1.6 with RNA-Seq data on a ~3 Gbp soft-masked shark genome (~2000 contigs). The run finishes without any problems according to braker.log.

However, running BUSCO on the augustus.hints.aa results in 98.5% missing BUSCOs.

The only suspicious thing I could see in the logs was in filterGenemark.stderr: Rate of one exon genes (of good genes) cannot be computed since only complete genes can be 'goood'.

This is the output of filterGenemark.stdout

Number of cds hints is 0
Average gene length: 9706
Average number of introns: 5.98033166216737
Good gene rate: 0
Number of genes: 12011
Number of complete genes: 7779
Number of good genes: 0
Number of one-exon-genes: 840
Number of bad genes: 11613
Good intron rate: 0
One exon gene rate (of all genes): 0.0699358920989093
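A quick sanity check of this report can be scripted. The sketch below is only an illustration and is not part of BRAKER: the helper name `parse_filter_stats` is hypothetical, and it assumes the report keeps the "key: value" / "key is value" layout shown above. It recomputes the one-exon gene rate from the counts and flags a run in which no gene was classified as good.

```python
def parse_filter_stats(text):
    """Parse 'key: value' and 'key is value' lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if ": " in line:
            key, value = line.rsplit(": ", 1)
        elif " is " in line:
            key, value = line.rsplit(" is ", 1)
        else:
            continue
        try:
            stats[key] = float(value)
        except ValueError:
            continue  # skip lines whose value is not numeric
    return stats


# Report text copied from the filterGenemark.stdout shown above.
REPORT = """\
Number of cds hints is 0
Average gene length: 9706
Average number of introns: 5.98033166216737
Good gene rate: 0
Number of genes: 12011
Number of complete genes: 7779
Number of good genes: 0
Number of one-exon-genes: 840
Number of bad genes: 11613
Good intron rate: 0
One exon gene rate (of all genes): 0.0699358920989093
"""

stats = parse_filter_stats(REPORT)

# Recompute the rate from the raw counts to confirm the report is consistent.
one_exon_rate = stats["Number of one-exon-genes"] / stats["Number of genes"]

# Zero good genes is the symptom described in this issue.
no_good_genes = stats["Number of good genes"] == 0

print(f"one-exon gene rate: {one_exon_rate:.6f}")
print(f"no good genes: {no_good_genes}")
```

The recomputed rate (840 / 12011) agrees with the value in the report, so the counts themselves are consistent; the zero for good genes and good introns is what stands out.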

I checked the number of introns in genemark_hintsfile.gff: 4289623 in total, of which 672333 have coverage >10. As in this issue https://github.com/Gaius-Augustus/BRAKER/issues/266#issuecomment-698996118, I compared the hints files with the predictions and saw that only 23 of the introns predicted in genemark.gtf are present in hintsfile.gff. I am not sure where the issue lies.
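For anyone wanting to repeat this check, here is a minimal Python sketch of the comparison. It is not the procedure from the linked comment, and it rests on assumptions: intron coverage is read from a `mult=` attribute in column 9, falling back to a positive score in column 6 and otherwise to 1; introns in the GTF are derived from the gaps between consecutive CDS features that share a `transcript_id`. The function names are hypothetical.

```python
from collections import defaultdict


def hint_introns(gff_lines, min_cov=0):
    """Return {(seqid, start, end, strand)} for intron hints with coverage > min_cov."""
    introns = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2].lower() != "intron":
            continue
        # Assumption: coverage is in mult=N; else a positive score column; else 1.
        mult = [p.strip()[5:] for p in f[8].split(";") if p.strip().startswith("mult=")]
        if mult:
            cov = float(mult[0])
        else:
            try:
                score = float(f[5])
            except ValueError:
                score = 0.0
            cov = score if score > 0 else 1.0
        if cov > min_cov:
            introns.add((f[0], int(f[3]), int(f[4]), f[6]))
    return introns


def gtf_introns(gtf_lines):
    """Derive introns from gaps between consecutive CDS features of each transcript."""
    cds = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "CDS" or 'transcript_id "' not in f[8]:
            continue
        tid = f[8].split('transcript_id "')[1].split('"')[0]
        cds[(f[0], f[6], tid)].append((int(f[3]), int(f[4])))
    introns = set()
    for (seqid, strand, _tid), segments in cds.items():
        segments.sort()
        for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
            introns.add((seqid, prev_end + 1, next_start - 1, strand))
    return introns


# Tiny made-up records, only to show the shape of the inputs.
hints = [
    "ctg1\tb2h\tintron\t101\t200\t0\t+\t.\tmult=12;pri=4;src=E",
    "ctg1\tb2h\tintron\t301\t400\t0\t+\t.\tpri=4;src=E",
    "ctg1\tb2h\tintron\t501\t600\t25\t-\t.\tpri=4;src=E",
]
genes = [
    'ctg1\tGeneMark.hmm\tCDS\t1\t100\t.\t+\t0\tgene_id "g1"; transcript_id "t1";',
    'ctg1\tGeneMark.hmm\tCDS\t201\t300\t.\t+\t0\tgene_id "g1"; transcript_id "t1";',
    'ctg1\tGeneMark.hmm\tCDS\t401\t450\t.\t+\t0\tgene_id "g1"; transcript_id "t1";',
]

all_hints = hint_introns(hints)
strong_hints = hint_introns(hints, min_cov=10)
predicted = gtf_introns(genes)
shared = predicted & all_hints
print(len(all_hints), len(strong_hints), len(predicted), len(shared))
```

On real files the lists would come from `open("genemark_hintsfile.gff")` and `open("genemark.gtf")`. Matching includes the strand, so hints with an unknown strand (`.`) will not match stranded predictions; dropping the strand from the key is a reasonable variation. A very small intersection, as reported here, can also point to a mismatch in sequence names or coordinates between the two files, so that is worth ruling out first.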

Some background info: I have run the test data as in test1.sh, which looks good according to compare_intervals_exact.pl. BUSCO suggests that the assembly is quite complete (94% complete BUSCOs), as are the assembled transcripts (93% complete BUSCOs).

Genome: Shark genome, ~3 Gbp. Around 1600 of the 2252 total contigs are larger than 50 kb.
Masking: Genome was soft-masked with RepeatMasker using a custom library generated by RepeatModeler.
External evidence: Several large RNA-Seq libraries of paired-end reads mapped to the genome using HISAT2. Alignment rate of ~80%.
Command: braker.pl --species=pGlauca --genome=Pgla_v1_contigs.fa.masked --cores 24 --bam=RNAseq.sorted.bam --softmasking --nocleanup

Any help would be appreciated, Naima

GeneMark-ET.stdout.txt https://github.com/Gaius-Augustus/BRAKER/files/8088537/GeneMark-ET.stdout.txt

KatharinaHoff commented 2 years ago

Shark is a vertebrate, and judging by its size the genome is apparently rich in repeats. It is not completely unexpected that fully automated training does not work very well here. Have you tried BRAKER with --skipAllTraining, using a previously existing parameter set (e.g. human or one of the fish parameter sets)?

nhv221 commented 2 years ago

Thank you for your suggestions. I tried BRAKER with --skipAllTraining using both the elephant shark and the human parameter sets. Only ~2000 genes were predicted with the elephant shark set, but the run with the human parameter set looks better, with 22000 genes and a BUSCO complete score of 92%.
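For reference, the two reruns can be written down as a small Python sketch that only assembles and prints the commands. It reuses the flags and file names from the original command and adds --skipAllTraining as suggested above. Two assumptions: with --skipAllTraining, --species names an existing AUGUSTUS parameter set, and the identifiers `human` and `elephant_shark` match the parameter sets installed locally. The exact commands used in the reruns were not posted, and the helper name is hypothetical.

```python
import shlex


def braker_pretrained_cmd(species, genome, bam, cores=24):
    """Build a braker.pl argument list that reuses an existing parameter set."""
    return [
        "braker.pl",
        f"--species={species}",  # assumed: name of an existing AUGUSTUS parameter set
        f"--genome={genome}",
        f"--bam={bam}",
        "--softmasking",
        "--cores", str(cores),
        "--skipAllTraining",  # skip training, predict with existing parameters
    ]


# Species identifiers are assumptions; check the local AUGUSTUS species directory.
for species in ("human", "elephant_shark"):
    cmd = braker_pretrained_cmd(species, "Pgla_v1_contigs.fa.masked", "RNAseq.sorted.bam")
    print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; passing the list to `subprocess.run` would launch the actual job.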