Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes
Other
364 stars 81 forks source link

Intron with RNAseq but no intron with RNAseq + protein #684

Closed gchevignon closed 1 year ago

gchevignon commented 1 year ago

Hello,

There is something I don't understand : why with RNAseq only I got lot of predicted genes with introns whereas when I add protein from other species from the same supergroup or even from eukaryote odb10 there are no more intron in the predicted genes ... I know the specie I work on should possess intron .

You'll find both type of predictions in the screenshot.

Best

Germain

Braker3_example.pdf

LarsGab commented 1 year ago

Hi,

could you compare the predictions from Augustus located at Augustus/augustus.hints.gtf from your runs? If they contain introns, there may be an issue with the hints generated from your protein data. Can you ensure that the hintsfile.gff is not empty?

Best, Lars

gchevignon commented 1 year ago

Hi Lars Thanks for your reply ! My Augustus/augustus.hints.gtf do not contain intron and the hintsfile.gff is not empty. Best

Germain

LarsGab commented 1 year ago

This is very unusual. Can you provide the version of BRAKER you're using and the command line you executed for running BRAKER?

Best, Lars

gchevignon commented 1 year ago

This Braker3 v3.0.3

Here is the PBS command:

!/usr/bin/env bash

PBS -N braker3

PBS -q omp

PBS -l ncpus=50

PBS -l mem=100gb

PBS -l walltime=24:00:00

Load Singularity

. /etc/profile.d/modules.sh module purge module load singularity/3.4.1

Braker singularity container

BRAKER_SIF=/appli/bioinfo/braker3/3.0.3/singularity/braker3.sif

GENOME=Genome_polished.fasta

OUTDIR=/operation/14-braker3-6/

BAM=Genome_polished.bam.sort.bam

PROT=refseq_db_Euk.faa

Run Braker

singularity run -B $DATAWORK ${BRAKER_SIF} braker.pl --species=BonamiaOstreae --genome=${GENOME} --bam=${BAM} --threads ${NCPUS} --workingdir=${OUTDIR} --prot_seq=${PROT} --gff3 >& ${OUTDIR}braker3.e${PBS_JOBID}_Bo 2>&1

LarsGab commented 1 year ago

Thank you for the details. It seems like something went wrong during training of the gene finders. I assume the Genemark-ETP prediciton (GeneMark-ETP/genemark.gtf) has also no introns, did GeneMark-ETP produce any error message (errors/GeneMark-ETP.stderr)? My suspicion is that there might have gone wrong during the GeneMark-ETP step. Does your BAM file conform with the required format (here)? Alternatively, would it be possible for you to run the pipeline once more using the original FASTQ files?

gchevignon commented 1 year ago

Indeed there is an error at the GeneMark-ETP: My BAM should be OK ... I'll re-run with FASTQ Best

FASTA index file /home/datawork-sgmm-nos/DATA_MINION_LGPMM/ONT_Parasites/Bonamia/Purification/operation/14-braker3-2/GeneMark-ETP/data/genome.softmasked.fasta.fai created. Use of uninitialized value $ph1 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205. Use of uninitialized value $ph0 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205. Use of uninitialized value $ph2 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205. Use of uninitialized value $ph0 in division (/) at /opt/ETP/bin/gmes/parse_set.pl line 208. Illegal division by zero at /opt/ETP/bin/gmes/parse_set.pl line 208. cat: GT.mat: No such file or directory cat: AG.mat: No such file or directory cat: GT.mat: No such file or directory cat: AG.mat: No such file or directory cat: GT.mat: No such file or directory cat: AG.mat: No such file or directory cat: GC.mat: No such file or directory cat: GC.mat: No such file or directory cat: GC.mat: No such file or directory Illegal division by zero at /opt/ETP/bin/train_super.pl line 184. error, file not found: option --f1 prothint/prothint.gff grep: prothint/evidence.gff: No such file or directory grep: prothint/evidence.gff: No such file or directory Traceback (most recent call last): File "/opt/ETP/bin/printRnaAlternatives.py", line 353, in main() File "/opt/ETP/bin/printRnaAlternatives.py", line 289, in main candidates = loadIntrons(args.genemark) File "/opt/ETP/bin/printRnaAlternatives.py", line 193, in loadIntrons for row in csv.reader(open(inputFile), delimiter='\t'): FileNotFoundError: [Errno 2] No such file or directory: 'pred_m/genemark.gtf' error, file not found: option --f1 prothint/prothint.gff grep: prothint/evidence.gff: No such file or directory grep: prothint/evidence.gff: No such file or directory Died at /opt/ETP/bin/format_back.pl line 14. Died at /opt/ETP/bin/format_back.pl line 14. Use of uninitialized value $ph1 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205. Use of uninitialized value $ph0 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205. Use of uninitialized value $ph2 in addition (+) at /opt/ETP/bin/gmes/parse_set.pl line 205. Use of uninitialized value $ph0 in division (/) at /opt/ETP/bin/gmes/parse_set.pl line 208. Illegal division by zero at /opt/ETP/bin/gmes/parse_set.pl line 208. cat: GT.mat: No such file or directory cat: AG.mat: No such file or directory cat: GT.mat: No such file or directory cat: AG.mat: No such file or directory cat: GT.mat: No such file or directory cat: AG.mat: No such file or directory cat: GC.mat: No such file or directory cat: GC.mat: No such file or directory cat: GC.mat: No such file or directory Illegal division by zero at /opt/ETP/bin/train_super.pl line 184. error, file not found: option --f1 prothint/prothint.gff grep: prothint/evidence.gff: No such file or directory grep: prothint/evidence.gff: No such file or directory Died at /opt/ETP/bin/format_back.pl line 14. Died at /opt/ETP/bin/format_back.pl line 14.

LarsGab commented 1 year ago

In my experience, this usually indicates that the BAM file may not be formatted correctly or there's insufficient RNA-Seq data for GeneMark-ETP. That said, the issue here is not entirely clear. Rerunning with the FASTQ files could help.

gchevignon commented 1 year ago

OK I'll re-run with fastq ! but is the problem could be that I have too much data in my RNAseq (I have around 40gb IIllumina RNAseq data where my ref genome is only 12mb long ...) ?

gchevignon commented 1 year ago

So I have rerunned with fastq files and it works. Thanks a lot for your help ! Best