java.lang.ArrayIndexOutOfBoundsException: 47118

jdmontenegro commented 2 years ago

Hi, I am running GUSHR from inside the BRAKER pipeline. The first time I ran it, on a chromosome-level assembly, it worked nicely. Now, running it on a scaffold-level assembly (25K scaffolds), I keep getting the following error when GUSHR runs:

> java -jar /scratch/molevo/jmontenegro/software/GUSHR/GeMoMa-1.6.2.jar CLI AnnotationFinalizer u=YES g=genome.fa a=gushr-TIJPJDZUYCXQ/complete_gemoma_like.gff3 i=gushr-TIJPJDZUYCXQ/introns.gff c=UNSTRANDED coverage_unstranded=gushr-TIJPJDZUYCXQ/coverage.bedgraph rename=NO outdir=gushr-TIJPJDZUYCXQ/
jar time stamp: Sat Aug 20 17:22:40 CEST 2022

Searching for the new GeMoMa updates ...
You are using GeMoMa 1.6.2, but the latest version is 1.9.
You can download the latest version from http://www.jstacs.de/index.php/GeMoMa

Parameters of tool "AnnotationFinalizer" (AnnotationFinalizer, version: 1.6.2):
a - annotation (The predicted genome annotation file (GFF)) = gushr-TIJPJDZUYCXQ/complete_gemoma_like.gff3
t - tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)   = prediction
u - UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)  = YES
    No parameters for selection "NO"
    Parameters for selection "YES":
        g - genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)  = genome.fa
            The following parameter(s) can be used multiple times:
            i - introns file (Introns (GFF), which might be obtained from RNA-seq)  = gushr-TIJPJDZUYCXQ/introns.gff
        r - reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1) = 1
            The following parameter(s) can be used multiple times:
            c - coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO) = UNSTRANDED
                No parameters for selection "NO"
                Parameters for selection "UNSTRANDED":
                    coverage_unstranded - coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)    = gushr-TIJPJDZUYCXQ/coverage.bedgraph
                Parameters for selection "STRANDED":
                    coverage_forward - coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.) = null
                    coverage_reverse - coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.) = null
rename - rename (allows to generate generic gene and transcripts names (cf. attribute "Name"), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)  = NO
         Parameters for selection "COMPOSED":
            p - prefix (the prefix of the generic name) = null
            infix - infix (the infix of the generic name, default = G)  = G
            s - suffix (the suffix of the generic name, default = 0)    = 0
            d - digits (the number of informative digits, valid range = [4, 10], default = 5)   = 5
            di - delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )  = 
         Parameters for selection "SIMPLE":
            p - prefix (the prefix of the generic name) = null
            d - digits (the number of informative digits, valid range = [4, 10], default = 5)   = 5
         No parameters for selection "NO"
outdir - The output directory, defaults to the current working directory (.)    = gushr-TIJPJDZUYCXQ/
genome parts: 25454 [Seg10865, Seg10864, Seg10863, Seg10862, Seg10869, Seg10868, Seg10867, Seg10866, Seg9583, Seg9584, Seg9585, Seg9586, Seg22850, Seg9580, Seg22851, Seg9581, Seg9582, Seg19202, Seg22843, Seg19201, Seg228...
possible introns from RNA-seq (split reads>=1): 864409
+: 163226
-: 170825
.: 265179
Check RNA-seq data (introns): 48% of the sequences in the reference genome are covered.

#genes: 52801
#warnings: [0, 0]
#predictions: 52801
#warnings: [0, 0]
#CDSs: 237069
#warnings: [0, 0]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 47118
    at projects.gemoma.AnnotationFinalizer.extendUTR(AnnotationFinalizer.java:673)
    at projects.gemoma.AnnotationFinalizer.run(AnnotationFinalizer.java:564)
    at projects.gemoma.AnnotationFinalizer.run(AnnotationFinalizer.java:444)
    at de.jstacs.tools.ui.cli.CLI.run(CLI.java:427)
    at projects.gemoma.GeMoMa.main(GeMoMa.java:368)

I am trying to understand what else could be going on here and how to fix it or work around it. The original braker command was as follows:

braker.pl --cores 16 --species=new --softmasking --UTR=on --workingdir=/tmp/slurm-5396296/braker2_rna --AUGUSTUS_BIN_PATH=/apps/augustus/3.4.0/bin --AUGUSTUS_SCRIPTS_PATH=/apps/augustus/3.4.0/scripts --genome=genome.fa --bam=merged.dd.bam

Any help would be much appreciated.

Regards,

Juan D.
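
For anyone hitting the same trace: the thread does not establish the cause, but one thing that may be worth ruling out is a mismatch between `coverage.bedgraph` and `genome.fa`, for example intervals that run past the end of a short scaffold or that name a sequence missing from the FASTA. The sketch below is only a sanity check under that assumption, not a confirmed diagnosis of the `extendUTR` failure. The function names and the script name `check_bedgraph.py` are made up for illustration; the check assumes standard bedGraph coordinates (0-based start, end-exclusive) and that FASTA sequence names are the first word of each header line.

```python
import sys
from typing import Dict, Iterable, Iterator, Tuple


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence name: length}; the name is the first word of the header."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def bad_intervals(
    bedgraph_lines: Iterable[str], lengths: Dict[str, int]
) -> Iterator[Tuple[int, str, str]]:
    """Yield (line number, line, reason) for intervals that do not fit the genome."""
    for number, line in enumerate(bedgraph_lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and optional bedGraph header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            yield number, line, "fewer than 3 columns"
            continue
        name = fields[0]
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            yield number, line, "non-integer coordinates"
            continue
        if name not in lengths:
            yield number, line, "sequence not in FASTA"
        elif start < 0 or start >= end or end > lengths[name]:
            # bedGraph is 0-based and end-exclusive, so end == length is still valid.
            yield number, line, f"outside sequence of length {lengths[name]}"


# Hypothetical command-line use: python check_bedgraph.py genome.fa coverage.bedgraph
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fasta:
        seq_lengths = fasta_lengths(fasta)
    with open(sys.argv[2]) as bedgraph:
        for number, line, reason in bad_intervals(bedgraph, seq_lengths):
            print(f"line {number}: {reason}: {line}")
```

If this prints nothing, the coverage file is at least coordinate-consistent with the genome and the problem lies elsewhere. If it does report lines, regenerating the coverage file from a BAM aligned to exactly this `genome.fa` would be the first thing to try.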

Aswin2667 commented 2 years ago

Can I do this?

jdmontenegro commented 2 years ago

Would it make sense to upgrade GeMoMa to 1.9? Would BRAKER/GUSHR still be compatible with that version?

tomomano commented 2 years ago

@jdmontenegro

My comment here may fix your problem. https://github.com/Gaius-Augustus/BRAKER/issues/456#issuecomment-1279998635

Gaius-Augustus / GUSHR

java.lang.ArrayIndexOutOfBoundsException: 47118 #7