broadinstitute / pilon

Pilon is an automated genome assembly improvement and variant detection tool
GNU General Public License v2.0

Issue running pilon : java.lang.OutOfMemoryError: Java heap space #44

Closed RxLoutre closed 7 years ago

RxLoutre commented 7 years ago

Hi, I tried to run Pilon with the following command:

java -jar pilon-1.21.jar --genome '/media/loutre/SUZUKII/assembly/merged/3-suzukii-polished-80-merged-renamed.fasta' --frags '/media/loutre/SUZUKII/annotation/evidences/rna/hisat/80x-illumina-suzukii-sorted.bam' --diploid --outdir '/media/loutre/SUZUKII/polishing' --output pilon80x-polishing-illumina --threads 32 --debug

And I got the following output:


Pilon version 1.21 Fri Dec 9 16:44:44 2016 -0500
Genome: /media/loutre/SUZUKII/assembly/merged/3-suzukii-polished-80-merged-renamed.fasta
Fixing snps, indels, gaps, local
Input genome size: 286810664
Scanning BAMs
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.simontuffs.onejar.Boot.run(Boot.java:340)
    at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:198)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:660)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:634)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:628)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:598)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:544)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:518)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.broadinstitute.pilon.BamFile.scan(BamFile.scala:285)
    at org.broadinstitute.pilon.GenomeFile$$anonfun$processRegions$3.apply(GenomeFile.scala:93)
    at org.broadinstitute.pilon.GenomeFile$$anonfun$processRegions$3.apply(GenomeFile.scala:93)
    at scala.collection.parallel.AugmentedIterableIterator$class.map2combiner(RemainsIterator.scala:115)
    at scala.collection.parallel.immutable.ParVector$ParVectorIterator.map2combiner(ParVector.scala:62)
    at scala.collection.parallel.ParIterableLike$Map.leaf(ParIterableLike.scala:1054)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
    at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
    at scala.collection.parallel.ParIterableLike$Map.tryLeaf(ParIterableLike.scala:1051)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
    at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341)
    at scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673)
    at scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:378)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:443)
    at scala.collection.parallel.ForkJoinTasks$class.executeAndWaitResult(Tasks.scala:426)
    at scala.collection.parallel.ForkJoinTaskSupport.executeAndWaitResult(TaskSupport.scala:56)
    at scala.collection.parallel.ParIterableLike$ResultMapping.leaf(ParIterableLike.scala:958)

The Illumina reads were aligned with HISAT2, then sorted and indexed with samtools.
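
For reference, the sort/index step was along these lines (the unsorted BAM name here is illustrative):

samtools sort -@ 8 -o 80x-illumina-suzukii-sorted.bam 80x-illumina-suzukii.bam
samtools index 80x-illumina-suzukii-sorted.bam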

Did I do something wrong?

Thanks for your help,

Roxane

shenwei356 commented 7 years ago

Try feeding Java more memory, e.g. 20G. Note that the -Xmx option has to come before -jar, otherwise it is passed to Pilon as an argument rather than to the JVM: java -Xmx20G -jar xxx.jar
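
Applied to the command from the original report, that would be:

java -Xmx20G -jar pilon-1.21.jar \
    --genome '/media/loutre/SUZUKII/assembly/merged/3-suzukii-polished-80-merged-renamed.fasta' \
    --frags '/media/loutre/SUZUKII/annotation/evidences/rna/hisat/80x-illumina-suzukii-sorted.bam' \
    --diploid --outdir '/media/loutre/SUZUKII/polishing' \
    --output pilon80x-polishing-illumina --threads 32 --debug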

RxLoutre commented 7 years ago

It gives the same error even with up to 100G of memory... Any other suggestions?

Jintlich commented 7 years ago

That means you need more memory... I have a 2.8G genome and a 40x Illumina BAM. I had the same problem and solved it with -Xmx160G.

xiexr commented 7 years ago

I hit the same error. I solved it by raising the memory limit further, to -Xmx120G. I think Pilon simply consumes too much memory on large genomes. Is there any improvement planned?

w1bw commented 7 years ago

When Pilon was written, the primary use case was smaller genomes, and I'm happy people have had as much success as they have with larger ones. The time and space efficiency could be improved, but some parts would need to be completely rewritten and would run more slowly to minimize the memory footprint. I'll keep this in mind.

Rob-murphys commented 4 years ago

How do I assign more memory when using a conda environment and submitting to SLURM? And if Pilon is not designed for large genomes, what other tools would you suggest?
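
In case it's useful to others, here is the kind of batch script I'm trying (untested sketch; it assumes the conda package installs a pilon wrapper that launches an ordinary JVM, and the file names are placeholders):

#!/bin/bash
#SBATCH --mem=200G
#SBATCH --cpus-per-task=16
# Any standard JVM picks up JAVA_TOOL_OPTIONS, even when started through the
# conda wrapper script; keep -Xmx below the SLURM --mem allocation.
export JAVA_TOOL_OPTIONS="-Xmx180G"
pilon --genome assembly.fasta --frags reads.sorted.bam --outdir pilon_out --threads 16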

hoelzer commented 4 years ago

Bumping this to follow up on @Lamm-a's question.

zolotarovgl commented 4 years ago

Processing a large genome without crashing a server

Hi! As far as I can tell, pilon doesn't write intermediate results (the per-chromosome results) to the output as it goes, but stores everything in memory instead, so memory usage grows as you progress through the genome. In my case (2.7 Gb genome, ~100x coverage) it eventually takes more than 300 GB of RAM. As suggested above, one workaround is to split the genome into individual chromosomes and then run pilon on each of them separately:

GENOMEFA=<path to the genome>
OUTDIR=pilon_out
PILONJAR=<path to pilon .jar>
# Raise the maximum Java heap size; without the export the JVM never sees it
# and the run crashes.
export JAVA_TOOL_OPTIONS="-Xmx200G -Xss2560k"

####### Split the genome into single chromosomes
bioawk -c fastx '{print $name}' $GENOMEFA > nms
mkdir -p split
samtools faidx $GENOMEFA
for CHR in $(cat nms); do
    samtools faidx $GENOMEFA $CHR > split/$CHR.fa
done
# This produces a split/ directory with an individual FASTA for each chromosome.

####### Run pilon on each chromosome
mkdir -p $OUTDIR
ls split/*.fa > toprocess
for CHRFILE in $(cat toprocess); do
    echo $CHRFILE
    CHRNAME=$(basename $CHRFILE | cut -f 1 -d '.')
    CHROUTDIR=$OUTDIR/$CHRNAME
    java -jar $PILONJAR --nostrays --vcf --tracks --changes \
        --genome $CHRFILE --output $CHRNAME --outdir $CHROUTDIR \
        --fix snps --frags dnaseq/SRR7898210.sorted.bam   # swap in your own DNA-seq BAM(s)
done

This will produce a pilon_out directory with output files for each chromosome. You can then concatenate the polished FASTAs as follows:

cat pilon_out/*/*.fasta > whole_genome_pilon.fasta
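
If you also keep the --changes and --vcf outputs, those can be merged as well; a sketch (a plain cat would repeat the VCF header for every chromosome, so keep only the first one):

cat pilon_out/*/*.changes > whole_genome_pilon.changes
FIRST=$(ls pilon_out/*/*.vcf | head -n 1)
(grep '^#' "$FIRST"; grep -hv '^#' pilon_out/*/*.vcf) > whole_genome_pilon.vcf
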
xiexr commented 4 years ago

Thanks for your suggestion!


pallevillesen commented 3 years ago

@zolotarovgl Thank you! Just saved me a lot of work on our cluster when polishing 3 Gbp genomes with ~50x BAMs.