fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License

Memory issues with GroupReadsByUmi and consensus calling in latest devel #256

Closed chapmanb closed 7 years ago

chapmanb commented 7 years ago

Tim and Nils; Thanks for all the recent work supporting duplex reads. I've been working on adding support for this into bcbio and updated to the latest development version to get the duplex fixes in place since 1.4. Everything runs cleanly on smaller test samples, but on a reasonably sized test sample (5Gb inputs) I'm hitting memory problems even when allocating lots of Java memory. I see this with both duplex and standard UMI consensus creation. For completeness, here is the command I'm running, which pipes GroupReadsByUmi into consensus calling and fastq creation:

fgbio -Xms500m -Xmx22g -XX:+UseSerialGC -Djava.io.tmpdir=/gpfs/ngs/oncology/Analysis/dev/Dev_0356_HiSeq4000_TechDevControl_Watson/bcbio_umi/work/bcbiotx/tmp7l2SPs -Dsamjdk.use_async_io_read_samtools=true -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.compression_level=0 GroupReadsByUmi --edits 1 --min-map-q 1 -t XC -s paired -i /gpfs/ngs/oncology/Analysis/dev/Dev_0356_HiSeq4000_TechDevControl_Watson/bcbio_umi/work/align/ILS34875D1/ILS34875D1-sort.bam | fgbio -Xms500m -Xmx22g -XX:+UseSerialGC -Djava.io.tmpdir=/gpfs/ngs/oncology/Analysis/dev/Dev_0356_HiSeq4000_TechDevControl_Watson/bcbio_umi/work/bcbiotx/tmp7l2SPs -Dsamjdk.use_async_io_read_samtools=true -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.compression_level=0 CallDuplexConsensusReads --min-input-base-quality 2 --sort-order=unsorted -i /dev/stdin -o /dev/stdout | bamtofastq collate=1 T=/gpfs/ngs/oncology/Analysis/dev/Dev_0356_HiSeq4000_TechDevControl_Watson/bcbio_umi/work/align/ILS34875D1/ILS34875D1-sort-cumi-1-bamtofastq-tmp F=/gpfs/ngs/oncology/Analysis/dev/Dev_0356_HiSeq4000_TechDevControl_Watson/bcbio_umi/work/bcbiotx/tmp7l2SPs/ILS34875D1-sort-cumi-1.fq.gz F2=/gpfs/ngs/oncology/Analysis/dev/Dev_0356_HiSeq4000_TechDevControl_Watson/bcbio_umi/work/bcbiotx/tmp7l2SPs/ILS34875D1-sort-cumi-2.fq.gz tags=cD,cM,cE gz=1

The memory errors are not very helpful for isolating the issue, as they all appear to come from command line parsing, which doesn't seem right. I've included a few of them below in case they are more informative to you than to me. In digging through the fgbio changes I've noticed the big update to improve BAM reads/writes. Is there anything I'm doing incorrectly that interacts with that?

Thanks for any suggestions.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
    at java.lang.StringBuilder.<init>(StringBuilder.java:89)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor.indentAndWrap(MarkDownProcessor.scala:169)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor.$anonfun$toText$1(MarkDownProcessor.scala:77)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor$$Lambda$336/1265287441.apply(Unknown Source)
    at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor.toText(MarkDownProcessor.scala:77)
    at com.fulcrumgenomics.sopt.cmdline.ClpArgumentDefinitionPrinting$.printArgumentUsage(ClpArgumentDefinitionPrinting.scala:151)
    at com.fulcrumgenomics.sopt.cmdline.ClpArgumentDefinitionPrinting$.printArgumentDefinitionUsage(ClpArgumentDefinitionPrinting.scala:58)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser.$anonfun$usage$5(CommandLineProgramParser.scala:242)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser.$anonfun$usage$5$adapted(CommandLineProgramParser.scala:241)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser$$Lambda$321/906448455.apply(Unknown Source)
    at scala.collection.immutable.List.foreach(List.scala:389)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser.usage(CommandLineProgramParser.scala:241)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineParser.parseCommandAndSubCommand(CommandLineParser.scala:411)
    at com.fulcrumgenomics.sopt.Sopt$.parseCommandAndSubCommand(Sopt.scala:156)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:83)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:74)
    at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:45)
    at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:177)
    at scala.collection.mutable.ListBuffer.$plus$eq(ListBuffer.scala:44)
    at scala.collection.generic.Growable.$anonfun$$plus$plus$eq$1(Growable.scala:59)
    at scala.collection.generic.Growable$$Lambda$7/1147258851.apply(Unknown Source)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:59)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:50)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:186)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:44)
    at scala.collection.mutable.BufferLike.appendAll(BufferLike.scala:151)
    at scala.collection.mutable.BufferLike.appendAll$(BufferLike.scala:151)
    at scala.collection.mutable.AbstractBuffer.appendAll(Buffer.scala:49)
    at scala.collection.mutable.BufferLike.append(BufferLike.scala:146)
    at scala.collection.mutable.BufferLike.append$(BufferLike.scala:146)
    at scala.collection.mutable.AbstractBuffer.append(Buffer.scala:49)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor.indentAndWrap(MarkDownProcessor.scala:169)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor.$anonfun$toText$1(MarkDownProcessor.scala:77)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor$$Lambda$336/1265287441.apply(Unknown Source)
    at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor.toText(MarkDownProcessor.scala:77)
    at com.fulcrumgenomics.sopt.cmdline.ClpArgumentDefinitionPrinting$.printArgumentUsage(ClpArgumentDefinitionPrinting.scala:151)
    at com.fulcrumgenomics.sopt.cmdline.ClpArgumentDefinitionPrinting$.printArgumentDefinitionUsage(ClpArgumentDefinitionPrinting.scala:58)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser.$anonfun$usage$5(CommandLineProgramParser.scala:242)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser.$anonfun$usage$5$adapted(CommandLineProgramParser.scala:241)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser$$Lambda$321/906448455.apply(Unknown Source)
    at scala.collection.immutable.List.foreach(List.scala:389)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser.usage(CommandLineProgramParser.scala:241)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineParser.parseCommandAndSubCommand(CommandLineParser.scala:411)
    at com.fulcrumgenomics.sopt.Sopt$.parseCommandAndSubCommand(Sopt.scala:156)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:83)

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
    at java.lang.StringBuilder.toString(StringBuilder.java:407)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor.indentAndWrap(MarkDownProcessor.scala:169)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor.$anonfun$toText$1(MarkDownProcessor.scala:77)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor$$Lambda$336/1265287441.apply(Unknown Source)
    at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
    at com.fulcrumgenomics.sopt.util.MarkDownProcessor.toText(MarkDownProcessor.scala:77)
    at com.fulcrumgenomics.sopt.cmdline.ClpArgumentDefinitionPrinting$.printArgumentUsage(ClpArgumentDefinitionPrinting.scala:151)
    at com.fulcrumgenomics.sopt.cmdline.ClpArgumentDefinitionPrinting$.printArgumentDefinitionUsage(ClpArgumentDefinitionPrinting.scala:58)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser.$anonfun$usage$5(CommandLineProgramParser.scala:242)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser.$anonfun$usage$5$adapted(CommandLineProgramParser.scala:241)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser$$Lambda$321/906448455.apply(Unknown Source)
    at scala.collection.immutable.List.foreach(List.scala:389)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineProgramParser.usage(CommandLineProgramParser.scala:241)
    at com.fulcrumgenomics.sopt.cmdline.CommandLineParser.parseCommandAndSubCommand(CommandLineParser.scala:411)
    at com.fulcrumgenomics.sopt.Sopt$.parseCommandAndSubCommand(Sopt.scala:156)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:83)
    at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:74)
    at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:45)
    at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)
tfenne commented 7 years ago

Hey @chapmanb - I think your intuition is right that there's something strange going on here. A few thoughts:

  1. The latest (master) version of GroupReadsByUmi has two places that use significant memory. The first is sorting, but that is limited to 1m records in RAM at a time, so BAMs with records of similar size and more than 1m reads should all use about the same amount of memory. The second place is grouping, where it pulls all the reads with the same template start/end into a group. But I'd be shocked to find you had > 1m reads all sharing the same start/stop and only disambiguated by UMIs.
  2. Looking at the stack traces, the errors really are coming out of the command line parsing, which is super odd.
  3. I noticed the command starts with fgbio -Xms500m -Xmx22g, does that definitely get expanded to java -Xms500m -Xmx22g -jar fgbio.jar? I'm guessing it must otherwise you'd get some other failure message, but I had to ask.
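
As a rough illustration of point 1 (this is a hedged sketch, not fgbio's actual implementation; all names here are hypothetical), the bounded-memory idea is that reads are bucketed by template start/end, so only one positional bucket needs UMI disambiguation at a time:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (NOT fgbio's actual code) of the memory model above:
// reads sharing a template start/end are grouped together, then sub-grouped
// by UMI within an allowed number of mismatches.
public class GroupSketch {
    static final class Read {
        final int start, end; final String umi;
        Read(int start, int end, String umi) { this.start = start; this.end = end; this.umi = umi; }
    }

    // Hamming distance over equal-length UMIs.
    static int dist(String a, String b) {
        int d = 0;
        for (int i = 0; i < a.length(); i++) if (a.charAt(i) != b.charAt(i)) d++;
        return d;
    }

    // Bucket by (start, end), then greedily cluster each bucket by UMI,
    // allowing up to `edits` mismatches (a simplified stand-in for the
    // real adjacency/paired strategies).
    static List<List<Read>> group(List<Read> reads, int edits) {
        Map<String, List<Read>> byPos = new LinkedHashMap<>();
        for (Read r : reads) {
            byPos.computeIfAbsent(r.start + ":" + r.end, k -> new ArrayList<>()).add(r);
        }
        List<List<Read>> out = new ArrayList<>();
        for (List<Read> bucket : byPos.values()) {
            List<List<Read>> clusters = new ArrayList<>();
            for (Read r : bucket) {
                List<Read> home = null;
                for (List<Read> c : clusters) {
                    if (dist(c.get(0).umi, r.umi) <= edits) { home = c; break; }
                }
                if (home == null) { home = new ArrayList<>(); clusters.add(home); }
                home.add(r);
            }
            out.addAll(clusters);
        }
        return out;
    }
}
```

The key point is that memory scales with the largest positional bucket, not with the whole file, which is why a pathological input would need over a million reads at one start/stop to blow the heap.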

I'd be really curious to see how the command line gets re-written and executed. The only way I can think that it would run out of memory in this weird way would be if perhaps somehow the command line gets re-written to cause the input BAM's contents to be read as a command line parameter, but that seems far fetched.

Any chance you could either:

  1. Try running this BAM by hand on the command line with a simplified command line like java -Xmx4g -jar fgbio.jar GroupReadsByUmi -i in.bam -o grouped.bam -s paired and see if you get the same problem?
  2. Share the BAM with us? I understand this may not be possible, but if it is I can provide a private/secure place to upload it
chapmanb commented 7 years ago

Tim; Thanks much for the thoughts and directions, this really helped focus on what might be going on. The fgbio wrapper does correctly pass through options, and in the process of simplifying I realized the issue is due to the length of the temporary directory specification. It is totally bizarre, but if you have a tempdir setting like:

-Djava.io.tmpdir=/testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttestt/tmptest12

It'll run fine (well, until it discovers the temporary directory doesn't exist). However if you switch to:

-Djava.io.tmpdir=/testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttestt/tmptest123

then it explodes due to memory usage issues. I can replicate this on different machines with different BAM files and it doesn't appear to be due to input size since it failed with a 4Mb BAM input.

Are you able to reproduce this issue if you try running GroupReadsByUmi with the second longer tmpdir that fails? If not I can put together a full example. Sorry this is so strange but hope it gives some clues about what might be going on.

tfenne commented 7 years ago

I can reproduce from here. I think I see what the problem is. Should have a fix shortly.

tfenne commented 7 years ago

Ok, I have a fix, but @nh13 is offline and I'd like him to review before I merge. The short version is that it has nothing to do with the BAM; it's a bug in some unrelated code. As part of 0.2.0 we transitioned to writing all our usage and argument documentation in MarkDown, and we now have a system to parse it and emit both the website documentation and the plain text that comes out when we print usage to the console. The problem was a bug in the conversion to plain text that couldn't handle word-wrapping correctly when a single word was longer than the maximum line length (oops!). When that happened it just kept adding more and more blank lines until it ran out of memory.
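
The failure mode described above can be sketched with a toy word-wrapper (in Java for illustration; the names are hypothetical and this is not the actual sopt code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy greedy word-wrapper illustrating the bug class described above.
// Hypothetical names; NOT the actual sopt implementation. The bug was that
// a word longer than the wrap width was never consumed, so the wrapper
// accumulated blank lines forever until the heap filled. The fix is to
// emit an over-long word on its own line and move on.
public class WrapDemo {
    static List<String> wrap(String text, int width) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (current.length() == 0) {
                // An over-long word goes on its own line rather than being
                // retried forever (the original failure mode).
                current.append(word);
            } else if (current.length() + 1 + word.length() <= width) {
                current.append(' ').append(word);
            } else {
                lines.add(current.toString());
                current.setLength(0);
                current.append(word);
            }
        }
        if (current.length() > 0) lines.add(current.toString());
        return lines;
    }
}
```

With a long `java.io.tmpdir` feeding the usage text, a single unbreakable path token is exactly the input that trips the over-long-word case.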

The other part to this is that, also in 0.2.0, we've introduced global options to fgbio. I don't know if you noticed but you can now write things like: fgbio --tmp-dir=/tmp --compression=0 --async-io GroupReadsByUmi -i my.bam -o out.bam

And the default value for --tmp-dir is the system property java.io.tmpdir. So if you set that to something really long, and cause the usage to be printed, it triggered this bug.

I'm not sure why, but I think there must have been something in your command line(s) that was not quite right, causing it to try and print the usage.

The fix is a separate project: https://github.com/fulcrumgenomics/sopt/pull/7 . Once that PR is merged you should be able to simply rebuild fgbio and it should pull in the updated snapshot.

nh13 commented 7 years ago

@chapmanb @tfenne I merged it into sopt, so depending on how long it takes to get to our snapshots repo, it should be as simple as rebuilding fgbio (run sbt update && sbt assembly).

chapmanb commented 7 years ago

Tim and Nils; Thanks so much for looking at this and the quick fix. Do you know if the sopt snapshot has gotten updated (or is there a way I can monitor)? I tried re-building everything this morning but am still running into the same memory usage issues with long tmpdir specifications. I'm not sure if I'm missing it, there is still an issue, or if I'm just not patient enough. Thanks again.

tfenne commented 7 years ago

It usually doesn't take very long to propagate. The snapshots for sopt are here and would indicate that it updated within an hour or two of Nils' comment.

Automatic dependency management and snapshots seem to run into trouble from time to time and it's a PITA to figure out. Probably the easiest solution is to check out sopt and run sbt publishLocal, and then in fgbio run sbt clean assembly. Nils is also in the middle of making the 0.2.0 releases, which should be available soon too if you continue to run into problems.

chapmanb commented 7 years ago

Tim -- thanks for the details. Either I can't get the snapshot to update or it doesn't fix the issue, but I wasn't able to get a working build. I could work around it by using the new global options instead of the Java-specific ones, so I switched over to doing that to get past the immediate problem. I'm happy to check again once the 0.2.0 releases are available, and will update the conda packages to use them. Thanks again for all the help.

nh13 commented 7 years ago

@chapmanb we have released a new version of fgbio. Would you mind trying the JAR attached to the release?

chapmanb commented 7 years ago

Nils and Tim -- thanks so much. I updated and tested this and the 0.2.0 release does resolve the issue. Thanks again for the help fixing this.