allenday / nanostream-dataflow

real-time stream processing of DNA nanopore sequencer reads with dataflow
MIT License

out of memory error #102

Open lachlancoin opened 5 years ago

lachlancoin commented 5 years ago

I am testing the BAM input option and getting the following error:

java.lang.OutOfMemoryError: Java heap space
    java.util.Arrays.copyOf(Arrays.java:3236)
    java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    com.google.api.client.util.ByteStreams.copy(ByteStreams.java:55)
    com.google.api.client.util.IOUtils.copy(IOUtils.java:94)
    com.google.api.client.util.IOUtils.copy(IOUtils.java:63)
    com.google.api.client.http.HttpResponse.download(HttpResponse.java:421)
    com.google.cloud.storage.spi.v1.HttpStorageRpc.load(HttpStorageRpc.java:585)
    com.google.cloud.storage.StorageImpl$16.call(StorageImpl.java:464)
    com.google.cloud.storage.StorageImpl$16.call(StorageImpl.java:461)
    com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
    com.google.cloud.RetryHelper.run(RetryHelper.java:76)
    com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
    com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:461)
    com.google.cloud.storage.Blob.getContent(Blob.java:478)
    com.google.allenday.nanostream.gcs.GetDataFromFastQFile.processElement(GetDataFromFastQFile.java:37)
    com.google.allenday.nanostream.gcs.GetDataFromFastQFile$DoFnInvoker.invokeProcessElement(Unknown Source)
    org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
    org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:272)
    org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:309)
    org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:77)
    org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:621)
    org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:609)
    com.google.allenday.nanostream.gcs.ParseGCloudNotification.processElement(ParseGCloudNotification.java:16)
    com.google.allenday.nanostream.gcs.ParseGCloudNotification$DoFnInvoker.invokeProcessElement(Unknown Source)
    org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)

obsh commented 5 years ago

What was the size of the BAM file in your case? At the moment the whole content of the uploaded file is fetched into process memory. As a straightforward workaround, you can give the Dataflow workers more memory with the following option:

--workerMachineType=n1-highmem-4

n1-highmem-4 has 26 GB of RAM, while the default Dataflow worker machine for streaming mode, n1-standard-4, has 15 GB.
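The stack trace above ends in Blob.getContent, which copies the entire object into a byte array, so heap use grows with file size. A minimal sketch of the alternative, assuming the input is plain FASTQ text with 4 lines per record: read from an InputStream and hand records to a callback one at a time. The class and method names here are hypothetical and are not part of nanostream-dataflow; the sketch uses only the JDK and feeds the reader from an in-memory stream. With google-cloud-storage, such a stream could presumably be obtained via Channels.newInputStream(blob.reader()), but that wiring is an assumption and is not shown.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not code from nanostream-dataflow.
class StreamingFastqReader {

    // Reads 4-line FASTQ records one at a time and passes each to the consumer,
    // so memory use is bounded by the record size rather than the file size.
    // Returns the number of records read.
    static long forEachRecord(InputStream in, Consumer<String[]> consumer) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String header;
            while ((header = reader.readLine()) != null) {
                if (header.isEmpty()) {
                    continue; // tolerate blank lines between records
                }
                String sequence = reader.readLine();
                String separator = reader.readLine();
                String quality = reader.readLine();
                if (sequence == null || separator == null || quality == null) {
                    throw new IOException("Truncated FASTQ record after: " + header);
                }
                consumer.accept(new String[] {header, sequence, separator, quality});
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a stream opened on an uploaded file.
        String fastq = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n";
        InputStream in = new ByteArrayInputStream(fastq.getBytes(StandardCharsets.UTF_8));
        long n = forEachRecord(in, rec -> System.out.println(rec[0] + " " + rec[1]));
        System.out.println(n + " records");
    }
}
```

This only addresses text input; BAM is a compressed binary format and would need a dedicated reader, so a larger worker machine remains the simpler workaround there.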

lachlancoin commented 5 years ago

Using smaller BAMs solved the memory issue, but the pipeline still seems to be going down the bwa-mem route. I also tried uploading SAM files, with no more luck. I couldn't figure out how the code base distinguishes a FASTQ input from a BAM/SAM input; the pipeline seems to be the same in either case.

obsh commented 5 years ago

Currently this feature is in a separate git branch, bam_files. From the stack trace it looks like you are using code compiled from the master branch.

lachlancoin commented 5 years ago

Oh yes, sorry. Using that branch I get the following error many times:

java.lang.StringIndexOutOfBoundsException: String index out of range: 23
    java.lang.String.substring(String.java:1963)
    com.google.allenday.nanostream.pubsub.GCSSourceData.fromGCloudNotification(GCSSourceData.java:49)
    com.google.allenday.nanostream.gcs.ParseGCloudNotification.processElement(ParseGCloudNotification.java:16)

Seems to be something to do with the location of the sam files?

lachlancoin commented 5 years ago

The object location is objectId=Uploads/ICUNEW/out.sam, which shouldn't cause any issues. It looks like a problem with a trailing /, but I can't find one in this case.

obsh commented 5 years ago

I believe I've made an off-by-one error; I'll fix it now.
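For context, the object name Uploads/ICUNEW/out.sam is 22 characters long, so the reported index 23 is one past the end of the string, which fits an off-by-one in a substring call. The thread does not show the code in GCSSourceData.java, so the following is only a sketch of one safe way to split an object name into folder and file name; the class ObjectNameParts and its fields are hypothetical.

```java
// Hypothetical helper illustrating a safe split of a GCS object name;
// this is not the code in GCSSourceData.java.
class ObjectNameParts {
    final String folder;   // everything before the last '/', or "" if there is none
    final String fileName; // everything after the last '/'

    ObjectNameParts(String folder, String fileName) {
        this.folder = folder;
        this.fileName = fileName;
    }

    static ObjectNameParts parse(String objectId) {
        int slash = objectId.lastIndexOf('/');
        if (slash < 0) {
            // No folder component at all.
            return new ObjectNameParts("", objectId);
        }
        // substring(begin, end) excludes the character at `end`, so `slash` is
        // the end index for the folder and `slash + 1` the start of the file name.
        // Neither index can exceed objectId.length().
        return new ObjectNameParts(objectId.substring(0, slash), objectId.substring(slash + 1));
    }

    public static void main(String[] args) {
        ObjectNameParts parts = parse("Uploads/ICUNEW/out.sam");
        System.out.println("folder=" + parts.folder + " fileName=" + parts.fileName);
    }
}
```

Deriving both indices from lastIndexOf keeps them within the string, including for names with a trailing / (which yield an empty file name rather than an exception).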

obsh commented 5 years ago

Fixed: https://github.com/allenday/nanostream-dataflow/blob/bam_files/NanostreamDataflowMain/src/main/java/com/google/allenday/nanostream/pubsub/GCSSourceData.java#L49

lachlancoin commented 5 years ago

That seems better, but now it just seems to get stuck at the GroupBySamReference step: about 5.8 MB of input and no output.
