lachlancoin opened 5 years ago
I wonder what the bam file size was in your case?
At the moment the whole content of the uploaded file is fetched into process memory.
As a straightforward workaround, you can try specifying a Dataflow worker machine type with more memory using the following option:
--workerMachineType=n1-highmem-4
n1-highmem-4 has 26 GB of RAM, while the default Dataflow worker machine for streaming mode is n1-standard-4 with 15 GB of RAM.
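For reference, the same setting can also be made programmatically when the pipeline is launched from Java. This is a minimal sketch using Beam's standard `DataflowPipelineOptions`; the class name and structure here are illustrative, not the actual nanostream-dataflow entry point:

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Illustrative launcher snippet showing where --workerMachineType ends up.
public class LaunchWithHighMemWorkers {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);

    // Equivalent to passing --workerMachineType=n1-highmem-4 on the command line.
    options.setWorkerMachineType("n1-highmem-4");

    // ... build the pipeline with these options and run it ...
  }
}
```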
So using smaller bams solved the memory issue, but it still seems to be going down the bwa-mem route. I also tried uploading SAM files, but with no more luck. I couldn't figure out how the code base distinguishes a fastq input from a bam/sam input; the pipeline seems to be the same in either case.
Currently this feature is in the separate git branch bam_files; from the stack trace it looks like you are using code compiled from the master branch.
Oh yes, sorry. Using that branch I get the following error many times:
java.lang.StringIndexOutOfBoundsException: String index out of range: 23
    at java.lang.String.substring(String.java:1963)
    at com.google.allenday.nanostream.pubsub.GCSSourceData.fromGCloudNotification(GCSSourceData.java:49)
    at com.google.allenday.nanostream.gcs.ParseGCloudNotification.processElement(ParseGCloudNotification.java:16)
Seems to be something to do with the location of the sam files?
The object location is objectId=Uploads/ICUNEW/out.sam, which shouldn't cause any issues. It looks like a problem with a trailing /, but I can't find one in this case.
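For illustration only (this is not the actual `GCSSourceData` parser), the kind of trailing-slash assumption discussed above is enough to ask `substring` for an index one past the end of a 22-character object id, which matches the `String index out of range: 23` message:

```java
// Hypothetical reconstruction of the failure mode, not the library code.
public class TrailingSlashDemo {
  public static void main(String[] args) {
    String objectId = "Uploads/ICUNEW/out.sam";        // 22 characters, no trailing '/'

    // A parser that expects folder-style ids ending in '/' might take the
    // "folder" part as everything up to and including that assumed slash:
    int assumedEnd = objectId.length() + 1;            // 23, one past the end
    String folder = objectId.substring(0, assumedEnd); // StringIndexOutOfBoundsException: 23
    System.out.println(folder);
  }
}
```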
I believe I've made an off-by-one error, I'll do a fix now.
That seems better, but now it just seems to get stuck at the GroupBySamReference step: about 5.8 MB of input and no output.
On Wed, 13 Mar 2019 at 22:57, Alexander Bushkovsky <notifications@github.com> wrote:

> Fixed:
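As a general note on the GroupBySamReference symptom above (input accumulating, no output): in a Beam streaming pipeline, a `GroupByKey` sitting in the global window with the default trigger never emits, so the window/trigger upstream of the group step is one common thing to check. This is a generic Beam sketch with illustrative names, not the pipeline's actual code:

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowBeforeGroupDemo {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for reads keyed by SAM reference name.
    PCollection<KV<String, String>> keyedReads =
        p.apply(Create.of(Arrays.asList(KV.of("chr1", "read1"), KV.of("chr1", "read2"))));

    // Without a window or trigger, a streaming GroupByKey in the global window
    // keeps buffering and never produces output; windowing lets it fire.
    keyedReads
        .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardSeconds(30))))
        .apply(GroupByKey.<String, String>create());

    p.run();
  }
}
```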
I am testing the bam input option and getting the following error:
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at com.google.api.client.util.ByteStreams.copy(ByteStreams.java:55)
    at com.google.api.client.util.IOUtils.copy(IOUtils.java:94)
    at com.google.api.client.util.IOUtils.copy(IOUtils.java:63)
    at com.google.api.client.http.HttpResponse.download(HttpResponse.java:421)
    at com.google.cloud.storage.spi.v1.HttpStorageRpc.load(HttpStorageRpc.java:585)
    at com.google.cloud.storage.StorageImpl$16.call(StorageImpl.java:464)
    at com.google.cloud.storage.StorageImpl$16.call(StorageImpl.java:461)
    at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
    at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
    at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
    at com.google.cloud.storage.StorageImpl.readAllBytes(StorageImpl.java:461)
    at com.google.cloud.storage.Blob.getContent(Blob.java:478)
    at com.google.allenday.nanostream.gcs.GetDataFromFastQFile.processElement(GetDataFromFastQFile.java:37)
    at com.google.allenday.nanostream.gcs.GetDataFromFastQFile$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:272)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:309)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:77)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:621)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:609)
    at com.google.allenday.nanostream.gcs.ParseGCloudNotification.processElement(ParseGCloudNotification.java:16)
    at com.google.allenday.nanostream.gcs.ParseGCloudNotification$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
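On the heap-space trace above: the OOM happens under `Blob.getContent()`, which buffers the whole object in a byte array. For reference, the google-cloud-storage client also exposes a streaming `ReadChannel` that lets a worker process an object in fixed-size chunks; this is a minimal standalone sketch with illustrative bucket/object names, not the pipeline's `GetDataFromFastQFile` code:

```java
import com.google.cloud.ReadChannel;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.nio.ByteBuffer;

public class StreamedGcsRead {
  public static void main(String[] args) throws IOException {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    Blob blob = storage.get("my-bucket", "Uploads/ICUNEW/out.sam"); // illustrative names

    // blob.getContent() would pull the whole object into one byte[];
    // a ReadChannel reads it in bounded chunks instead.
    try (ReadChannel reader = blob.reader()) {
      ByteBuffer chunk = ByteBuffer.allocate(64 * 1024);
      while (reader.read(chunk) > 0) {
        chunk.flip();
        // ... hand the chunk to downstream parsing here ...
        chunk.clear();
      }
    }
  }
}
```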