broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

StructuralVariationDiscoveryPipelineSpark Aborts #5942

Closed: jjfarrell closed this 3 years ago

jjfarrell commented 5 years ago

Bug Report

Affected tool(s) or class(es)

StructuralVariationDiscoveryPipelineSpark

Affected version(s)

GATK 4.1.2.0

Description

At the end of the run on a Hadoop cluster, the job aborts:

services=List(), started=false)
2019-05-14 17:07:05 INFO  YarnClientSchedulerBackend:54 - Stopped
2019-05-14 17:07:05 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-05-14 17:07:05 INFO  MemoryStore:54 - MemoryStore cleared
2019-05-14 17:07:05 INFO  BlockManager:54 - BlockManager stopped
2019-05-14 17:07:05 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2019-05-14 17:07:05 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-05-14 17:07:05 INFO  SparkContext:54 - Successfully stopped SparkContext
17:07:05.631 INFO  StructuralVariationDiscoveryPipelineSpark - Shutting down engine
[May 14, 2019 5:07:05 PM EDT] org.broadinstitute.hellbender.tools.spark.sv.StructuralVariationDiscoveryPipelineSpark done. Elapsed time: 41.02 minutes.
Runtime.totalMemory()=23321378816
java.lang.IllegalArgumentException: Wrong FS: hdfs://scc:-1/project/casa/gcad/adsp.cc/sv/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.contig-sam-file.sam, expected: hdfs://scc
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:105)
        at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:397)
        at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:393)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:393)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:337)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
        at hdfs.jsr203.HadoopFileSystem.newOutputStream(HadoopFileSystem.java:554)
        at hdfs.jsr203.HadoopFileSystem.newByteChannel(HadoopFileSystem.java:395)
        at hdfs.jsr203.HadoopPath.newByteChannel(HadoopPath.java:558)
        at hdfs.jsr203.HadoopFileSystemProvider.newByteChannel(HadoopFileSystemProvider.java:146)
        at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
        at java.nio.file.Files.newOutputStream(Files.java:216)
        at htsjdk.samtools.SAMFileWriterFactory.makeSAMWriter(SAMFileWriterFactory.java:352)
        at htsjdk.samtools.SAMFileWriterFactory.makeSAMOrBAMWriter(SAMFileWriterFactory.java:437)
        at org.broadinstitute.hellbender.tools.spark.sv.utils.SVFileUtils.writeSAMFile(SVFileUtils.java:29)
        at org.broadinstitute.hellbender.tools.spark.sv.evidence.AlignedAssemblyOrExcuse.writeAssemblySAMFile(AlignedAssemblyOrExcuse.java:336)
        at org.broadinstitute.hellbender.tools.spark.sv.evidence.FindBreakpointEvidenceSpark.gatherEvidenceAndWriteContigSamFile(FindBreakpointEvidenceSpark.java:199)
        at org.broadinstitute.hellbender.tools.spark.sv.StructuralVariationDiscoveryPipelineSpark.runTool(StructuralVariationDiscoveryPipelineSpark.java:164)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:528)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2019-05-14 17:07:05 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-05-14 17:07:05 INFO  ShutdownHookManager:54 - Deleting directory /restricted/projectnb/casa/wgs.hg38/sv/gatk.sv/tmp/spark-45f7a9f3-b94f-4040-bf32-0dbfe44f8f68
2019-05-14 17:07:05 INFO  ShutdownHookManager:54 - Deleting directory /restricted/projectnb/casa/wgs.hg38/sv/gatk.sv/tmp/spark-70db8953-5dec-4eb8-910d-f0abd7e1c42b

real    41m12.118s
user    83m41.069s
sys     10m15.403s

Steps to reproduce

gatk --java-options "-Djava.io.tmpdir=tmp" \
  StructuralVariationDiscoveryPipelineSpark \
  -R $REF \
  --aligner-index-image GRCh38_full_analysis_set_plus_decoy_hla.fa.img \
  --kmers-to-ignore GRCh38_ignored_kmers.txt \
  --contig-sam-file hdfs:///project/casa/gcad/$CENTER/sv/$SAMPLE.contig-sam-file.sam \
  -I $CRAM_DIR/$SAMPLE.cram \
  -O hdfs:///project/casa/gcad/$CENTER/sv/$SAMPLE.sv.vcf.gz \
  -- \
  --spark-runner SPARK --spark-master yarn --deploy-mode client \
  --executor-memory 85G \
  --driver-memory 30g \
  --num-executors 40 \
  --executor-cores 4 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --name "$SAMPLE" \
  --files $REF.img,$KMER \
  --conf spark.yarn.executor.memoryOverhead=5000 \
  --conf spark.network.timeout=600 \
  --conf spark.executor.heartbeatInterval=120

Expected behavior

Should complete and write output files.

Actual behavior

The job aborts after running 45 min, and no output files are written. The error message refers to a filename that is not actually passed as a parameter to the tool: hdfs://scc:-1/. Not sure where the -1 is coming from.
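
A minimal sketch of one way that -1 can arise (a hypothetical illustration, not necessarily GATK's actual code path): java.net.URI reports an absent port as -1, so if something along the way rebuilds the authority component as host + ":" + port from a URI that never specified a port, the result is exactly "scc:-1".

import java.net.URI;

public class UriPortDemo {
    public static void main(String[] args) {
        // "hdfs:///..." (three slashes) has an empty authority: no host, no port.
        URI noAuthority = URI.create("hdfs:///project/casa/gcad/adsp.cc/sv/sample.sam");
        System.out.println(noAuthority.getAuthority()); // null
        System.out.println(noAuthority.getPort());      // -1

        // Even once a default host ("scc") is filled in, an unspecified
        // port is still reported as -1, so naively rebuilding "host:port"
        // reproduces the "scc:-1" seen in the exception message.
        URI hostOnly = URI.create("hdfs://scc/project/casa/gcad/adsp.cc/sv/sample.sam");
        System.out.println(hostOnly.getHost() + ":" + hostOnly.getPort()); // scc:-1
    }
}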


SHuang-Broad commented 5 years ago

@jjfarrell Thanks for reporting!

I'm a bit confused by the input and output paths you provided:

hdfs:///project/casa/gcad/$CENTER/sv/$SAMPLE.contig-sam-file.sam
hdfs:///project/casa/gcad/$CENTER/sv/$SAMPLE.sv.vcf.gz 

Specifically, there are three slashes after "hdfs:". Can you confirm that these are the intended paths?

Thanks!

jjfarrell commented 5 years ago

@SHuang-Broad

To indicate that the output directory is on our Hadoop cluster, the command line needs to use the hdfs:// prefix; otherwise, the tool tries to write to the local file system. For Spark jobs this is usually not necessary (hdfs is the default). In contrast, the bam file parameter does not need the hdfs:// prefix.

That filename comes from a bash script; after variable substitution the directory path becomes: hdfs:///project/casa/gcad/adsp.cc/sv

The directory is on the hadoop cluster:

hadoop fs -ls /project/casa/gcad/adsp.cc/
drwxrwxr-x - farrell casa 0 2018-05-08 15:21 /project/casa/gcad/adsp.cc/sv

So it looks like the intended path.

SHuang-Broad commented 5 years ago

Then I think this might be due to some assumptions made by IOUtils.getPath(String ...) being violated.

Can we get input from one of you, @cmnbroad @droazen?

cmnbroad commented 5 years ago

@jjfarrell Do you recognize "scc" as a local host name? "hdfs:///project/casa/gcad/adsp.cc/sv" looks reasonable enough as a file URI, except that the hadoop file system provider requires an authority component (the part of the URI between the second and third slash: "hdfs://authority-component/...") in such URIs. Since you didn't include one as part of the hdfs path on the command line, it looks like a transform along the way resulted in one being added (the authority component looks like "host:port"), producing the port number -1. So I'm not clear whether it's a configuration issue, a bad code path, or both. But I would suggest trying an hdfs path with a valid authority component (one that works with the hadoop shell).
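
One way to find a valid authority is to ask the Hadoop client itself. Here is a small sketch (assuming the cluster's core-site.xml/hdfs-site.xml are on the classpath; the printed values depend entirely on your cluster config, and "hdfs://scc:8020" below is just a placeholder for what it might report):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowDefaultFs {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from the cluster config, e.g. hdfs://scc:8020
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS = " + FileSystem.getDefaultUri(conf));

        // makeQualified fills in the missing scheme and authority; the
        // qualified form is what is worth trying on the GATK command line.
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/project/casa/gcad/adsp.cc/sv");
        System.out.println("qualified = " + fs.makeQualified(p));
    }
}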

@SHuang-Broad I do see some code paths in StructuralVariationDiscoveryPipelineSpark that call Paths.get directly, rather than IOUtils.getPath(). I would also suggest replacing the direct calls to makeSAMOrBAMWriter in SVFileUtils with the GATK wrapper code.
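
For anyone following along, here is a sketch of why that difference matters (illustrative only, not the actual GATK code; "namenode:8020" is a placeholder authority): Paths.get(String) always resolves against the default local file system, while Paths.get(URI) dispatches to whatever NIO provider is registered for the URI's scheme.

import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathsGetDemo {
    public static void main(String[] args) {
        // Paths.get(String) treats the argument as a plain local path, so
        // an hdfs URI string never reaches the hdfs NIO provider at all.
        Path local = Paths.get("hdfs:///project/casa/gcad/adsp.cc/sv/sample.sam");
        System.out.println(local.getFileSystem()); // the default (local) file system

        // Paths.get(URI) looks up the installed provider for the scheme.
        // This fails at runtime unless an hdfs NIO provider (e.g.
        // hdfs.jsr203) is on the classpath and can reach the cluster.
        Path remote = Paths.get(URI.create("hdfs://namenode:8020/project/casa/gcad/adsp.cc/sv/sample.sam"));
        System.out.println(remote.getFileSystem());
    }
}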

jjfarrell commented 5 years ago

@cmnbroad @SHuang-Broad

The cluster uses Kerberos for authentication. This style of pathname works for reading the cram file, which is on the hdfs file system.

Using the hadoop shell works fine:

hadoop fs -ls hdfs:///project/casa/gcad/adsp.cc
Found 2 items
drwxrwxr-x - zhucc casa 0 2018-04-27 14:59 hdfs:///project/casa/gcad/adsp.cc/cram
drwxrwxr-x - farrell casa 0 2018-05-08 15:21 hdfs:///project/casa/gcad/adsp.cc/sv

When I change this to a local file path, a similar error occurs. The program runs for 40-plus minutes and then fails with the following error.

2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 92.0 in stage 13.0 (TID 68093) in 1108 ms on scc-q01.scc.bu.edu (executor 24) (101/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 101.0 in stage 13.0 (TID 68102) in 1061 ms on scc-q01.scc.bu.edu (executor 6) (102/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 34.0 in stage 13.0 (TID 68035) in 1653 ms on scc-q01.scc.bu.edu (executor 24) (103/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 44.0 in stage 13.0 (TID 68045) in 1553 ms on scc-q07.scc.bu.edu (executor 7) (104/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 63.0 in stage 13.0 (TID 68064) in 1362 ms on scc-q01.scc.bu.edu (executor 24) (105/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 102.0 in stage 13.0 (TID 68103) in 1057 ms on scc-q07.scc.bu.edu (executor 7) (106/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 39.0 in stage 13.0 (TID 68040) in 1604 ms on scc-q06.scc.bu.edu (executor 23) (107/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 5.0 in stage 13.0 (TID 68006) in 2015 ms on scc-q01.scc.bu.edu (executor 24) (108/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 10.0 in stage 13.0 (TID 68011) in 1928 ms on scc-q06.scc.bu.edu (executor 23) (109/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 15.0 in stage 13.0 (TID 68016) in 1865 ms on scc-q07.scc.bu.edu (executor 7) (110/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 97.0 in stage 13.0 (TID 68098) in 1083 ms on scc-q06.scc.bu.edu (executor 23) (111/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 73.0 in stage 13.0 (TID 68074) in 1265 ms on scc-q07.scc.bu.edu (executor 7) (112/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 25.0 in stage 13.0 (TID 68026) in 1754 ms on scc-q06.scc.bu.edu (executor 5) (113/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 83.0 in stage 13.0 (TID 68084) in 1169 ms on scc-q06.scc.bu.edu (executor 5) (114/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 54.0 in stage 13.0 (TID 68055) in 1454 ms on scc-q06.scc.bu.edu (executor 5) (115/116)
2019-05-19 19:09:41 INFO  TaskSetManager:54 - Finished task 112.0 in stage 13.0 (TID 68113) in 1010 ms on scc-q06.scc.bu.edu (executor 5) (116/116)
2019-05-19 19:09:41 INFO  YarnScheduler:54 - Removed TaskSet 13.0, whose tasks have all completed, from pool
2019-05-19 19:09:41 INFO  DAGScheduler:54 - ResultStage 13 (foreach at BwaMemIndexCache.java:84) finished in 2.117 s
2019-05-19 19:09:41 INFO  DAGScheduler:54 - Job 9 finished: foreach at BwaMemIndexCache.java:84, took 2.128154 s
2019-05-19 19:09:41 INFO  AbstractConnector:318 - Stopped Spark@42576db9{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-05-19 19:09:41 INFO  SparkUI:54 - Stopped Spark web UI at http://scc-hadoop.bu.edu:4040
2019-05-19 19:09:41 INFO  YarnClientSchedulerBackend:54 - Interrupting monitor thread
2019-05-19 19:09:41 INFO  YarnClientSchedulerBackend:54 - Shutting down all executors
2019-05-19 19:09:41 INFO  YarnSchedulerBackend$YarnDriverEndpoint:54 - Asking each executor to shut down
2019-05-19 19:09:41 INFO  SchedulerExtensionServices:54 - Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
2019-05-19 19:09:41 INFO  YarnClientSchedulerBackend:54 - Stopped
2019-05-19 19:09:41 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-05-19 19:09:41 INFO  MemoryStore:54 - MemoryStore cleared
2019-05-19 19:09:41 INFO  BlockManager:54 - BlockManager stopped
2019-05-19 19:09:41 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2019-05-19 19:09:41 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-05-19 19:09:41 INFO  SparkContext:54 - Successfully stopped SparkContext
19:09:41.578 INFO  StructuralVariationDiscoveryPipelineSpark - Shutting down engine
[May 19, 2019 7:09:41 PM EDT] org.broadinstitute.hellbender.tools.spark.sv.StructuralVariationDiscoveryPipelineSpark done. Elapsed time: 44.89 minutes.
Runtime.totalMemory()=21646802944
htsjdk.samtools.util.RuntimeIOException: Error opening file: file:///restricted/projectnb/casa/wgs.hg38/pipelines/sv/gatk.sv/temp/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.contig-sam-file.sam
        at htsjdk.samtools.SAMFileWriterFactory.makeSAMWriter(SAMFileWriterFactory.java:356)
        at htsjdk.samtools.SAMFileWriterFactory.makeSAMOrBAMWriter(SAMFileWriterFactory.java:437)
        at org.broadinstitute.hellbender.tools.spark.sv.utils.SVFileUtils.writeSAMFile(SVFileUtils.java:29)
        at org.broadinstitute.hellbender.tools.spark.sv.evidence.AlignedAssemblyOrExcuse.writeAssemblySAMFile(AlignedAssemblyOrExcuse.java:336)
        at org.broadinstitute.hellbender.tools.spark.sv.evidence.FindBreakpointEvidenceSpark.gatherEvidenceAndWriteContigSamFile(FindBreakpointEvidenceSpark.java:199)
        at org.broadinstitute.hellbender.tools.spark.sv.StructuralVariationDiscoveryPipelineSpark.runTool(StructuralVariationDiscoveryPipelineSpark.java:164)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:528)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.nio.file.FileSystemException: /restricted/projectnb/casa/wgs.hg38/pipelines/sv/gatk.sv/temp/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.contig-sam-file.sam: Not a directory
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)

cmnbroad commented 5 years ago

This style of pathname works for reading the cram file which is on the hdfs file system.

@jjfarrell Do you mean that GATK can read the CRAM file using a ("hdfs://") path with no auth component?

jjfarrell commented 5 years ago

Yes.


SHuang-Broad commented 5 years ago

@jjfarrell

After talking with @cmnbroad this afternoon, we'd like to ask you to run an experiment to help narrow down where the issue lies.

Is it possible for you to run PrintReadsSpark on the same cluster? That is, something similar to:

gatk --java-options "-Djava.io.tmpdir=tmp" \
  PrintReadsSpark \
  -R $REF \
  -I $CRAM_DIR/$SAMPLE.cram \
  -O hdfs:///project/casa/gcad/$CENTER/sv/$SAMPLE.cram \
  -- \
  --spark-runner SPARK \
  --spark-master yarn \
  --deploy-mode client  \
  --executor-memory 85G \
  --driver-memory 30g \
  --num-executors 40 \
  --executor-cores 4 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --name "$SAMPLE" \
  --conf spark.yarn.executor.memoryOverhead=5000  \
  --conf spark.network.timeout=600  \
  --conf spark.executor.heartbeatInterval=120

jjfarrell commented 5 years ago

@SHuang-Broad

Here is the tail end of the PrintReadsSpark jobs (tested on BAM, CRAM, and SAM file types).

Writing to a BAM failed while writing the .bai index (writing to a CRAM and writing to a SAM both worked):

2019-06-03 22:34:34 INFO  IndexFileMerger:100 - Done merging .sbi files
2019-06-03 22:34:34 INFO  IndexFileMerger:69 - Merging .bai files in temp directory hdfs:///project/casa/gcad/adsp.cc/sv/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.bam.parts/ to hdfs:///project/casa/gcad/adsp.cc/sv/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.bam.bai
2019-06-03 22:34:48 INFO  AbstractConnector:318 - Stopped Spark@6be766d1{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-06-03 22:34:48 INFO  SparkUI:54 - Stopped Spark web UI at http://scc-hadoop.bu.edu:4040
2019-06-03 22:34:48 INFO  YarnClientSchedulerBackend:54 - Interrupting monitor thread
2019-06-03 22:34:48 INFO  YarnClientSchedulerBackend:54 - Shutting down all executors
2019-06-03 22:34:48 INFO  YarnSchedulerBackend$YarnDriverEndpoint:54 - Asking each executor to shut down
2019-06-03 22:34:48 INFO  SchedulerExtensionServices:54 - Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
2019-06-03 22:34:48 INFO  YarnClientSchedulerBackend:54 - Stopped
2019-06-03 22:34:48 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-06-03 22:34:49 INFO  MemoryStore:54 - MemoryStore cleared
2019-06-03 22:34:49 INFO  BlockManager:54 - BlockManager stopped
2019-06-03 22:34:49 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2019-06-03 22:34:49 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-06-03 22:34:49 INFO  SparkContext:54 - Successfully stopped SparkContext
22:34:49.027 INFO  PrintReadsSpark - Shutting down engine
[June 3, 2019 10:34:49 PM EDT] org.broadinstitute.hellbender.tools.spark.pipelines.PrintReadsSpark done. Elapsed time: 3.72 minutes.
Runtime.totalMemory()=3829923840
htsjdk.samtools.util.RuntimeIOException: java.io.IOException: Stream closed
        at htsjdk.samtools.IndexStreamBuffer.readFully(IndexStreamBuffer.java:23)
        at htsjdk.samtools.IndexStreamBuffer.readInteger(IndexStreamBuffer.java:56)
        at htsjdk.samtools.AbstractBAMFileIndex.readInteger(AbstractBAMFileIndex.java:432)
        at htsjdk.samtools.AbstractBAMFileIndex.query(AbstractBAMFileIndex.java:272)
        at htsjdk.samtools.CachingBAMFileIndex.getQueryResults(CachingBAMFileIndex.java:159)
        at htsjdk.samtools.BAMIndexMerger.processIndex(BAMIndexMerger.java:43)
        at htsjdk.samtools.BAMIndexMerger.processIndex(BAMIndexMerger.java:16)
        at org.disq_bio.disq.impl.file.IndexFileMerger.mergeParts(IndexFileMerger.java:90)
        at org.disq_bio.disq.impl.formats.bam.BamSink.save(BamSink.java:132)
        at org.disq_bio.disq.HtsjdkReadsRddStorage.write(HtsjdkReadsRddStorage.java:225)
        at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:155)
        at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:120)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:361)
        at org.broadinstitute.hellbender.tools.spark.pipelines.PrintReadsSpark.runTool(PrintReadsSpark.java:35)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:528)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Stream closed
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:829)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:889)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.disq_bio.disq.impl.file.HadoopFileSystemWrapper$SeekableHadoopStream.read(HadoopFileSystemWrapper.java:232)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at htsjdk.samtools.seekablestream.SeekableBufferedStream.read(SeekableBufferedStream.java:133)
        at htsjdk.samtools.IndexStreamBuffer.readFully(IndexStreamBuffer.java:21)
        ... 31 more
2019-06-03 22:34:49 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-06-03 22:34:49 INFO  ShutdownHookManager:54 - Deleting directory /restricted/projectnb/casa/wgs.hg38/sv/gatk.sv/tmp/spark-5bc02b27-fba3-40b7-8532-911b658b4698
2019-06-03 22:34:49 INFO  ShutdownHookManager:54 - Deleting directory /restricted/projectnb/casa/wgs.hg38/sv/gatk.sv/tmp/spark-998b56f6-e759-403c-8e27-48b3bdf90def

hadoop fs -ls /project/casa/gcad/adsp.cc/sv/*
-rw-r--r-- 3 farrell casa 68081502795 2019-06-03 22:34 /project/casa/gcad/adsp.cc/sv/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.bam
-rw-r--r-- 3 farrell casa           0 2019-06-03 22:34 /project/casa/gcad/adsp.cc/sv/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.bam.bai
Found 1 items
-rw-r--r-- 3 farrell casa           0 2019-06-03 22:34 /project/casa/gcad/adsp.cc/sv/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.bam.parts/_SUCCESS
-rw-r--r-- 3 farrell casa     1684348 2019-06-03 22:34 /project/casa/gcad/adsp.cc/sv/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.bam.sbi

Below is the log from writing to a CRAM. It looks like it worked:

2019-06-03 22:45:09 INFO  TaskSetManager:54 - Finished task 140.0 in stage 0.0 (TID 159) in 137438 ms on scc-q07.scc.bu.edu (executor 26) (183/189)
2019-06-03 22:45:11 INFO  TaskSetManager:54 - Finished task 169.0 in stage 0.0 (TID 174) in 136322 ms on scc-q03.scc.bu.edu (executor 7) (184/189)
2019-06-03 22:45:14 INFO  TaskSetManager:54 - Finished task 137.0 in stage 0.0 (TID 142) in 152140 ms on scc-q11.scc.bu.edu (executor 6) (185/189)
2019-06-03 22:45:14 INFO  TaskSetManager:54 - Finished task 173.0 in stage 0.0 (TID 178) in 137708 ms on scc-q08.scc.bu.edu (executor 27) (186/189)
2019-06-03 22:45:16 INFO  TaskSetManager:54 - Finished task 153.0 in stage 0.0 (TID 166) in 142455 ms on scc-q08.scc.bu.edu (executor 27) (187/189)
2019-06-03 22:45:16 INFO  TaskSetManager:54 - Finished task 159.0 in stage 0.0 (TID 172) in 142401 ms on scc-q08.scc.bu.edu (executor 27) (188/189)
2019-06-03 22:45:33 INFO  TaskSetManager:54 - Finished task 150.0 in stage 0.0 (TID 184) in 153045 ms on scc-q10.scc.bu.edu (executor 3) (189/189)
2019-06-03 22:45:33 INFO  DAGScheduler:54 - ResultStage 0 (runJob at SparkHadoopWriter.scala:78) finished in 301.853 s
2019-06-03 22:45:33 INFO  YarnScheduler:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2019-06-03 22:45:33 INFO  DAGScheduler:54 - Job 0 finished: runJob at SparkHadoopWriter.scala:78, took 302.340057 s
2019-06-03 22:45:35 INFO  SparkHadoopWriter:54 - Job job_20190603224030_0014 committed.
2019-06-03 22:45:35 INFO  AbstractConnector:318 - Stopped Spark@6be766d1{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-06-03 22:45:35 INFO  SparkUI:54 - Stopped Spark web UI at http://scc-hadoop.bu.edu:4040
2019-06-03 22:45:35 INFO  YarnClientSchedulerBackend:54 - Interrupting monitor thread
2019-06-03 22:45:35 INFO  YarnClientSchedulerBackend:54 - Shutting down all executors
2019-06-03 22:45:35 INFO  YarnSchedulerBackend$YarnDriverEndpoint:54 - Asking each executor to shut down
2019-06-03 22:45:35 INFO  SchedulerExtensionServices:54 - Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
2019-06-03 22:45:35 INFO  YarnClientSchedulerBackend:54 - Stopped
2019-06-03 22:45:35 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-06-03 22:45:35 INFO  MemoryStore:54 - MemoryStore cleared
2019-06-03 22:45:35 INFO  BlockManager:54 - BlockManager stopped
2019-06-03 22:45:35 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2019-06-03 22:45:35 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-06-03 22:45:35 INFO  SparkContext:54 - Successfully stopped SparkContext
22:45:35.933 INFO  PrintReadsSpark - Shutting down engine
[June 3, 2019 10:45:35 PM EDT] org.broadinstitute.hellbender.tools.spark.pipelines.PrintReadsSpark done. Elapsed time: 5.79 minutes.
Runtime.totalMemory()=4147118080
2019-06-03 22:45:35 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-06-03 22:45:35 INFO  ShutdownHookManager:54 - Deleting directory /restricted/projectnb/casa/wgs.hg38/sv/gatk.sv/tmp/spark-423d02dc-cbc1-4c83-907d-ca315ca231bc
2019-06-03 22:45:35 INFO  ShutdownHookManager:54 - Deleting directory /restricted/projectnb/casa/wgs.hg38/sv/gatk.sv/tmp/spark-c035847e-6113-48f1-b5d1-66184925be7d

$ hadoop fs -ls /project/casa/gcad/adsp.cc/sv/*

-rw-r--r-- 3 farrell casa     1684348 2019-06-03 22:34 /project/casa/gcad/adsp.cc/sv/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.bam.sbi
-rw-r--r-- 3 farrell casa 27494132363 2019-06-03 22:45 /project/casa/gcad/adsp.cc/sv/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.cram

Writing to a SAM worked without triggering the error:

2019-06-03 22:59:08 INFO  TaskSetManager:54 - Finished task 183.0 in stage 0.0 (TID 181) in 106230 ms on scc-q04.scc.bu.edu (executor 23) (162/189)
2019-06-03 22:59:08 INFO  TaskSetManager:54 - Finished task 154.0 in stage 0.0 (TID 153) in 138243 ms on scc-q01.scc.bu.edu (executor 18) (163/189)
2019-06-03 22:59:09 INFO  TaskSetManager:54 - Finished task 142.0 in stage 0.0 (TID 147) in 144577 ms on scc-q09.scc.bu.edu (executor 29) (164/189)
2019-06-03 22:59:10 INFO  TaskSetManager:54 - Finished task 136.0 in stage 0.0 (TID 138) in 195420 ms on scc-q09.scc.bu.edu (executor 16) (165/189)
2019-06-03 22:59:12 INFO  TaskSetManager:54 - Finished task 146.0 in stage 0.0 (TID 152) in 142642 ms on scc-q06.scc.bu.edu (executor 28) (166/189)
2019-06-03 22:59:14 INFO  TaskSetManager:54 - Finished task 162.0 in stage 0.0 (TID 156) in 138765 ms on scc-q06.scc.bu.edu (executor 28) (167/189)
2019-06-03 22:59:16 INFO  TaskSetManager:54 - Finished task 151.0 in stage 0.0 (TID 146) in 153347 ms on scc-q03.scc.bu.edu (executor 25) (168/189)
2019-06-03 22:59:17 INFO  TaskSetManager:54 - Finished task 181.0 in stage 0.0 (TID 186) in 113859 ms on scc-q03.scc.bu.edu (executor 25) (169/189)
2019-06-03 22:59:18 INFO  TaskSetManager:54 - Finished task 174.0 in stage 0.0 (TID 170) in 124158 ms on scc-q04.scc.bu.edu (executor 23) (170/189)
2019-06-03 22:59:21 INFO  TaskSetManager:54 - Finished task 182.0 in stage 0.0 (TID 184) in 117914 ms on scc-q07.scc.bu.edu (executor 24) (171/189)
2019-06-03 22:59:23 INFO  TaskSetManager:54 - Finished task 170.0 in stage 0.0 (TID 168) in 134113 ms on scc-q04.scc.bu.edu (executor 23) (172/189)
2019-06-03 22:59:23 INFO  TaskSetManager:54 - Finished task 150.0 in stage 0.0 (TID 135) in 220358 ms on scc-q12.scc.bu.edu (executor 17) (173/189)
2019-06-03 22:59:24 INFO  TaskSetManager:54 - Finished task 160.0 in stage 0.0 (TID 157) in 147729 ms on scc-q11.scc.bu.edu (executor 20) (174/189)
2019-06-03 22:59:30 INFO  TaskSetManager:54 - Finished task 153.0 in stage 0.0 (TID 158) in 152899 ms on scc-q13.scc.bu.edu (executor 1) (175/189)
2019-06-03 22:59:37 INFO  TaskSetManager:54 - Finished task 172.0 in stage 0.0 (TID 183) in 133959 ms on scc-q14.scc.bu.edu (executor 21) (176/189)
2019-06-03 22:59:40 INFO  TaskSetManager:54 - Finished task 137.0 in stage 0.0 (TID 142) in 189085 ms on scc-q07.scc.bu.edu (executor 5) (177/189)
2019-06-03 22:59:40 INFO  TaskSetManager:54 - Finished task 156.0 in stage 0.0 (TID 164) in 156564 ms on scc-q07.scc.bu.edu (executor 24) (178/189)
2019-06-03 22:59:41 INFO  TaskSetManager:54 - Finished task 169.0 in stage 0.0 (TID 172) in 146702 ms on scc-q12.scc.bu.edu (executor 17) (179/189)
2019-06-03 22:59:44 INFO  TaskSetManager:54 - Finished task 157.0 in stage 0.0 (TID 161) in 164706 ms on scc-q05.scc.bu.edu (executor 9) (180/189)
2019-06-03 22:59:45 INFO  TaskSetManager:54 - Finished task 159.0 in stage 0.0 (TID 162) in 164547 ms on scc-q02.scc.bu.edu (executor 22) (181/189)
2019-06-03 22:59:46 INFO  TaskSetManager:54 - Finished task 148.0 in stage 0.0 (TID 145) in 183240 ms on scc-q03.scc.bu.edu (executor 6) (182/189)
2019-06-03 22:59:48 INFO  TaskSetManager:54 - Finished task 155.0 in stage 0.0 (TID 151) in 178814 ms on scc-q03.scc.bu.edu (executor 25) (183/189)
2019-06-03 22:59:49 INFO  TaskSetManager:54 - Finished task 167.0 in stage 0.0 (TID 175) in 153665 ms on scc-q16.scc.bu.edu (executor 2) (184/189)
2019-06-03 22:59:51 INFO  TaskSetManager:54 - Finished task 180.0 in stage 0.0 (TID 173) in 156301 ms on scc-q07.scc.bu.edu (executor 24) (185/189)
2019-06-03 22:59:58 INFO  TaskSetManager:54 - Finished task 171.0 in stage 0.0 (TID 169) in 166183 ms on scc-q04.scc.bu.edu (executor 4) (186/189)
2019-06-03 23:00:00 INFO  TaskSetManager:54 - Finished task 168.0 in stage 0.0 (TID 177) in 161852 ms on scc-q02.scc.bu.edu (executor 22) (187/189)
2019-06-03 23:00:02 INFO  TaskSetManager:54 - Finished task 173.0 in stage 0.0 (TID 182) in 158461 ms on scc-q02.scc.bu.edu (executor 22) (188/189)
2019-06-03 23:00:06 INFO  TaskSetManager:54 - Finished task 152.0 in stage 0.0 (TID 155) in 192765 ms on scc-q07.scc.bu.edu (executor 5) (189/189)
2019-06-03 23:00:06 INFO  DAGScheduler:54 - ResultStage 0 (runJob at SparkHadoopWriter.scala:78) finished in 374.804 s
2019-06-03 23:00:06 INFO  YarnScheduler:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2019-06-03 23:00:06 INFO  DAGScheduler:54 - Job 0 finished: runJob at SparkHadoopWriter.scala:78, took 375.304795 s
2019-06-03 23:00:08 INFO  SparkHadoopWriter:54 - Job job_20190603225351_0015 committed.
2019-06-03 23:00:09 INFO  AbstractConnector:318 - Stopped Spark@6be766d1{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-06-03 23:00:09 INFO  SparkUI:54 - Stopped Spark web UI at http://scc-hadoop.bu.edu:4040
2019-06-03 23:00:09 INFO  YarnClientSchedulerBackend:54 - Interrupting monitor thread
2019-06-03 23:00:09 INFO  YarnClientSchedulerBackend:54 - Shutting down all executors
2019-06-03 23:00:09 INFO  YarnSchedulerBackend$YarnDriverEndpoint:54 - Asking each executor to shut down
2019-06-03 23:00:09 INFO  SchedulerExtensionServices:54 - Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
2019-06-03 23:00:09 INFO  YarnClientSchedulerBackend:54 - Stopped
2019-06-03 23:00:09 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-06-03 23:00:09 INFO  MemoryStore:54 - MemoryStore cleared
2019-06-03 23:00:09 INFO  BlockManager:54 - BlockManager stopped
2019-06-03 23:00:09 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2019-06-03 23:00:09 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-06-03 23:00:09 INFO  SparkContext:54 - Successfully stopped SparkContext
23:00:09.356 INFO  PrintReadsSpark - Shutting down engine
[June 3, 2019 11:00:09 PM EDT] org.broadinstitute.hellbender.tools.spark.pipelines.PrintReadsSpark done. Elapsed time: 7.01 minutes.
Runtime.totalMemory()=4327997440
2019-06-03 23:00:09 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-06-03 23:00:09 INFO  ShutdownHookManager:54 - Deleting directory /restricted/projectnb/casa/wgs.hg38/sv/gatk.sv/tmp/spark-73067845-b641-4212-9c81-51e8d6aa9f31
2019-06-03 23:00:09 INFO  ShutdownHookManager:54 - Deleting directory /restricted/projectnb/casa/wgs.hg38/sv/gatk.sv/tmp/spark-b4b61d51-d75f-45e5-9e10-8171d3acea1d

hadoop fs -ls /project/casa/gcad/adsp.cc/sv/*sam
-rw-r--r-- 3 farrell casa 389867305631 2019-06-03 23:00 /project/casa/gcad/adsp.cc/sv/A-ACT-AC000014-BL-NCR-15AD78694.hg38.realign.bqsr.sam

jjfarrell commented 5 years ago

Any updates on this issue?