bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

HaplotypeCaller fails with CWL #2691

Closed: amizeranschi closed this issue 5 years ago

amizeranschi commented 5 years ago

Hello,

As recommended in https://github.com/bcbio/bcbio-nextgen/issues/2689#issuecomment-465955157, I tried using GATK4 with CWL (Cromwell) for calling variants, using the population (pooled) calling method as opposed to joint calling.
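
For reference, the distinction in the sample YAML is roughly this (a sketch based on the bcbio documentation; samples sharing the same metadata: batch are called together as one pooled batch):

    # population/pooled calling (used here): batched samples, one caller
    variantcaller: gatk-haplotype
    # joint calling instead: add a jointcaller alongside the variantcaller
    variantcaller: gatk-haplotype
    jointcaller: gatk-haplotype-joint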

The sample config file and the error that I get are below. The error (java.net.UnknownHostException: asas: asas: Name or service not known) seems to mention my local hostname (asas) for some reason.

Other input files that I used can be found in this archive: https://github.com/bcbio/bcbio-nextgen/files/2884474/platypus-error.tar.gz

These are the commands used to set up and run the analysis:

bcbio_vm.py cwl --systemconfig /export/home/ncit/external/a.mizeranschi/bcbio_nextgen/galaxy/bcbio_system-hg38.yaml ../config/testingVC-merged.yaml

bcbio_vm.py cwlrun cromwell testingVC-merged-workflow --no-container

Note: I'm using a local bcbio install; bcbio_vm.py was installed with bcbio_nextgen.py upgrade -u development --data --tools --cwl.

This is the sample config file:

details:
- algorithm:
    aligner: bwa
    effects: false
    mark_duplicates: true
    ploidy: 2
    realign: false
    recalibrate: false
    tools_off:
    - gemini
    - contamination
    - peddy
    variant_regions: ../config/variant_regions.bed
    variantcaller: gatk-haplotype
  analysis: variant2
  description: HG00096
  files:
  - /export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/HG00096_R1.fastq.gz
  - /export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/HG00096_R2.fastq.gz
  genome_build: hg38
  metadata:
    batch: testingVC
- algorithm:
    aligner: bwa
    effects: false
    mark_duplicates: true
    ploidy: 2
    realign: false
    recalibrate: false
    tools_off:
    - gemini
    - contamination
    - peddy
    variant_regions: ../config/variant_regions.bed
    variantcaller: gatk-haplotype
  analysis: variant2
  description: HG00097
  files:
  - /export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/HG00097_R1.fastq.gz
  - /export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/HG00097_R2.fastq.gz
  genome_build: hg38
  metadata:
    batch: testingVC
fc_name: testingVC-merged
upload:
  dir: ../final

And the error is the following:

11:39:22.539 INFO  HaplotypeCallerSpark - ------------------------------------------------------------
11:39:22.540 INFO  HaplotypeCallerSpark - The Genome Analysis Toolkit (GATK) v4.1.0.0
11:39:22.540 INFO  HaplotypeCallerSpark - For support and documentation go to https://software.broadinstitute.org/gatk/
11:39:22.542 WARN  HaplotypeCallerSpark - 
   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   Warning: HaplotypeCallerSpark is a BETA tool and is not yet ready for use in production
   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
11:39:22.542 INFO  HaplotypeCallerSpark - Initializing engine
11:39:22.542 INFO  HaplotypeCallerSpark - Done initializing engine
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/02/22 11:39:22 INFO SparkContext: Running Spark version 2.2.0
19/02/22 11:39:22 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
11:39:23.038 INFO  HaplotypeCallerSpark - Shutting down engine
[February 22, 2019 11:39:23 AM EET] org.broadinstitute.hellbender.tools.HaplotypeCallerSpark done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=1239941120
Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:546)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:373)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at org.broadinstitute.hellbender.engine.spark.SparkContextFactory.createSparkContext(SparkContextFactory.java:178)
    at org.broadinstitute.hellbender.engine.spark.SparkContextFactory.getSparkContext(SparkContextFactory.java:110)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:28)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
Caused by: java.net.UnknownHostException: asas: asas: Name or service not known
    at java.net.InetAddress.getLocalHost(InetAddress.java:1506)
    at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:891)
    at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress$lzycompute(Utils.scala:884)
    at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress(Utils.scala:884)
    at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
    at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.util.Utils$.localHostName(Utils.scala:941)
    at org.apache.spark.internal.config.package$.<init>(package.scala:204)
    at org.apache.spark.internal.config.package$.<clinit>(package.scala)
    ... 12 more
Caused by: java.net.UnknownHostException: asas: Name or service not known
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
    at java.net.InetAddress.getLocalHost(InetAddress.java:1501)
    ... 21 more
' returned non-zero exit status 1
chapmanb commented 5 years ago

Sorry about the issue. CWL calling with GATK HaplotypeCaller uses GATK4's Spark parallelization to give faster multicore runs. Spark does require that hostnames on your machine are set up correctly and resolve. It looks like you might be running this on a laptop or other local machine where the local hostname does not resolve. If so, fixing the networking, or running on a cloud or compute machine where the hostname does resolve, will hopefully fix the issue. Hope this helps.
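
For reference, a quick way to check whether the local hostname resolves, plus the usual /etc/hosts fix (a sketch; asas is the hostname from the log above):

    hostname                      # prints the local hostname, e.g. asas
    getent hosts "$(hostname)"    # no output means the name does not resolve
    # common fix: add an /etc/hosts entry for the hostname, e.g.
    # 127.0.0.1   localhost asas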

amizeranschi commented 5 years ago

Thanks a lot for your reply and the pointers. Setting up the hostname in the /etc/hosts file definitely got things going, but now I get a different error:

14:36:55.421 INFO  HaplotypeCallerSpark - Requester pays: disabled
14:36:55.422 WARN  HaplotypeCallerSpark - 
   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   Warning: HaplotypeCallerSpark is a BETA tool and is not yet ready for use in production
   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
14:36:55.422 INFO  HaplotypeCallerSpark - Initializing engine
14:36:55.422 INFO  HaplotypeCallerSpark - Done initializing engine
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/02/22 14:36:55 INFO SparkContext: Running Spark version 2.2.0
19/02/22 14:36:55 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
19/02/22 14:36:55 WARN Utils: Your hostname, asas resolves to a loopback address: 127.0.0.1; using 141.85.248.29 instead (on interface eth0)
19/02/22 14:36:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/02/22 14:36:56 INFO SparkContext: Submitted application: HaplotypeCallerSpark
19/02/22 14:36:56 INFO SecurityManager: Changing view acls to: amizeranschi
19/02/22 14:36:56 INFO SecurityManager: Changing modify acls to: amizeranschi
19/02/22 14:36:56 INFO SecurityManager: Changing view acls groups to: 
19/02/22 14:36:56 INFO SecurityManager: Changing modify acls groups to: 
19/02/22 14:36:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(amizeranschi); groups with view permissions: Set(); users  with modify permissions: Set(amizeranschi); groups with modify permissions: Set()
19/02/22 14:36:56 INFO Utils: Successfully started service 'sparkDriver' on port 38922.
19/02/22 14:36:56 INFO SparkEnv: Registering MapOutputTracker
19/02/22 14:36:57 INFO SparkEnv: Registering BlockManagerMaster
19/02/22 14:36:57 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/02/22 14:36:57 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/02/22 14:36:57 INFO DiskBlockManager: Created local directory at /export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/5c4d57ac-2e42-41ec-88fb-c803f96c7f27/call-variantcall/shard-0/wf-variantcall.cwl/fba01806-77f2-4033-a02b-1bd74632e72e/call-variantcall_batch_region/shard-0/execution/bcbiotx/tmpjNTNjr/blockmgr-3b149660-7113-4752-be93-9f38b0134711
19/02/22 14:36:57 INFO MemoryStore: MemoryStore started with capacity 20.8 GB
14:36:57.578 WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/02/22 14:36:57 INFO SparkEnv: Registering OutputCommitCoordinator
19/02/22 14:36:58 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/02/22 14:36:58 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://localhost:4040
19/02/22 14:36:58 INFO Executor: Starting executor ID driver on host localhost
19/02/22 14:36:58 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39210.
19/02/22 14:36:58 INFO NettyBlockTransferService: Server created on localhost:39210
19/02/22 14:36:58 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/02/22 14:36:58 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, localhost, 39210, None)
19/02/22 14:36:58 INFO BlockManagerMasterEndpoint: Registering block manager localhost:39210 with 20.8 GB RAM, BlockManagerId(driver, localhost, 39210, None)
19/02/22 14:36:58 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, localhost, 39210, None)
19/02/22 14:36:58 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, localhost, 39210, None)
19/02/22 14:36:58 INFO Executor: Told to re-register on heartbeat
19/02/22 14:36:58 INFO BlockManager: BlockManager BlockManagerId(driver, localhost, 39210, None) re-registering with master
19/02/22 14:36:58 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, localhost, 39210, None)
19/02/22 14:36:58 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, localhost, 39210, None)
19/02/22 14:36:58 INFO BlockManager: Reporting 0 blocks to the master.
19/02/22 14:36:58 INFO Executor: Told to re-register on heartbeat
19/02/22 14:36:58 INFO BlockManager: BlockManager BlockManagerId(driver, localhost, 39210, None) re-registering with master
19/02/22 14:36:58 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, localhost, 39210, None)
19/02/22 14:36:58 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, localhost, 39210, None)
19/02/22 14:36:58 INFO BlockManager: Reporting 0 blocks to the master.
19/02/22 14:36:58 INFO Executor: Told to re-register on heartbeat
19/02/22 14:36:58 INFO BlockManager: BlockManager BlockManagerId(driver, localhost, 39210, None) re-registering with master
19/02/22 14:36:58 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, localhost, 39210, None)
19/02/22 14:36:58 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, localhost, 39210, None)
19/02/22 14:36:58 INFO BlockManager: Reporting 0 blocks to the master.
19/02/22 14:36:59 INFO SparkUI: Stopped Spark web UI at http://localhost:4040
19/02/22 14:36:59 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/02/22 14:36:59 INFO MemoryStore: MemoryStore cleared
19/02/22 14:36:59 INFO BlockManager: BlockManager stopped
19/02/22 14:36:59 INFO BlockManagerMaster: BlockManagerMaster stopped
19/02/22 14:36:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/02/22 14:36:59 INFO SparkContext: Successfully stopped SparkContext
14:36:59.485 INFO  HaplotypeCallerSpark - Shutting down engine
[February 22, 2019 2:36:59 PM EET] org.broadinstitute.hellbender.tools.HaplotypeCallerSpark done. Elapsed time: 0.10 minutes.
Runtime.totalMemory()=1240465408
***********************************************************************
A USER ERROR has occurred: Sorry, we only support a single reads input for for this spark tool.
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
19/02/22 14:36:59 INFO ShutdownHookManager: Shutdown hook called
19/02/22 14:36:59 INFO ShutdownHookManager: Deleting directory /export/home/ncit/external/a.mizeranschi/automated-VC-test/testingVC-merged/work/cromwell_work/cromwell-executions/main-testingVC-merged.cwl/5c4d57ac-2e42-41ec-88fb-c803f96c7f27/call-variantcall/shard-0/wf-variantcall.cwl/fba01806-77f2-4033-a02b-1bd74632e72e/call-variantcall_batch_region/shard-0/execution/bcbiotx/tmpjNTNjr/spark-12057323-1c5f-42f3-a098-d5d508040ec5
' returned non-zero exit status 2
___________________________________________________________________________

[2019-02-22 14:37:04,69] [info] WorkflowManagerActor WorkflowActor-5c4d57ac-2e42-41ec-88fb-c803f96c7f27 is in a terminal state: WorkflowFailedState

What should be done with paired-end reads in this scenario (CWL with pooled variant calling)?
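
For what it's worth, the message is about the number of read inputs rather than read pairing: by this stage the paired-end FASTQs are already aligned into a single BAM per sample, and pooled calling presumably hands the Spark tool one BAM per sample in the batch, while Spark tools accept only a single reads input. Roughly (a sketch; file names hypothetical):

    # pooled batch: one aligned BAM per sample on the command line
    gatk HaplotypeCallerSpark -I HG00096.bam -I HG00097.bam ...   # rejected: single reads input only
    # the non-Spark HaplotypeCaller accepts multiple -I inputs
    gatk HaplotypeCaller -I HG00096.bam -I HG00097.bam ...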

amizeranschi commented 5 years ago

I've also tested with

    variantcaller:
    - strelka2
    - freebayes
    - samtools

and

    ensemble:
      numpass: 2

and I get

chr7    141972804   rs10246939  T   C   156 PASS    AC=3;AN=4;BQB=0.676634;CALLERS=samtools,freebayes,strelka2;DP=16;DP4=4,2,7,2;HOB=0.125;ICB=0.3;MQ=60;MQ0F=0;MQB=1;MQSB=1;RPB=0.569783;SGB=-4.72062;VDB=0.872557 GT:PL:DP:AD 1/1:55,6,0:2:0,2    0/1:134,0,133:13:6,7
chr7    141972905   rs1726866   G   A   115 PASS    AC=1;AN=4;BQB=0.687289;CALLERS=samtools,freebayes,strelka2;DP=22;DP4=2,12,2,4;HOB=0.125;ICB=0.3;MQ=60;MQ0F=0;MQB=1;MQSB=1;RPB=0.418638;SGB=2.63044;VDB=0.12166  GT:PL:DP:AD 0/0:0,12,155:4:4,0  0/1:149,0,196:16:10,6
chr7    141973545   rs713598    C   G   179 PASS    AC=3;AN=4;BQB=0.92664;CALLERS=samtools,freebayes,strelka2;DP=17;DP4=5,5,4,3;HOB=0.125;ICB=0.3;MQ=60;MQ0F=0;MQB=1;MQSB=1;RPB=0.110671;SGB=-4.10598;VDB=0.125342  GT:PL:DP:AD 1/1:113,9,0:3:0,3   0/1:100,0,245:14:10,4

which is really great. It's a real shame that GATK4 isn't working as well.
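
For reference, a minimal sketch of how those fragments sit together in each sample's algorithm section (same keys as the config above):

    algorithm:
      variantcaller:
      - strelka2
      - freebayes
      - samtools
      ensemble:
        numpass: 2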

chapmanb commented 5 years ago

Sorry for the continued problems. I looked into this more and remembered that we switched off using HaplotypeCallerSpark as of release 1.1.1:

https://github.com/bcbio/bcbio-nextgen/commit/d33b6f4dc4aa1add37ff40d47c514aae7850fc18

and should currently run this multicore by region rather than trying the Spark approach. I know we've been talking about up-to-date versions, but I'm not sure if you're using a different install or an older Docker container here? Hopefully a recent one will avoid using Spark altogether and get your analysis running.
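
A quick way to confirm whether the installed version includes that change (assuming the standard --version flag; 1.1.1 is the release mentioned above):

    bcbio_nextgen.py --version    # 1.1.1 or later should avoid HaplotypeCallerSpark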

amizeranschi commented 5 years ago

I'm not using a different install or any Docker container, just a plain local bcbio_nextgen (see: https://github.com/bcbio/bcbio-nextgen/issues/2688#issuecomment-466341358) with CWL (Cromwell).

amizeranschi commented 5 years ago

Note: the error I mentioned above only happens when running with CWL. When running bcbio_nextgen directly, HaplotypeCaller works fine and HaplotypeCallerSpark appears to be switched off.

chapmanb commented 5 years ago

Thanks much for following up and sorry for missing the underlying issue. I believe this only happens when doing CWL pooled calling, and I've pushed a fix which should correctly avoid Spark usage when pooling. If you update with:

bcbio_nextgen.py upgrade -u development

and re-run in place, it should hopefully do the right thing now. Please let us know if you run into any other issues.