bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

bcbio CWL not detecting GATK3 version #3095

Closed matthdsm closed 4 years ago

matthdsm commented 4 years ago

Hi,

I'm running the germline variant pipeline through CWL and Cromwell, but it seems the CWL invocations of the bcbio codebase don't detect the GATK version correctly, causing the pipeline to fail.

I'm running the pipeline with the following command:

bcbio_vm.py cwlrun cromwell ../bcbio-workflow --no-container -q default -s torque -r walltime=360:00:00 --joblimit 20

with config:

---
fc_name:
upload:
  dir: ../final
globals:
  analysis_regions: WES_analysis_ROI_v2.bed
  coverage_regions: WES_analysis_ROI_v2.bed
resources:
  tmp:
    dir: /tmp/bcbio
details:
  - analysis: variant2
    genome_build: hg38
    description:
    metadata:
      batch:
      ped:
    algorithm:
      aligner: bwa
      save_diskspace: true
      coverage_interval: regional
      coverage: coverage_regions
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: gatk-haplotype
      variant_regions: analysis_regions
      jointcaller: gatk-haplotype-joint
      effects: vep
      effects_transcripts: all
      vcfanno: [dbscsnv,dbnsfp]
      tools_on:
        - vep_splicesite_annotations
        - gemini
        - coverage_perbase
        - picard
      tools_off:
        - gatk4
      archive: cram-lossless
    files:

and error:

Job variantcall_batch_region:1:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: /home/projects/bcbio_annotation/exomes/NVQExomes/NVQ_RUN_055/samples_NVQ055-merged/work/cromwell_work/cromwell-executions/main-samples_NVQ055-merged.cwl/4dc5a18a-a999-4d25-9c66-3244f09f45a2/call-variantcall/shard-15/wf-variantcall.cwl/0f4ad566-bdea-4629-a76f-9e9baa40a5f3/call-variantcall_batch_region/shard-1/execution/stderr.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 567, in __call__
    return self.func(*args, **kwargs)
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/utils.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multitasks.py", line 284, in variantcall_sample
    return genotype.variantcall_sample(*args)
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/genotype.py", line 377, in variantcall_sample
    out_file = caller_fn(align_bams, items, ref_file, assoc_files, region, out_file)
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/gatk.py", line 117, in haplotype_caller
    "Require full version of GATK 2.4+, or GATK4 for haplotype calling"
AssertionError: Require full version of GATK 2.4+, or GATK4 for haplotype calling
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/galaxy/bcbio/tools/bin/bcbio_nextgen.py", line 223, in <module>
    runfn.process(kwargs["args"]) 
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/runfn.py", line 57, in process
    out = fn(*fnargs)
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/utils.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multitasks.py", line 272, in variantcall_batch_region
    return genotype.variantcall_batch_region(*args)
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/genotype.py", line 459, in variantcall_batch_region
    call_file = _run_variantcall_batch_multicore(items, region_block, out_file)
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/genotype.py", line 485, in _run_variantcall_batch_multicore
    "vrn_file", ["region", "sam_ref", "config"])
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/split.py", line 62, in parallel_split_combine
    split_output = parallel_fn(parallel_name, split_args)
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1, backend="multiprocessing")(joblib.delayed(fn)(*x) for x in items):
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 934, in __call__
    self.retrieve()
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/galaxy/bcbio/anaconda/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
AssertionError: Require full version of GATK 2.4+, or GATK4 for haplotype calling

Note that I'm using GATK3, so version detection for GATK4 is untested (by me at least). I suspect this might be the same issue that was fixed in #2824. I'm currently looking for a fix and will update shortly.

Cheers M

matthdsm commented 4 years ago

~If I'm not mistaken, I suppose this can be fixed by adding the encoding option here. Testing this atm.~

M
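For readers following the "encoding option" idea: a minimal Python sketch of how a bytes/str mismatch can break a version check under Python 3. The helper `gatk_major_minor` and the sample version string are hypothetical and are not taken from bcbio's code. The thread does not confirm that this was the cause of the failure reported here.

```python
import re


def gatk_major_minor(raw_output):
    """Return (major, minor) parsed from a tool's version output, or None.

    Hypothetical helper for illustration only; not part of bcbio.
    """
    # Under Python 3, subprocess output is bytes unless text mode or an
    # encoding is requested, so normalize to str before matching.
    if isinstance(raw_output, bytes):
        raw_output = raw_output.decode("utf-8", errors="replace")
    match = re.search(r"(\d+)\.(\d+)", raw_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Calling str() on bytes keeps the b'...' wrapper, so a prefix test fails.
undecoded = str(b"3.8-1-0-gf15c1c3ef\n")
print(undecoded.startswith("3"))                   # False
print(gatk_major_minor(b"3.8-1-0-gf15c1c3ef\n"))   # (3, 8)
```

Decoding at the point where the output is captured keeps every later comparison working on `str`, which is the usual reason to pass an encoding to the subprocess call.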

matthdsm commented 4 years ago

Also, is CWL still on the roadmap? Or has development (and support) been halted at the moment? ping @chapmanb

Cheers M

matthdsm commented 4 years ago

~So the PR above seems to have fixed the decoding issue, but now bcbio complains it can't find the references for picard.~ Scratch that, it didn't fix anything.

M

chapmanb commented 4 years ago

Matthias; Sorry about the issue and thank you for digging into this. Unfortunately GATK3 won't work with CWL and we didn't plan on supporting this. Because GATK3 is non-free, its installation is trickier and harder to handle in the CWL environment, where you have to pre-define everything and bring it into the environment. We ended up not working on it, in favor of supporting the newer and freely available GATK4 that we could develop around. I would stick with standard bcbio if you want to run GATK3 analyses rather than try to work through this.

Practically, I'm not really able to devote much time to CWL development right now and I don't think it's prioritized in Rory, Sergey and Ilya's plans. Happy to help try to sort out issues but we probably won't be rolling out any significant new features. Sorry to not have as much time to work on this now.
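To make the GATK3 limitation described in this reply concrete, here is a hypothetical pre-flight check, written as a sketch only. It scans an already-parsed sample config (a plain dict shaped like the YAML in the original report) for samples that list `gatk4` under `tools_off`. The function name and the fallback label are assumptions; bcbio is not known to ship such a check.

```python
def samples_disabling_gatk4(config):
    """Return labels of samples whose algorithm section turns off gatk4.

    Hypothetical helper; `config` is a dict shaped like a bcbio sample YAML.
    """
    flagged = []
    for index, sample in enumerate(config.get("details") or []):
        algorithm = sample.get("algorithm") or {}
        tools_off = algorithm.get("tools_off") or []
        # Accept a single string as well as a list of tool names.
        if isinstance(tools_off, str):
            tools_off = [tools_off]
        if "gatk4" in tools_off:
            # Fall back to a positional label when description is empty.
            flagged.append(sample.get("description") or f"sample #{index}")
    return flagged


example = {
    "details": [
        {"description": None, "algorithm": {"tools_off": ["gatk4"]}},
        {"description": "sampleB", "algorithm": {"tools_off": []}},
    ]
}
print(samples_disabling_gatk4(example))  # ['sample #0']
```

A non-empty result would mean the run falls back to GATK3, which according to the reply above is not supported through CWL.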

matthdsm commented 4 years ago

Hi Brad,

Thanks for the reply! How would you feel about using the existing GATK3 logic in the CWL implementation? We could add a disclaimer that it won't work in Docker, but for local installs it should be good to go, no? What do you suppose it would take to get this done? I'm not sure how much work this would be, but I'd be interested in devoting some time to this issue.

Cheers M

naumenko-sa commented 4 years ago

@matthdsm we would much appreciate it if you put some effort into CWL, but GATK3 does not seem to be the best target. See the CWL issues referenced in the priority list.