bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

PureCN in paired mode (wrongly) requires CNVkit to run #3497

Open lbeltrame opened 3 years ago

lbeltrame commented 3 years ago

Version info

To Reproduce

Exact bcbio command you have used:

bcbio_nextgen.py -t local -n 1 /path/to/bcbio_system.yaml /path/to/configuration/config.yaml

Your sample configuration file:

details:
- algorithm:
    aligner: bwa
    coverage: my_regions
    effects: vep
    ensemble:
      numpass: 1
    exclude_regions:
    - polyx
    - lcr
    - altcontigs
    mark_duplicates: true
    platform: illumina
    quality_format: Standard
    realign: false
    recalibrate: false
    svcaller:
    - purecn
    variant_regions: my_regions
    variantcaller:
    - vardict
    - mutect2
  analysis: variant2
  description: Sample1
  files:
  - /mnt/data/fastq/Sample1_R1.fastq.gz
  - /mnt/data/fastq/Sample1_R2.fastq.gz
  genome_build: hg38
  metadata:
    batch: Sample1_vs_control
    kind: tissue
    phenotype: tumor
- algorithm:
    aligner: bwa
    coverage: my_regions
    effects: vep
    ensemble:
      numpass: 1
    exclude_regions:
    - polyx
    - lcr
    - altcontigs
    mark_duplicates: true
    platform: illumina
    quality_format: Standard
    realign: false
    recalibrate: false
    svcaller:
    - purecn
    variant_regions: my_regions
    variantcaller:
    - vardict
    - mutect2
  analysis: variant2
  description: Control1
  files:
  - /mnt/data/fastq/Control1_R1.fastq.gz
  - /mnt/data/fastq/Control1_R2.fastq.gz
  genome_build: hg38
  metadata:
    batch: Sample1_vs_control
    kind: tissue
    phenotype: normal

Observed behavior

Error message or bcbio output:

Traceback (most recent call last):
  File "/home/share/bcbio-tools/bin/bcbio_nextgen.py", line 245, in <module>
    main(**kwargs)
  File "/home/share/bcbio-tools/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 50, in run_main
    fc_dir, run_info_yaml)
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 91, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 179, in variant2pipeline
    samples = structural.run(samples, run_parallel, "standard")
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/structural/__init__.py", line 228, in run
    for xs in to_process.values()))
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1, backend="multiprocessing")(joblib.delayed(fn)(*x) for x in items):
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/utils.py", line 59, in wrapper
    return f(*args, **kwargs)
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multitasks.py", line 359, in detect_sv
    return structural.detect_sv(*args)
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/structural/__init__.py", line 254, in detect_sv
    for svdata in caller_fn(items):
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/structural/purecn.py", line 37, in run
    purecn_out = _run_purecn(paired, work_dir)
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/structural/purecn.py", line 153, in _run_purecn
    cnr_file, seg_file = segfns[cnvkit.bin_approach(paired.tumor_data)](cnr_file, work_dir, paired)
  File "/home/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio/structural/cnvkit.py", line 48, in bin_approach
    if norm_file.endswith(("-crstandardized.tsv", "-crdenoised.tsv")):
AttributeError: 'NoneType' object has no attribute 'endswith'

Expected behavior: bcbio should use PureCN's own method for a paired normal when no normalDB is available (this mode is documented in the PureCN docs).

Additional context: This happens because the code assumes CNVkit has run even when it has not. CNVkit is not actually needed for PureCN to use a normal file (although PureCN upstream does not recommend running this way).
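To illustrate the failure mode, here is a minimal standalone sketch, not bcbio's actual code. Only the `endswith` check on `norm_file` comes from the traceback above. The nested `depth -> bins -> normalized` lookup, the helper names `get_normalized_file` and `bin_approach_guarded`, and the `.cnr` branch are assumptions made for illustration. It shows how a guard could return `None` instead of raising `AttributeError` when no CNVkit or GATK normalized file exists.

```python
from typing import Optional

# Suffixes taken from the check shown in the traceback.
GATK_SUFFIXES = ("-crstandardized.tsv", "-crdenoised.tsv")


def get_normalized_file(data: dict) -> Optional[str]:
    """Hypothetical lookup of data["depth"]["bins"]["normalized"].

    Returns None if any level is missing; the key layout is an assumption.
    """
    cur = data
    for key in ("depth", "bins", "normalized"):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


def bin_approach_guarded(data: dict) -> Optional[str]:
    """Return the binning approach, or None when no normalized file exists."""
    norm_file = get_normalized_file(data)
    if norm_file is None:
        # Neither CNVkit nor GATK CNV produced a normalized coverage file.
        return None
    if norm_file.endswith(GATK_SUFFIXES):
        return "gatk-cnv"
    if norm_file.endswith(".cnr"):
        return "cnvkit"
    return None


# A sample with svcaller: [purecn] only has no normalized bins.
print(bin_approach_guarded({}))
print(bin_approach_guarded({"depth": {"bins": {"normalized": "s1.cnr"}}}))
```

A guard like this only avoids the crash. The caller in `_run_purecn` would still need a separate code path for the `None` case, for example one that runs PureCN's own paired-normal method.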

lbeltrame commented 3 years ago

In fact, the whole code path without a normal DB cannot run, because it was not ported to the new way of using Rscript and still relies on GATK and CNVkit. It needs a rewrite to handle a paired normal. Since I actually need this, I might get to it if no one beats me to it.

naumenko-sa commented 3 years ago

Hi @lbeltrame !

Yes, those are remnants of the old CNVkit + PureCN workflow. Given that running without a panel of normals (PON) is discouraged, we usually run PureCN tumor/normal (T/N) analyses with a PON as well, or tumor-only with a PON, according to the docs: https://bcbio-nextgen.readthedocs.io/en/latest/contents/purecn.html

Please feel free to open a PR!

SN