bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 354 forks source link

Running PureCN through bcbio #3230

Closed waemm closed 3 years ago

waemm commented 4 years ago

Hi guys,

I've been trying to run purecn through the bcbio workflow, there is very little documentation on how this works or whether I can provide mapping bias or panel of normal data. Some questions I had:

  1. From looking at the script it looks like a mapping bias RDS is included but how? Is this automatically calculated or do you provide it through the YAML? Any further detail here would be great!

  2. Also, does this run as tumor-only or tumor-normal or can it do both?

  3. Have there been any comparisons of samples run with this workflow to the standard PureCN workflow ?

Thanks!

naumenko-sa commented 4 years ago

Hi @waemm !

Thanks for initiating this discussion!

We have re-enabled PureCN recently (in bcbio 1.2.1), created r36 environment for it, and we definitely need better documentation and validation of PureCN in bcbio to make sure we are running it properly.

My understanding is that currently PureCN receives segmentation information from either cnvkit or gatk-cnv, so one needs to include one of these callers in the yaml to be able to run PureCN.

--rds parameters in the PureCN call is to save PureCN data to make plots, it is not a mapping bias RDS.

PureCN runs on tumor only samples (and it was designed to do so!), but in bcbio in a test run I have it reports that input QC failed:

Cannot find valid purity/ploidy solution. This happens when input 
segmentations are garbage, most likely due to a catastrophic sample QC 
failure. Re-check standard QC metrics for this sample. 

This is most likely a user error due to invalid input data or 
parameters (PureCN 1.16.0). 
Error: Cannot find valid purity/ploidy solution. This happens when input 
segmentations are garbage, most likely due to a catastrophic sample QC 
failure. Re-check standard QC metrics for this sample. 
This is most likely a user error due to invalid input data or 
parameters (PureCN 1.16.0). 

I think it is because in bcbio we rely on cnvkit or gatk-cnv input for PureCN, and these tools are better suited for T/N case, so there should be a way to better utilize PureCN's potential.

There are definitely better experts on PureCN in bcbio community, please chime in @lbeltrame. @ohofmann maybe you could nominate somebody?

Sergey

lbeltrame commented 4 years ago

Sorry for the delay in the response. Ideally PureCN would need a process-matched control DB instead of matched samples, because it is far better with a group of samples (also in my personal experience), but in this case, it needs to perform the segmentation itself, and not rely on GATK/CNVkit, IIRC. @lima1 can clarify this, I think.

In that case, we'd need to create on and off target data using PureCN's scripts, then pool samples classified as background, extract coverage information (again PureCN), run GATK4 for germline variants on the pool of normals, and then combine everything together. We can either do that ourselves, or ask the users to do it outside bcbio.

lima1 commented 4 years ago

Hi everybody,

the default PureCN segmentation/normalization is pretty much the same as GATK4, with support for off-target and sex chromosomes. CNVkit normalized with a reference usually provides similar results in my experience.

The mapping bias file looks at allelic fractions of germline SNPs in the pool of normals. Similar to recent Mutect2 (they do it for artifacts though), it models beta-binomial distributions of allelic fractions to flag variants with large deviation from the expected 0.5 and to calculate p-values of unbalanced-ness of SNPs in tumor. Building the mapping bias file requires merging VCFs from the normal samples into a multi-sample VCF. This is something I plan to make a bit easier, but not sure it will happen in the next months.

Sergey, feel free to send me output log-files. Yes, this indeed happens when there are issues with the setup and the data are mostly noise. I've never seen those failures in our data.

Markus

waemm commented 4 years ago

Thanks @lbeltrame, I was wondering if bcbio had support for the steps you describe built in but for sure I am currently doing this outside of bcbio workflow.

roryk commented 4 years ago

Thanks everyone for the good discussion about this, we definitely should implement doing this properly inside of bcbio itself, so I opened the issue back up.

lima1 commented 4 years ago

@naumenko-sa , not sure that's related, but a user posted a bug report and the issue was a significant number of tumor vs normal log-ratios outliers in the CNVkit cnr coverage file. PureCN is now ignoring all log-ratios < -8.

naumenko-sa commented 4 years ago

Hi @waemm, @lbeltrame , and everyone interested in running purecn in bcbio!

I've merged purecn functionality and docs, please give it a try and we appreciate your feedback! https://github.com/bcbio/bcbio-nextgen/pull/3364 https://bcbio-nextgen.readthedocs.io/en/latest/contents/purecn.html

Sergey

lbeltrame commented 4 years ago

@naumenko-sa I left a couple of comments on the PR. Tomorrow I'll go through it and see if I find anything else.

naumenko-sa commented 3 years ago

PureCN PON creation runs OK in hg19 now. the calling step still needs gnomad_af_only resource. I'm building it since it does not look available from the Broad. https://gatk.broadinstitute.org/hc/en-us/community/posts/360058276951-Which-file-is-af-only-gnomad-hg38-vcf-gz-

naumenko-sa commented 3 years ago

we successfully ran a production size (149 T/N pairs + normal db) project with purecn in bcbio (hg38).

some tips:

Let us know how it is behaving in the other installations.

lbeltrame commented 3 years ago

Before the Christmas holidays we ran a batch of about 30 samples with no issues. In a few days we'll be testing with around 80, but so far the runs have been very smooth.

I would suggest to investigate whether it makes sense to re-enable Dx.R runs (perhaps optionally), because it gives additional and useful information like copy number burden, and estimations on the prevalence of the trinucleotide mutational signatures (as described in the Alexandrov paper from 2011, IIRC).

naumenko-sa commented 3 years ago

Hi @lbeltrame !

Mutation signatures are available in bcbio1.2.8 when running pureCN w PON. It would need a new 1.9.0 r-deconstructsigs for that https://github.com/bioconda/bioconda-recipes/tree/master/recipes/r-deconstructsigs

Let us know if you see any issues with that!

Sergey

xuwanxing commented 3 years ago

Hi all, I tried to run PureCN through bcbio but the program suddenly halted. Can anyone help me with this problem? Thanks a lot! Command: bcbio_nextgen.py ../config/pon.yaml -n 20 Log: ... [2021-05-11T09:34Z] Running: [2021-05-11T09:34Z] java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms681m -Xmx3181m -Djava.io.tmpdir=/testPureCN/4-30/bcbio_nextgen/pon/work/bcbiotx/tmpirt0kj7o -jar /Software/bcbio-nextgen/current/anaconda/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar ConvertSequencingArtifactToOxoG --INPUT_BASE /testPureCN/4-30/bcbio_nextgen/pon/work/metrics/artifact/PBMC10/PBMC10 -O /testPureCN/4-30/bcbio_nextgen/pon/work/bcbiotx/tmp11h6qr44/PBMC10/PBMC10 -R /Software/bcbio-nextgen/current/genomes/Hsapiens/hg38/seq/hg38.fa [2021-05-11T09:34Z] Fixing /testPureCN/4-30/bcbio_nextgen/pon/work/bcbiotx/tmp11h6qr44/PBMC10/PBMC10.oxog_metrics to work with MultiQC. Traceback (most recent call last): File "/Software/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 245, in main(kwargs) File "/Software/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 46, in main run_main(kwargs) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 50, in run_main fc_dir, run_info_yaml) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 91, in _run_toplevel for xs in pipeline(config, run_info_yaml, parallel, dirs, samples): File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 138, in variant2pipeline samples = run_parallel("postprocess_alignment", samples) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel return run_multicore(fn, items, config, parallel=parallel) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore for data in joblib.Parallel(parallel["num_jobs"], batch_size=1, backend="multiprocessing")(joblib.delayed(fn)(x) for x in items): File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 1041, in call if self.dispatch_one_batch(iterator): File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch self._dispatch(tasks) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 777, in _dispatch job = self._backend.apply_async(batch, callback=cb) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 208, in apply_async result = ImmediateResult(func) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 572, in init self.results = batch() File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 263, in call for func, args, kwargs in self.items] File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 263, in for func, args, kwargs in self.items] File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/utils.py", line 59, in wrapper return f(args, *kwargs) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/distributed/multitasks.py", line 157, in postprocess_alignment return sample.postprocess_alignment(args) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/pipeline/sample.py", line 281, in postprocess_alignment covinfo = callable.sample_callable_bed(bam_file_ready, ref_file, data) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/bam/callable.py", line 45, in sample_callable_bed callable_bed, depth_files = coverage.calculate(bam_file, data, sv_bed) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/variation/coverage.py", line 101, in calculate per_base=per_base) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/variation/coverage.py", line 243, in run_mosdepth mosdepth = config_utils.get_program("mosdepth", data) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/pipeline/config_utils.py", line 191, in get_program return _get_program_cmd(name, pconfig, config, default) File "/Software/bcbio-nextgen/current/anaconda/lib/python3.6/site-packages/bcbio/pipeline/config_utils.py", line 217, in wrap raise CmdNotFound(" ".join(map(repr, (fn.name if six.PY3 else fn.func_name, name, pconfig, default)))) bcbio.pipeline.config_utils.CmdNotFound: '_get_program_cmd' 'mosdepth' {} None

naumenko-sa commented 3 years ago

sorry @xuwanxing , I missed that message.

The error is that bcbio can't find mosdepth program, so your installation is incomplete. Please try to update bcbio_nextgen.py upgrade -u skip --tools and make sure your PATH variable is set as suggested in the docs. You should see the output of which mosdepth then.

Please open a new issue if it is not working.

SN