Hi @waemm !
Thanks for initiating this discussion!
We re-enabled PureCN recently (in bcbio 1.2.1) and created an r36 environment for it. We definitely need better documentation and validation of PureCN in bcbio to make sure we are running it properly.
My understanding is that PureCN currently receives segmentation information from either cnvkit or gatk-cnv, so one needs to include one of these callers in the yaml to be able to run PureCN.
The --rds parameter in the PureCN call saves PureCN data for making plots; it is not a mapping bias RDS.
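As a rough illustration of that requirement, the check below takes the list of structural variant callers configured for a sample and verifies that purecn is paired with cnvkit or gatk-cnv. The function name and the assumption that the callers arrive as a plain list of strings are hypothetical; this is a sketch, not bcbio's own validation code.

```python
# Hypothetical helper, not part of bcbio. The caller names are the ones
# mentioned in this thread; everything else is an assumption for illustration.
SEGMENTATION_CALLERS = {"cnvkit", "gatk-cnv"}


def purecn_has_segmentation(svcallers):
    """Return False only when purecn is requested without a segmentation caller."""
    callers = {name.lower() for name in svcallers}
    if "purecn" not in callers:
        # Nothing to check: PureCN is not requested for this sample.
        return True
    return bool(callers & SEGMENTATION_CALLERS)


if __name__ == "__main__":
    for configured in (["gatk-cnv", "purecn"], ["purecn"]):
        print(configured, purecn_has_segmentation(configured))
```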
PureCN runs on tumor-only samples (and it was designed to do so!), but in a bcbio test run I have, it reports that input QC failed:
Cannot find valid purity/ploidy solution. This happens when input
segmentations are garbage, most likely due to a catastrophic sample QC
failure. Re-check standard QC metrics for this sample.
This is most likely a user error due to invalid input data or
parameters (PureCN 1.16.0).
Error: Cannot find valid purity/ploidy solution. This happens when input
segmentations are garbage, most likely due to a catastrophic sample QC
failure. Re-check standard QC metrics for this sample.
This is most likely a user error due to invalid input data or
parameters (PureCN 1.16.0).
I think it is because in bcbio we rely on cnvkit or gatk-cnv input for PureCN, and these tools are better suited for T/N case, so there should be a way to better utilize PureCN's potential.
There are definitely better experts on PureCN in bcbio community, please chime in @lbeltrame. @ohofmann maybe you could nominate somebody?
Sergey
Sorry for the delay in the response. Ideally PureCN would need a process-matched control DB instead of matched samples, because it is far better with a group of samples (also in my personal experience), but in this case, it needs to perform the segmentation itself, and not rely on GATK/CNVkit, IIRC. @lima1 can clarify this, I think.
In that case, we'd need to create on and off target data using PureCN's scripts, then pool samples classified as background, extract coverage information (again PureCN), run GATK4 for germline variants on the pool of normals, and then combine everything together. We can either do that ourselves, or ask the users to do it outside bcbio.
Hi everybody,
the default PureCN segmentation/normalization is pretty much the same as GATK4, with support for off-target and sex chromosomes. CNVkit normalized with a reference usually provides similar results in my experience.
The mapping bias file looks at allelic fractions of germline SNPs in the pool of normals. Similar to recent Mutect2 (they do it for artifacts though), it models beta-binomial distributions of allelic fractions to flag variants with large deviation from the expected 0.5 and to calculate p-values of unbalanced-ness of SNPs in tumor. Building the mapping bias file requires merging VCFs from the normal samples into a multi-sample VCF. This is something I plan to make a bit easier, but not sure it will happen in the next months.
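To make the idea concrete, here is a minimal sketch of a beta-binomial test for one SNP. It is not PureCN's implementation: the method-of-moments fit, the function names, and the example allelic fractions are all assumptions chosen for illustration. It fits a beta distribution to the allelic fractions seen in the pool of normals, then asks how surprising a tumor alt-read count is under the resulting beta-binomial model.

```python
import math
from statistics import mean, variance


def fit_beta_moments(fractions):
    """Method-of-moments beta fit to allelic fractions (illustrative only)."""
    m = mean(fractions)
    v = variance(fractions)
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("fractions are not compatible with a beta distribution")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def _log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """Probability of k alt reads out of n under a beta-binomial(a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + _log_beta(k + a, n - k + b) - _log_beta(a, b))


def two_sided_pvalue(k, n, a, b):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    observed = betabinom_pmf(k, n, a, b)
    total = sum(
        p
        for p in (betabinom_pmf(i, n, a, b) for i in range(n + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


if __name__ == "__main__":
    # Made-up allelic fractions of one heterozygous SNP across normal samples.
    normals = [0.48, 0.52, 0.50, 0.47, 0.53, 0.49, 0.51]
    a, b = fit_beta_moments(normals)
    print("balanced SNP p-value:", two_sided_pvalue(50, 100, a, b))
    print("unbalanced SNP p-value:", two_sided_pvalue(80, 100, a, b))
```

A SNP whose normals already deviate from 0.5 would get a shifted fit, which is the kind of position-specific bias the mapping bias file is meant to capture.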
Sergey, feel free to send me output log-files. Yes, this indeed happens when there are issues with the setup and the data are mostly noise. I've never seen those failures in our data.
Markus
Thanks @lbeltrame, I was wondering whether bcbio had built-in support for the steps you describe. For now I am doing this outside of the bcbio workflow.
Thanks everyone for the good discussion about this, we definitely should implement doing this properly inside of bcbio itself, so I opened the issue back up.
@naumenko-sa, not sure that's related, but a user posted a bug report where the issue was a significant number of tumor vs normal log-ratio outliers in the CNVkit cnr coverage file. PureCN is now ignoring all log-ratios < -8.
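For anyone who wants to check their own files for such outliers, a small sketch follows. It is not PureCN code: it assumes a tab-separated cnr file with a header row containing a log2 column, and the function name is made up for this example.

```python
import csv
import io

LOG2_FLOOR = -8.0  # threshold mentioned in this thread


def drop_extreme_log_ratios(cnr_handle, floor=LOG2_FLOOR):
    """Split cnr rows into kept rows and a count of rows with log2 below floor."""
    reader = csv.DictReader(cnr_handle, delimiter="\t")
    kept, dropped = [], 0
    for row in reader:
        if float(row["log2"]) < floor:
            dropped += 1
        else:
            kept.append(row)
    return kept, dropped


if __name__ == "__main__":
    # Tiny made-up example standing in for a real cnr file.
    example = io.StringIO(
        "chromosome\tstart\tend\tgene\tlog2\n"
        "chr1\t100\t200\tGENE_A\t-0.12\n"
        "chr1\t300\t400\tGENE_B\t-19.5\n"
        "chr2\t100\t200\tGENE_C\t0.40\n"
    )
    kept, dropped = drop_extreme_log_ratios(example)
    print(len(kept), "rows kept;", dropped, "rows below", LOG2_FLOOR)
```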
Hi @waemm, @lbeltrame , and everyone interested in running purecn in bcbio!
I've merged purecn functionality and docs, please give it a try and we appreciate your feedback! https://github.com/bcbio/bcbio-nextgen/pull/3364 https://bcbio-nextgen.readthedocs.io/en/latest/contents/purecn.html
Sergey
@naumenko-sa I left a couple of comments on the PR. Tomorrow I'll go through it and see if I find anything else.
PureCN PON creation runs OK in hg19 now. The calling step still needs the gnomad_af_only resource. I'm building it since it does not seem to be available from the Broad. https://gatk.broadinstitute.org/hc/en-us/community/posts/360058276951-Which-file-is-af-only-gnomad-hg38-vcf-gz-
We successfully ran a production-size project (149 T/N pairs + normal db) with purecn in bcbio (hg38).
Some tips:
#SBATCH --mem=40G in the slurm script;
-t ipython -n 400 -s slurm -q core --tag "tn" -r t=2-00:00:00 -r conmem=20 --timeout 8500 as bcbio options.
Let us know how it is behaving in the other installations.
Before the Christmas holidays we ran a batch of about 30 samples with no issues. In a few days we'll be testing with around 80, but so far the runs have been very smooth.
I would suggest investigating whether it makes sense to re-enable Dx.R runs (perhaps optionally), because they give additional and useful information such as copy number burden and estimates of the prevalence of the trinucleotide mutational signatures (as described in the Alexandrov paper from 2011, IIRC).
Hi @lbeltrame !
Mutation signatures are available in bcbio 1.2.8 when running PureCN with a PON. It would need a new 1.9.0 r-deconstructsigs for that: https://github.com/bioconda/bioconda-recipes/tree/master/recipes/r-deconstructsigs
Let us know if you see any issues with that!
Sergey
Hi all, I tried to run PureCN through bcbio but the program suddenly halted. Can anyone help me with this problem? Thanks a lot!
Command: bcbio_nextgen.py ../config/pon.yaml -n 20
Log:
...
[2021-05-11T09:34Z] Running:
[2021-05-11T09:34Z] java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms681m -Xmx3181m -Djava.io.tmpdir=/testPureCN/4-30/bcbio_nextgen/pon/work/bcbiotx/tmpirt0kj7o -jar /Software/bcbio-nextgen/current/anaconda/share/gatk4-4.2.0.0-1/gatk-package-4.2.0.0-local.jar ConvertSequencingArtifactToOxoG --INPUT_BASE /testPureCN/4-30/bcbio_nextgen/pon/work/metrics/artifact/PBMC10/PBMC10 -O /testPureCN/4-30/bcbio_nextgen/pon/work/bcbiotx/tmp11h6qr44/PBMC10/PBMC10 -R /Software/bcbio-nextgen/current/genomes/Hsapiens/hg38/seq/hg38.fa
[2021-05-11T09:34Z] Fixing /testPureCN/4-30/bcbio_nextgen/pon/work/bcbiotx/tmp11h6qr44/PBMC10/PBMC10.oxog_metrics to work with MultiQC.
Traceback (most recent call last):
File "/Software/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 245, in
Sorry @xuwanxing, I missed that message.
The error is that bcbio can't find the mosdepth program, so your installation is incomplete.
Please try to update with bcbio_nextgen.py upgrade -u skip --tools and make sure your PATH variable is set as suggested in the docs.
You should then see the output of which mosdepth.
Please open a new issue if it is not working.
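The same PATH check can be scripted. The sketch below uses Python's standard shutil.which; the helper name and the choice of tools to check are assumptions for illustration, not something bcbio ships.

```python
import shutil


def find_missing_tools(tools):
    """Return the names from tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(["mosdepth"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found on PATH")
```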
SN
Hi guys,
I've been trying to run purecn through the bcbio workflow, but there is very little documentation on how this works or on whether I can provide mapping bias or panel of normal data. Some questions I had:
From looking at the script, it looks like a mapping bias RDS is included, but how? Is it calculated automatically, or do you provide it through the YAML? Any further detail here would be great!
Also, does this run as tumor-only or tumor-normal, or can it do both?
Have there been any comparisons of samples run with this workflow to the standard PureCN workflow?
Thanks!