bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

RFC: Tumor purity/ploidy estimation methods with targeted panels #2639

Closed lbeltrame closed 4 years ago

lbeltrame commented 5 years ago

This is kind of different from the previous discussions because it specifically mentions targeted panels. Lots of methods already present in bcbio require at least exome-level coverage (TitanCNA for sure; I can't say about PureCN and AMBER/PURPLE), but there are some papers which make use of alternative approaches that allow purity estimation on at least hybrid capture targeted datasets (see http://clincancerres.aacrjournals.org/content/23/21/6708 for an application).

My digging so far has only found FACETS (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5027494/; https://github.com/mskcc/facets), which looks actively developed (last commit Dec 2018) and also has a frontend for analysis (https://github.com/wwcrc/cnv_facets) that is available in bioconda. The frontend also produces VCF output, which would be useful for bcbio.

FACETS is Free Software (GPL2+, per the DESCRIPTION file in the repo), which means there are no worries with regard to usage.

I haven't tested FACETS yet but I'm opening this issue to start some discussion, given that especially for certain applications targeted approaches are chosen in place of whole genome / exome, in particular when lots of samples are part of the equation.

chapmanb commented 5 years ago

Luca; Thanks much for starting this discussion and the pointer to cnv_facets. That's a really useful frontend wrapper to make running FACETS easier. We've done comparisons of bcbio integrated TitanCNA and PureCN with both FACETS and ASCAT run outside of bcbio and found similar results, although there are lots of sample specific differences. We've found it hard to choose a winning method without comprehensive validation sets.

Practically, PureCN does work on panels and is how we've been using it, so I think it would be worth exploring with your panel data to see if it provides useful results. @lima1 has been incredibly helpful with questions as we ran comparisons so might be able to help if you run into issues.

Thanks again for the discussion.

lima1 commented 5 years ago

We are using PureCN almost exclusively on panels (2-3MB) and it was designed for tumor-only panels. We added some copy number tiling probes to our panels, but it works reasonably well without. It's important to squeeze out all the information you can get, for example by optimizing the off-target bin width and the size of the flanking regions. This is described in the vignette in detail.

We have a benchmark of tumor-only PureCN against tumor/normal matched Absolute and FACETS on TCGA WES data, hopefully up on bioRxiv in the next 2-3 weeks. PureCN is much slower than FACETS because it calculates genotype posterior probabilities for all SNVs and automatically generates results for all local optima, but for panels that shouldn't matter much.

lbeltrame commented 5 years ago

FYI, when I say panels in this case I should specify "small panels" (~140 genes, around 400kbp): most of these things are additions to existing clinical trials, so the decision was to optimize for sample size rather than for covered regions.

Speaking of that (question for @lima1), am I correct in understanding that PureCN works better if we have a largish number of process-matched normal samples as opposed to matched normals?

lima1 commented 5 years ago

If this is TruSight 170, I've heard from someone who got PureCN running with okay results. The high coverage gives you a decent number of off-target reads and good coverage in flanking regions. Experiment a little bit with these parameters. More samples will need manual curation because of the poor SNP resolution, but that's fairly straightforward.

Yes, matched normals are almost always much noisier than process matched normals for coverage normalization. For standard assays such as TruSight, you might be able to get normals from the vendor.

lbeltrame commented 5 years ago

> If this is TruSight 170, I've heard from someone who got PureCN running with okay results.

No, currently we run custom panels tailored to the study (this was an addition to an existing ongoing clinical trial).

> Yes, matched normals are almost always much noisier than process matched normals for coverage normalization.

This is interesting, as in my experience it's the exact opposite of somatic variant calling, where noise with process-matched normals (although you can't use them in exactly the same fashion as PureCN does) is far higher.

lima1 commented 5 years ago

Okay, still worth a try. 400kb is small: you sequence only a bit more than 0.01% of the genome, so getting the average copy number of the whole genome (ploidy) right is tricky. But again, if you have decent off-target coverage and squeeze the flanking regions to get as many SNPs as possible, you should get some usable results. It will probably require manual checks for ploidy, though.
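
The arithmetic behind the "0.01%" figure, and the notion of ploidy as a genome-wide average copy number, can be sketched in a few lines of Python. This is only an illustration: the 3.1 Gb genome size, the function names, and the example segments are assumptions, not values from this thread.

```python
# Assumed haploid human genome size (~3.1 Gb); not stated in the thread.
GENOME_BP = 3_100_000_000


def panel_fraction(panel_bp, genome_bp=GENOME_BP):
    """Fraction of the genome covered by the panel targets."""
    return panel_bp / genome_bp


def ploidy(segments):
    """Length-weighted mean copy number.

    segments: iterable of (length_bp, copy_number) pairs (hypothetical input).
    """
    segments = list(segments)
    total = sum(length for length, _ in segments)
    return sum(length * cn for length, cn in segments) / total


print(f"{panel_fraction(400_000):.4%}")  # 0.0129%

# Two equally long segments, one diploid and one at four copies.
print(ploidy([(100, 2), (100, 4)]))  # 3.0
```

Because ploidy is an average over the whole genome, a panel that samples roughly 0.01% of it sees only a handful of segments directly, which is why off-target reads and flanking SNPs matter so much here.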

The matched normal usually just doesn't fit the coverage profile of the tumor that well. Hybrid capture data is noisy. But over the whole panel of normals, you can clean up the noise quite a bit. We started using a simple PCA to find the best matching normals, but an approach similar to CapSeg's, or now GATK4's, tangent normalization works slightly better, so we switched to that.
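
The general idea of tangent normalization is to project the tumor coverage profile onto the space spanned by the panel of normals and keep only the residual. Below is a minimal pure-Python sketch of that idea, assuming log-transformed, centered per-interval coverage as input. It is not the PureCN, CapSeg, or GATK4 implementation; all function names and the toy data are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-9):
    """Gram-Schmidt: orthonormal basis spanning the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coef = dot(w, b)
            w = [wi - coef * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip normals that add no new direction
            basis.append([wi / norm for wi in w])
    return basis


def tangent_normalize(tumor, normals):
    """Remove from the tumor profile its projection onto the normals' span."""
    residual = list(tumor)
    for b in orthonormal_basis(normals):
        coef = dot(residual, b)
        residual = [ri - coef * bi for ri, bi in zip(residual, b)]
    return residual


# Toy data: a capture bias shared by all samples, plus a tumor-only gain
# on the last two intervals.
bias = [0.5, -0.5, 0.3, -0.3, 0.0, 0.0]
gain = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
normals = [bias, [0.8 * x for x in bias]]
tumor = [b + g for b, g in zip(bias, gain)]

residual = tangent_normalize(tumor, normals)
print([round(r, 6) for r in residual])
```

In this toy case the shared bias is removed and the gain survives because it is orthogonal to the normals. With real data, any part of the tumor signal that correlates with the normals is removed as well, and production tools typically reduce the panel with an SVD rather than using every normal directly.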

Yes, a matched normal is better for filtering SNVs; ideally, of course, you use both, matched and pool. One of the main goals in PureCN was filtering tumor-only data. It works pretty well now. The pool gets rid of most artifacts, and the PureCN allelic fraction adjustment for allele-specific copy number removes most private germline variants, at least at purities below 80%, ideally below 60%.
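
Why purity matters for separating private germline from somatic variants can be seen from the standard tumor/normal mixture model for expected allelic fractions. The sketch below uses that textbook model only; it is not PureCN's code (which computes posterior probabilities rather than point expectations), and the function name and parameters are assumptions.

```python
def expected_af(purity, tumor_cn, mult, germline):
    """Expected allelic fraction of a variant in a tumor/normal mixture.

    purity:   fraction of tumor cells in the sample (0..1)
    tumor_cn: total copy number at the locus in tumor cells
    mult:     number of tumor copies carrying the variant
    germline: True for a heterozygous germline SNP (1 of 2 normal copies
              carries it), False for a somatic variant (0 normal copies)
    """
    normal_copies = 1 if germline else 0
    variant = purity * mult + (1 - purity) * normal_copies
    total = purity * tumor_cn + (1 - purity) * 2
    return variant / total


# Copy-neutral diploid locus, variant on one tumor copy.
for purity in (0.4, 0.6, 0.8, 0.95):
    som = expected_af(purity, 2, 1, germline=False)
    ger = expected_af(purity, 2, 1, germline=True)
    print(f"purity={purity:.2f} somatic={som:.3f} germline={ger:.3f} gap={ger - som:.3f}")
```

Under this model the gap between germline and somatic allelic fractions shrinks as purity rises, since fewer normal cells remain to carry the germline allele alone, which is consistent with the adjustment working best at lower purities.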