RFC: PureCN in bcbio - Githubissues

lbeltrame commented 6 years ago

More than an issue, this is mostly to collect some observations I made when testing (I still am) PureCN with bcbio outputs, and they may be useful for integration in bcbio.

PureCN is picky on the inputs it processes, in particular for coverages. Pooled cnn files from CNVkit can't be used as normals (either individual cnn files as a group, or matched cnn files) because they cause interval mismatches (admittedly, I need to retry this after tuning the off-targetr parameters)
Weighting CBS can generate NAs and this will make segmentation fail later on. EDIT: this requires multiple normals like elsewhere, that is why.
PureCN can use segments generated from CNVkit (untested on the results)
Although only VCFs from Mutect (1) are supported properly, it seems (still running) to work reasonably fine with a VCF from VarDict
Some lines in cnn files from CNVkit don't include the data on RepeatMasker (missing data?). These are outputted as \t\t and this causes issues with R's read.table used to load CNVkit files (it treats the double tab as a single field, so parsing fails). EDIT: tracked at https://github.com/lima1/PureCN/issues/38
Generating the normalDB which can be annotated with genes requires at least two normal coverage files. One isn't sufficient.

lima1 commented 6 years ago

Hi Luca, thanks for giving PureCN a try!

You can provide a single coverage file for normalization (via --normal in PureCN), but I wouldn't recommend it. If you have multiple normals, PureCN can perform GATK4's tangent normalization (adding support for offtarget and sex chromosomes). For that it needs all individual coverages to build the database.

But let me know if important input formats are missing, happy to add it.

lbeltrame commented 6 years ago

Thanks for chiming in! The reason I mentioned a single coverage file it's because in bcbio this is the result of a pooled process: that is, it is the result of creating a reference with CNVkit using a pool of normals. This is the default approach used by bcbio, although the user can manually override it.

chapmanb commented 6 years ago

Luca and Markus; Thanks for this discussion and all the work on PureCN. We've started work on supporting PureCN and added a basic framework for preparing inputs with VarDict variant calls and CNVkit based segmentation. This gives us basic support for testing and exploration, but I haven't run any real life samples or validations yet, so there is likely much tweaking to do. So far I've focused on a tumor/normal case where we can feed CNVkit and VarDict calls directly, but would also like to work on supporting tumor-only with process matched normals and will look at getting some data to work with this.

I wanted to give you a heads up if you wanted to play with what's there now or have feedback/suggestions on the integration so far. I'll keep updating as we make progress. Thanks again for this discussion.

lima1 commented 6 years ago

Thanks Brad! I might have some bandwidth in the coming weeks to run bcbio on my test samples.

Compared to CNVkit, in the PureCN segmentation, I do 2 additional steps. I first try to attempt to estimate purity using very ad hoc heuristics to set the DNAcopy sd.undo parameter fairly conservatively (I don't want to remove CNAs in low purity samples).

Then I cluster segments using both b-allele frequencies and log-ratios. I'm not sure if this is necessary in your CNVkit pipeline, but it might be in general helpful to avoid over-segmentation (over-segmentation results in longer runtimes).

lbeltrame commented 6 years ago

In my testing, the hardest part I have found is select an optimal solution for what PureCN found, in particular because the export shows the probability for all the states considered, so it's a little tricky to evaluate (although the docs give some helpful pointers).

Some reasonable defaults should be set there (even downstream in bcbio if it makes sense) to allow an easier interpretation.

lima1 commented 6 years ago

Thanks a lot Luca for testing.

Older versions came with a script that tried to pick the best solution, but it was harder and harder to beat the likelihood model. I eventually removed it because it couldn't anymore.

In our assays, I would say the likelihood model picks the correct one in at least 90% of the cases. That's also what we get when we compare TCGA whole exome against Scott Carter's manual ABSOLUTE SNP6 curation.

Low purity and noisy due to limited input is harder, also cancer types with lots of heterogeneity like lung are usually trickier. Large pool of normals help a lot, especially without matched normals because it removes common artifacts that favor more complex solutions.

TMB and amplification calls are also fairly robust to inaccurate purity/ploidy, at least in tissue.

cfDNA is more difficult than tissue, I use a simple logistic regression trained on a few hundred cases to identify samples without copy number robust to noise.

Bottom line: in clean high quality data, it hopefully isn't a big problem, but if you can share test data where it appears to be, I'm happy to have look.

lbeltrame commented 6 years ago

To be more precise, using the procedure found in the docs: predictSomatic gives you a table, with all the possible states (which I assume they're then annotated in the VCF should one choose to do so instead) of germline and somatic status. The problem is selecting which events are important: aside removing flagged events, should potential filtering focus on the ML.SOMATIC field? I ask this to make sure that one selects the right alterations for further downstream (off-bcbio) analyses.

lima1 commented 6 years ago

Ah, yes. Adding a simplified VCF annotation is on the TODO list (https://github.com/lima1/PureCN/issues/12). Please fell free to suggest changes there.

The predictSomatic output provides the posterior probabilities for all states. A flag means the probabilities are potentially unreliable. ML.SOMATIC and POSTERIOR.SOMATIC are two common summaries of these individual genotype posteriors. ML.SOMATIC simply states that the most likely genotype is somatic, POSTERIOR.SOMATIC is the sum of all somatic genotypes. What you pick as cutoff depends on how you use it. For something like TMB, we are very liberal and just use ML.SOMATIC, because mis-classifications roughly cancel each other out. But sometimes you might need more confidence in somatic calls.

roryk commented 5 years ago

Thanks, we support PureCN now!

bcbio / bcbio-nextgen

RFC: PureCN in bcbio #2469