Truth sets and organization for Titration and RAS panel validation

Illumina / Pisces

Somatic and germline variant caller for amplicon data. Recommended caller for tumor-only workflows.

GNU General Public License v3.0

94 stars 16 forks source link

Truth sets and organization for Titration and RAS panel validation #14

Open chapmanb opened 6 years ago

chapmanb commented 6 years ago

Tamsen and all; Thanks for your work on Pisces and for the work on the pre-print. The titration and validation datasets you've made available are an incredible resource and I'd like to make them part of the standard validations we run on somatic callers (https://github.com/bcbio/bcbio_validation_workflows). I've been looking through the Basespace project sharing the supplemental materials and found all the BAM files but had a couple of questions:

Do you have VCF truth sets for expected variations in the Titration and RASPanel?
Is there a publicly available BED file of the regions examined in those two validations?
Are the multiple Titration BAMs at each dilution percentage replicates? How did you combine or otherwise process them prior to your paper validations?
How do the RASPanel inputs get called/validated? Are these independent samples with unique validations in each? Have you given thought to being able to combine these into a single (or fewer) sets with potentially multiple mutations?

Thanks again for this great work and resources, looking forward to working with the data and Pisces.

tamsen commented 6 years ago

Hi,

Thanks for your interest.

Do you have VCF truth sets for expected variations in the Titration and RASPanel?

Yes, I have these files. Its quite a lot of data. Happy to share. Let me see if there is an easy way for me to put them up on BS.

Is there a publicly available BED file of the regions examined in those two validations?

Yes, I can add that also.

Are the multiple Titration BAMs at each dilution percentage replicates? How did you combine or otherwise process them prior to your paper validations?

Yes, they are replicates. There was no pre-processing of the bams.

How do the RASPanel inputs get called/validated? Are these independent samples with unique validations in each? Have you given thought to being able to combine these into a single (or fewer) sets with potentially multiple mutations?

For the RAS panel data, the BAMs are actually pairs. The same tumor sample was essentially split into 3 wet samples. One went to independent validation (therascreen and/or sanger), and two wet samples were run through a MiSeq using complimentary probe pool designs, producing two uniquely paired BAMs (covering the same regions, but different tiling). So there is one unique truth file per pair of RAS BAMs.

I considered combining them, but my general thought was to be as "hands off" as possible with the input data, and keep it as raw as possible.

Thanks again.

chapmanb commented 6 years ago

Tamsen; Thanks so much for the orientation and offers to make the truth and BED data available. This would be so helpful.

For data combining/organizing, my goal was to have a few different test sets derived from these that we can regularly run to compare and improve tumor-only callers. So any thoughts about what might be a "minimal" set of combined BAMs if I didn't mind being more hands on would be very helpful once you've had time to organize all the truth sets.

Thank you again for all this work.

tamsen commented 6 years ago

Truth and (slightly) more minimal data sets are here. https://basespace.illumina.com/s/6vuFN2Cn0Ems

A few thoughts-

The RAS dataset is a nice set for testing tumor only callers, although it does not include indels. The RAS frequencies across the dataset form a nice curve peaking at 30% but go all the way down to 1%, so there is a good frequency distribution from a real set of old FFPE samples. If you synthetically recombine the bams (ie, adding S1 and S2 bams together, which come from different probe pools, or even just combining bams that might have been processed through different sequencers), you will loose some of the power to really stress the amplicon callers, because the bias in the observations will start to wash out.

The titration data has many 'golden' indels, but it's not typical of the clinic. It is a good counterpoint to the drawbacks of the RAS dataset.

Feel free to email me if you want to discuss further (pisces@illumina.com). Your project sounds awesome, a real enabler for open source development. Where can I find the details on what goes into your accuracy assessments?

chapmanb commented 6 years ago

Tamsen; Thanks so much for making these available. I've organized them into a data collection, provided download scripts, and prepared bcbio and Common Workflow Language setups to make it easy for others to prepare and run:

https://github.com/bcbio/bcbio_validation_workflows#somatic-low-frequency-variants

Initially I've taken a single replicate from each to use, and you can see the details of those in the configuration inputs (used to generate the CWL and run):

https://github.com/bcbio/bcbio_validation_workflows/blob/master/somatic-lowfreq/pisces-ras.yaml https://github.com/bcbio/bcbio_validation_workflows/blob/master/somatic-lowfreq/pisces-titr.yaml

Here's my first pass at validations comparing what we're seeing in bcbio to your baselines from the paper. This includes Pisces, VarDict and FreeBayes (we don't have LoFreq in bcbio):

https://github.com/bcbio/bcbio_validations/tree/master/somatic-lowfreq

I'd love to get your assessment on how these compare to what you're seeing. In general the sensitivity/specificities I see are lower than your report in the paper, but I'm not sure if you're using homozygous reference calls in those calculations as well. I only use variant positions in these comparisons. We're using rtg vcfeval for the comparisons and then calculating the sensitivity/specificity from what is reports. I'm comparing by position only (not requiring specific het/hom ploidy) as I noticed that some of the truth sets different from what the callers report (this directory has the equivalent graphs with strict ploidy: https://github.com/bcbio/bcbio_validations/tree/master/somatic-lowfreq/pisces/check_ploidy), so trying to avoid any bias there.

This is an awesome truth set and I appreciate all your work making it available. I'd love to keep iterating on improving this validation setup so we can improve the callers and representative validation outputs. Thanks again.

tamsen commented 6 years ago

Hi Brad,

Likewise! It would be excellent to have independent validation / comparison of Pisces, and your feedback and results are very welcome. That is a huge amount of work from your team, to put together a validation pipeline like that. And it is so critical to our efforts. Thank you!

I would be very happy to continue to work with you on this, both as we iterate Pisces in future releases, and on validation methods. I would also like to get more data your way, so I will look for some bigger panel data sets. Is your entire analysis pipeline something I can download and install locally? Imo, your accuracy assessment should not in theory be different than ours, but in practice I suppose these things sometimes are. Our analysis was very simple and should be easy to reproduce: Pisces 5.2.5 -> vcf -> hap.py / som.py -> precision/recall. I will try to put the whole thing on docker, for transparency. To try to narrow the gap between our assessments: 1) We used Pisces 5.2.5 (we took latest from all variant callers at the time we wrote the paper) 2) Pisces command lines: -bam {Bam} -CallMNVs false -g {genome folder} -gVCF false -i {interval file} -OutFolder {outfolder} -RMxNFilter 5,9,0.35 The rmxn filer is the only non-default parameter we used. In Pisces 5.2.8 and up, -RMxNFilter 5,9,0.35 has been made default. 3) Pisces writes out one allele per line, instead of "crushing" all co-located alleles into one loci per line. Ie, Pisces vcf output will have multiple lines per loci. 4) Like you, we assessed by position/called allele only and not genotype. How GT gets reported / represented seems to vary a lot between truth and the various varcallers. (Pisces somatic in particular reports 1/1 only if no ref allele exists >1%, and reports 1/. for variants inside deletions, etc) 5) Does your validation pipeline include any scoring for timing? We consider "time to answer" a very valuable metric. We would welcome an independent assessment of speed, as well as accuracy, if you have it.

Thanks again! And I apologize for the time its taken me to get back to you. I am currently living out of a laptop, on an overland journey from Cornwall to Beijing. Bandwidth is going to be occasionally limited until I get back to the office in October. :-)

chapmanb commented 6 years ago

Tamsen; Thanks much for this and for responding while on your long trip. That's an amazing adventure and definitely understood that you might be just slightly busy.

I'd love to coordinate with you on this when you have time and availability. All of the workflows I've run here are available with scripts for downloading and ready to run CWL workflows, using bcbio for the run and rtg for validation comparisons:

https://github.com/bcbio/bcbio_validation_workflows/tree/master/somatic-lowfreq

These use Pisces v5.2.7.47, which I think is the latest, although you mention 5.2.8 above so maybe I'm missing a release? I used the defaults in Pisces for this, so didn't have the RMxNFilter adjustment but otherwise should be similar to what you have. rtg should handle the one allele per line approach, so that works great with me. I'd definitely like to have timing data associated with these as well, but have been focusing on validation/improvement as a first pass and then was going to look at timing tradeoffs.

If you'd like to work directly on this I also have the data runs and outputs available on a GCP project I could share with you, which would avoid needing to install bcbio and the data and workflows.

I'd also love to explore other truth sets and also compared VarDict, Pisces and FreeBayes with some tumor-only 0.5% allele frequency calls from the smcounter2 paper. I'm struggling getting Pisces to call low frequency variants. I've tuned --minvf 0.004 to capture the .5% but am not seeing any of the expected low frequency call:

https://github.com/bcbio/bcbio_validations/tree/master/somatic-lowfreq#low-frequency-umi-tagged-tumor-only-samples

Are there other tuning metrics I can adjust to increase low frequency sensitivity? The missing calls aren't present at all in the VCF (as opposed to present but filtered) so I'm not positive how best to start debugging and tweaking to make the outputs more sensitive. I'd be open to anything to try to help improve Pisces here.

Thanks again for all the help and collaboration on this and hope you're enjoying your trip.

tamsen commented 6 years ago

Hi!

Yes, I'm still travelling. Made it from Cornwall up to Scotland, then through France and now in Switzerland catching my breath. :-)

re GCP project. Having turn-key validation (written by impartial experts) is very welcome! Let me follow up with you on that when I clear my plate a bit.

We are still working on making 5.2.8 available to the outside world. (We need to back out some internal libraries that sneaked in). Everyone's feedback since the pre-print has been very helpful and we are trying to make future releases more friendly to the community.

0.5% frequency should work fine, BUT NOT WITH THE DEFAULTS. We often use Pisces for much lower frequencies, but we have not made it very friendly to do so (we could do better here..). When we use Pisces with very low VF, its usually with UMI bams where we have created read-collapsed bams that have higher (informatically raised) base call qualities. The main thing is that Pisces bases its noise model on the basecall qualities in the bam. By default, Pisces only keeps bases >Q20, so it has confidence in its internal resolution down to about 1% VF . However, if you raise the basecall quality filter, say to Q30, it will trust variant calls down to 0.1% (at which point you hit a limit where the expected noise and the min VF detectable start to merge).

If you bam has enough high-quality bases, set -MinBaseCallQuality 30 . That should do the trick. If this gives problems because not enough bases now pass, (observable as significant drop in coverage in the vcf,) you can kick it back down, and spoof the noise model (NoiseLevelForQModel 30), so it has more confidence in the bases than qscores in bam really suggest.

Pisces also has default frequency limits at 1%, so you also have to lower those. So set your MinimumFrequency 0.0001 (this is the emit filter), and MinVariantFrequencyFilter to 0.001 (or lower, as needed, will remove the "low variant quality" filter from being stamped on those 0.5% variants you are looking for)

If you still can't find anything, we can dig around in the noise by lowering the emit filters further. (MinVariantQScore 20 -> 10) etc. But this might be more fiddling than you want to do...

example commands here - https://github.com/Illumina/Pisces/wiki/Pisces-Quick-Start-5.2.7

related issue here - https://github.com/Illumina/Pisces/issues/12

Thanks again for your continuing support and interest in the project! It is very appreciated.

chapmanb commented 6 years ago

Tamsen; Wow, France and Switzerland are pretty good places to relax. Hope vacation is still going well and thanks for taking the time to help out with these comparisons. Your suggestions were spot on and here is an improved validation on these same 0.5% variants:

https://github.com/bcbio/bcbio_validations/tree/master/somatic-lowfreq#low-frequency-umi-tagged-tumor-only-samples

Pisces performs similar to VarDict on SNPs and on Indels appears to have less sensitivity but more specificity.

I'll continue to work on this dataset, ideally looking at variants that consistently are FPs and FNs across callers. I definitely appreciate all the help and feedback, and happy to try out any other tweaks or suggestions. Thanks again.

tamsen commented 6 years ago

Hi! Glad it helped! And thank you for the helpful results on the 0.5% data! Thanks again for including Pisces in these analysis.

btw, I put up a docker file to reproduce the Pisces side of the Pisces paper analysis up here: https://github.com/Illumina/Pisces/wiki/Pisces-Paper-Analysis

In Berlin now. Will be crossing Asia via Tran-Siberian railroad in a few weeks. I'm hoping to contribute to a more substantial collaboration when I get back to the office in a few months. :-)

best Tamsen

chapmanb commented 6 years ago

Tamsen; Thanks so much for the Docker file and suggestions on filtering parameters. Your trip sounds amazing; everything I know about long distance train travel across Asia is from The Great Railway Bazaar. I don't remember any bioinformatics getting done in that book, so appreciate you being able to check in. Thank you again and have a great trip.