bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 353 forks source link

RFC: allele fraction thresholds for paired analyses #314

Closed lbeltrame closed 5 years ago

lbeltrame commented 10 years ago

MuTect and VarScan has a threshold setting (--tumor_f_pretest) to select sites with at least a certain fraction of non-REF alleles. Something similar is in VarScan (minimum frequency to call an allele as heterozygote). MuTect has no preset, VarScan has 0.1 by default.

I'm wondering if (hence the RFC) this could be handled in the algorithm parameters, or at least harmonized between the two callers. Selecting a proper "frequency" (quotes, because you can't really call it frequency when you have just a sample pair) is important for validation.

Opinions? Pro, contra?

mjafin commented 10 years ago

Hi Luca, A bit of an aside, but would this be the delta in allele frequencies between tumor/normal (or sensitive/resistant or what have you) before accepting something as being there?

I wonder if there are any best practices default setting for this value and if not, would imposing the user to this setting be hazardous?

lbeltrame commented 10 years ago

A bit of an aside, but would this be the delta in allele frequencies between tumor/normal (or sensitive/resistant or what have you) before accepting

No, more like a minimal fraction of ALT reads vs REF reads in the tumor (MuTect) or in general (VarScan).

I wonder if there are any best practices default setting for this value and if not, would imposing the user to this setting be hazardous?

This was the reason I put it up as RFC. I'm not aware of best practices (but I admit I didn't look in depth), and I'd rather prefer best practices to allow users to kill their analyses with their own hands. However, if no best practices exist...

lbeltrame commented 10 years ago

Another reason I can think of is that very low fractions can lead to different results between runs (as in sequencer runs, not analysis runs), if coverage changes (that is what I've been seeing on my data today and the reason for this issue).

chapmanb commented 10 years ago

Luca and Miika; Thanks for the discussion. This is something we've been talking about offline. From what I hear the general assumption is that low frequency (< 20% mutations) are still considered the wild west and no one knows how to accurately assess these. This is where I think we need evaluate materials to make any kind of accurate assessment.

In terms of coverage, the paper by Meynert et al (http://www.ncbi.nlm.nih.gov/pubmed/23773188) is a useful guide. They suggest 13x coverage to detect heterozygotes in germline calling. So you're going to need much higher coverage to handle lower frequency detection in cancer.

My thoughts on handling this would be to have relatively low defaults on frequencies accepted for the individual callers and then have a shared post-calling filtration step that soft filters calls based on user specified depth and frequency. What do you think about that approach?

Thanks again for the discussion on this.

lbeltrame commented 10 years ago

offline. From what I hear the general assumption is that low frequency (< 20% mutations) are still considered the wild west and no one knows how to

Interesting: so the confusion that arises here when we talk about the variant calling results isn't unique. ;)

My thoughts on handling this would be to have relatively low defaults on

If that's the case, do we keep the VarScan default (0.1) or do we lower it?

post-calling filtration step that soft filters calls based on user specified depth and frequency. What do you think about that approach?

+1 on this. The only thing to watch out for is how such "frequency" is defined: MuTect uses FA (for allelic fraction) while VarScan uses FREQ.

chapmanb commented 10 years ago

Luca; Good point about the frequency representation in the VCFs. A useful post-calling step would be to normalize VCFs from multiple callers to have the same representation for these so that the user-configurable filter step could use one expression like (DP < 13 & FREQ < 0.6) || (DP < 20 && FREQ < 0.2) and have it work across multiple callers.

Miika, what's the minimum frequency you would like to report? My guess would be something like 5% as the minimum baseline to start. If we can get anywhere near good calling with that we'd be doing awesome.

lbeltrame commented 10 years ago

With regards to normalizing VCF, is there a good VCF library that would allow such a thing? I think neither PyVCF or cyvcf allow to create new FORMAT fields.

chapmanb commented 10 years ago

Luca; I don't know of a Python library beyond PyVCF. I normally use the Broad Picard variant libraries for fine grained manipulation. These are Java so I access them through Clojure within bcbio.variation and could supply a command line target within bcbio.variation to do this. Alternatively we could investigate what needs to happen in PyVCF to make it possible from within Python.

lbeltrame commented 10 years ago

There's an outstanding pull request for PyVCF (https://github.com/jamescasbon/PyVCF/pull/136) for FORMAT fields but it's not being merged as it lacks tests (and I don't think it'll be merged quickly). I don't mind using bcbio.variation, my question was more because if I wanted to implement this myself, I needed something in Python. ;)

If all else fails bcbio.variation could be used as well.

See also jamescasbon/PyVCF#82

mjafin commented 10 years ago

Hi guys, apologies about the radio silence, I'll be travelling until Friday. Regarding minimum AF to report, I'd say it would depend on depth and the requests from the scientists - sometimes 1% is being asked for (i.e. I don't know the answer to your question Brad!)

lbeltrame commented 10 years ago

I plan on doing some tests with results I have found here soon, hopefully: the wet lab people will be testing mutations I have found with varying fraction values (from 90% to 1%).

tanglingfung commented 10 years ago

it's possible to do the test with horizon multiplex panel: http://www.horizondx.com/what-we-offer/multiplex-reference-standards/multi-gene-multiplex, we have one data set with mutations from 24% to 0.9%, and I can test it out.

but the sensitivity/precision of mutation detection is also dependent to the sequencing depth, which may not be represented in a vcf file

mjafin commented 10 years ago

A few further things to consider: currently FreeBayes paired variant calling requires that no observations in the germline support the somatic alternate (the -s option.) However, in some studies, such as comparing sensitive and resistant samples, one might be interested in a significant change in allele frequencies rather than requiring there to be no strictly no observations in the germline supporting the somatic alternate. What are your thoughts on not using -s and doing post filtering instead?

Edit. I can see that vcfsamplediff only adds a tag (somatic or germline), so there's no hard filtering, right?

lbeltrame commented 10 years ago

significant change in allele frequencies rather than requiring there to be no strictly no observations in the germline supporting the somatic

Good observation. Given that I'm mostly working on tumor resistance to pharmacological treatment (but I haven't incorporated freebayes yet), I would suggest myself dropping the -s option. I don't think VarScan and MuTect are as strict as Freebayes on this regard anyway.

On why I put this in: this was the command suggested in the BioStars thread (see my PR).

Brad, do you agree on dropping -s and moving whatever necessary to the post- calling step?

chapmanb commented 10 years ago

I agree, makes good sense. I'm also working on putting together a test dataset for evaluating a range of frequencies so we can start pushing the limits of these callers and be able to assess how well we're doing. Looking forward to actually having evaluation numbers for this soon.

mjafin commented 10 years ago

I did some testing with FreeBayes on files that have the following allele frequencies for a SNP: N1: 0% T1_1: ~5% T1_2: ~12% N2: ~75% T2: ~98%

Using the default settings, no SNP is detected by FreeBayes for N1/T1_1 or N1/T1_2. If I change --min-alternate-fraction to 0.05, I'll see the SNP in N1/T1_2. If I further drop the --pvar 0.7 option, the SNP in N1/T1_1 pops up, but its genotype is always 0/0. It says in FreeBayes help for --pvar that Note that post-filtering is generally recommended over the use of this parameter

For N2/T2 the genotypes are 0/1 for N2 and 1/1 for T2. Without the -s flag the SNP in T2 is classified somatic. With -s no tag is printed out (the why is clear from the source code here: https://github.com/ekg/vcflib/blob/master/src/vcfsamplediff.cpp#L173 i.e. in strict mode nothing is printed out for this scenario - I wonder if it's fully intentional)

Further, loh is never printed out as it's been commented out in the code.

lbeltrame commented 10 years ago

N1/T1_2. If I change --min-alternate-fraction to 0.05, I'll see the SNP in N1/T1_2. If I further drop the --pvar 0.7 option, the SNP

Thanks for the initial analysis, Miika. I would argue that we should definitely drop -s. At least for my use case it's too strict.

help for --pvar that Note that post-filtering is generally recommended over the use of this parameter

Brad, should we also drop --pvar?

in strict mode nothing is printed out for this scenario - I wonder if it's fully intentional)

That explains the strange results I was getting (I didn't look at the code).

Brad, if you agree with dropping any (or either) of the options, I'll prepare a PR for that.

chapmanb commented 10 years ago

Luca and Miika; I agree. The main reason for --pvar 0.7 is otherwise FreeBayes would report variants with extremely low quality scores (< 1) that would cause GATK to choke. In the germline calling, I replaced this with a call to VCF filter that sets a very low minimum quality score:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/freebayes.py#L77

That should be a good approach here as well. Setting a lower --min-alternate-fraction makes good sense as well. Thanks much for all the testing and discussion.

lbeltrame commented 10 years ago

I've been busier than expected, but I plan to submit a patch to remove "-s" soon, at least to make it less strict. I'll look into adopting the same filter for quality scores for somatic calls.

mjafin commented 10 years ago

Hi guys, Just to bump this up again, would it make sense to expose a setting for minimum allele frequency threshold that individual tools, which support it, could implement? We have currently some ultra high depth panel data where we'd like to take a look at sub 1% variants, so I doubt there will ever be a consensus on what the "correct" minimum AF is, don't you agree?

chapmanb commented 10 years ago

Miika; Apologies this is lagging, I haven't had time to put together the evaluation set yet. In the meantime I'm agreed it would make sense to have this configurable with something like min_allelle_freq in the algorithm section. We could default it to whatever we decide is a useful limit of detection for cancer samples but you're right that it's going to need to be configurable anyhow so no harm in putting it in place first so you can use it. I can have a go at this if you haven't already started on it. Thanks much.

lbeltrame commented 10 years ago

I agree with what Brad said. I'm aware that VarScan2 allows changing the minimum fraction threshold, but I'm not sure about the equivalent MuTect option.

mjafin commented 10 years ago

Hi Brad and Luca, Excellent, sounds good! I haven't started on this as I wanted your OK on this before doing anything. If it's not a big deal for you to add the basic infra, I can then further implement the changes in FreeBayes and SID (mutect doesn't use this AFAIK). Maybe VarScan too

tanglingfung commented 10 years ago

@mjafin @chapmanb , that's very nice to have. is it ok to make --min-base-quality (in freebayes) configurable as well?

chapmanb commented 10 years ago

There's a nice paper out in BMC Genomics that looks at sensitivity/specificity using mixture of NA12878 and the Genome in a Bottle Reference. They evaluate MuTect, SomaticSniper, Strelka and VarScan2 at 8%, 16%, 36% and 50% mixes:

http://www.biomedcentral.com/1471-2164/15/244

Strelka and MuTect performed best but sensitivity at low ratios is not good for any tool. We'll definitely have to have different panel-specific and whole-genome-like references based on their evaluation.

chapmanb commented 10 years ago

Miika and Luca -- I added a min_allele_fraction argument we can use to tune minimum detection in tumor/normal callers. I also added a tumor_config attribute from the namedtuple out of vcfutils.get_paired_bams so we have a consistent place to get the setting.

I updated this to be used in Miika's latest changes but not in general. Once we're happy with support across the callers we can close this issue.

Paul -- I'd really like to avoid having caller specific arguments if we can help it. If you describe what you'd like to accomplish maybe we can think of a more general way to tackle the problem.

Thanks again all for the work and suggestions.

tanglingfung commented 10 years ago

Thanks Brad. No problem. Mine is quite specific to my data.

tanglingfung commented 10 years ago

what's the (default) min_allele_fraction of the standard (or general) tool now? is it also 20%?

lbeltrame commented 10 years ago

In VarScan, it is 10% (0.10). In MuTect I don't know, but I think it goes down to 1% (0.01) with default settings (at least by looking at the VCFs it produces).

chapmanb commented 10 years ago

Paul -- for custom flags, you can use a resources -> freebayes -> options setting in your sample.yaml input. Here's an example in the tests for novoalign of passing a custom flag: https://github.com/chapmanb/bcbio-nextgen/blob/master/tests/data/automated/run_info-bam.yaml#L38

Luca and Miika -- I'm happy with whatever allele fraction you think it makes sense to start with as a default. I arbitrarily picked 20% but if you're working on normalizing this across callers we should choose something that nearly matches defaults. As we get some ability to validate these we can also adjust to one with a reasonable FDR.

mjafin commented 10 years ago

Thanks for the fixes Brad! I noticed my own copy-paste typo while browsing through your changes https://github.com/chapmanb/bcbio-nextgen/pull/390

The Qiagen paper is quite nice but they seem to only have called SNPs, right? Heterogeneity in tumor samples may lead to weird events like overlapping indels that may present problems for different callers

lbeltrame commented 10 years ago

sense to start with as a default. I arbitrarily picked 20% but if you're

Roughly, 20% is the minimum that can be detected by Sanger sequencing: TaqMan assays go down to 10% and other methods (pyrosequencing, digital PCR) down to 5-1%. IMO 20% is too high of a fraction as default.

mjafin commented 10 years ago

Btw. just looking at MuTect output with a SNP FA reported at 1.015e-03 so don't think MuTect has any hard lower limit.

tanglingfung commented 10 years ago

thanks Brad for the tricks, it's good enough.

freebayes has a default of 0.2 for --min-alternate-fraction when ploidy is set to 2. I think 0.2 is picked, originally, for germline mutation detection.

chapmanb commented 10 years ago

Luca -- I'm happy to go with whatever y'all think is best for the default value. I only picked 20 because that's the FreeBayes default, which is no reason at all. You and Miika both have more experience so feel free to swap it over to whatever you believe is sensible.

lbeltrame commented 10 years ago

I'm still struggling to equate fraction to "can this be validated?" status. ;) Where is the default set? Also I'm thinking that perhaps it should be a generalized option since VarScan2 can handle that as well (Miika, does MuTect have an option for minimum FA to consider? I can't check at the moment)

mjafin commented 10 years ago

Hi Luca, The default isn't currently set in a central location but rather in each of the tool wrappers. In the wrappers it's checked if the argument was given in yaml and if not, a default of 10% is set. This could be centralised I guess.

MuTect doesn't han an option for minimum AF and I've seen it report sub-1% variants (although they were reported as REJECT). The SomaticIndelDetector, which currently only works for Appistry users, has a lower limit of AF, below which it hard filters InDels out.

lbeltrame commented 10 years ago

I'm guessing this paper might be helpful for further discussion: http://www.biomedcentral.com/1471-2164/15/244

chapmanb commented 10 years ago

Luca; Thanks for linking that here. The artificial mixture standard is a nice approach. My main takeaway was that < 18% detection is still extremely difficult, independent of caller. I don't understand why Figure 3 is not stratified by allele frequency like Figure 2 but the sensitivity at low frequencies was 50% or worse.

Strelka might be worth investigating. The main downside of Strelka is practical -- it has a lot of moving parts and will be hard to install. I know you looked at this earlier and found the same result. Good to know MuTect is performing well so maybe it is worth focusing on it plus Scalpel in the short term.

lbeltrame commented 10 years ago

FYI, fractions lower than 10% can be successfully validated using droplet digital PCR (case in point: 3% which we've done internally) but it's extremely hard.

Up to now I've been only excluding everything with a fraction < 1% and with arbitrary depth thresholds depending on the sample.

We've also selected another panel of mutations to test from both MuTect and VarScan 2, some with varying fractions (from 50% to 5%). I'll keep you guys posted once the results are out.

lbeltrame commented 10 years ago

Bumping because the results are in! Disclaimer: alll of these were done with droplet digital PCR and on targeted resequencing (~ 50 genes):

In all cases we checked the variant also against negative controls (matched normal and independent negative control).

My opinion is that, at least in this limited case, even low fractions (3% or so) from callers. in particular MuTect (we have a very high validation rate from this one) are possibly true positives. Whether they are significant for the biological issue at hand it's another matter entirely. ;)

mjafin commented 10 years ago

Thanks Luca that's really interesting and useful to hear. How did varscan compare to mutect in terms of false positives and false negatives?

lbeltrame commented 10 years ago

That's what we're testing now. ;) We used mostly MuTect due to precision, but there are a few assays against VarScan 2 calls. Unfortunately the one tested yesterday has possibly some issues by itself so people are redoing the run - I'll keep you posted.

chapmanb commented 10 years ago

Luca; Thanks much, this is great work. It sounds like useful default behavior for tumor/normal panels would be to calculate average depth, based on the work @kern3020 is doing with Chanjo, and then use this to adjust default minimum detection percentage. We can always allow folks to override it, but this could help get a reasonable detection default based on depth. Thanks again.

audyavar commented 5 years ago

Is varscan2 part of BCBIO1.1.6? I don't see it in the configuration options documentation on bcbio readthedocs yet.

chapmanb commented 5 years ago

It is, if you add varscan to your variantcaller specification it will use that. We emphasize using other somatic callers (VarDict, MuTect2, strelka2, octopus) which are actively developed and we feel do a better job, but it's still available to use. Hope this helps.

audyavar commented 5 years ago

Thanks Brad. I typically use a combination of 3 callers - one of which is varscan. When bcbio runs, its says Varscan and not varscan 2, so was wondering which version it was.

Best

Akshata


From: Brad Chapman notifications@github.com Sent: Friday, May 3, 2019 2:38 AM To: bcbio/bcbio-nextgen Cc: Akshata Udyavar; Comment Subject: Re: [bcbio/bcbio-nextgen] RFC: allele fraction thresholds for paired analyses (#314)

It is, if you add varscan to your variantcaller specification it will use that. We emphasize using other somatic callers (VarDict, MuTect2, strelka2, octopus) which are actively developed and we feel do a better job, but it's still available to use. Hope this helps.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/bcbio/bcbio-nextgen/issues/314#issuecomment-489032607, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AASBJUJFODYVCVCJKWVLZT3PTQBY7ANCNFSM4AMK7ZIQ.

chapmanb commented 5 years ago

Akshata; That's right on, the original varscan isn't really used so we didn't need naming to differentiate them. If you want to check the version of any tool installed with bcbio you can use the conda list functionality:

$ bcbio_conda list | grep varscan
varscan                   2.4.3                         2    bioconda
roryk commented 5 years ago

Looks like this is some old discussion that got bumped with a question about varscan, which got answered. Closing this, thanks everyone!