ga4gh / benchmarking-tools

Repository for the GA4GH Benchmarking Team work developing standardized benchmarking methods for germline small variant calls
Apache License 2.0

Comparison around confident region boundaries #11

Closed Lenbok closed 8 years ago

Lenbok commented 8 years ago

@jzook said in #4:

In terms of prefiltering, my one concern with this is where part of a complex variant is excluded by the bed file in one representation and included in another representation. I've found this to be quite common in FPs/FNs, and it can be helped by doing the filtering after comparison, but it gets complicated - maybe this should be a different issue?

And @pkrusche also said:

W.r.t. prefiltering I also agree with @jzook -- the main reasoning for doing the UNK regions after comparison in hap.py is that this allows us to be more accurate at the boundaries of confident regions -- when variants have different representations in the query, they might not necessarily fall into the confident call regions, so filtering before comparison would wrongly give FNs.

This is quite interesting. I've not previously thought a great deal about the confident regions, and had assumed that they were relatively few in number so that these edge effects would be fairly minor. I was surprised to find that the confident regions BED that comes with the Illumina PG 8.0.1 truth set (what I have handy) contains 5.8 million regions (and this for only 4.5 million variants in the corresponding truth VCF).

The confident regions are being used to say: "Within the regions, we are confident that the truth set contains all the variants." The implication is that the genomic sequence outside of the regions could not be confidently determined (and so may be variant or non-variant). What are some of the reasons why regions are excluded?

When doing comparisons near the edges of these regions, some of the cases we need to be careful about (there are more, but let's just worry about these):

a) A call made outside the confident region could not be matched to a truth variant (thus assigned match status UNK). (These may or may not match truth-rejects.)

b) A truth variant can be matched by a call made outside the confident region. (This should be a TP, but pre-filtering calls to the regions would incorrectly turn it into an FN.)

c) A call made inside the confident region could not be matched to a truth variant. (Seems like this should be an FP, but hang on, what if the call is equivalent to a truth-reject -- it should be UNK, right? What if it is "nearly equivalent", such as a zygosity error -- it should also be UNK, since we're not confident in the truth-reject's correctness.)

Looking at the distribution of region lengths, the Illumina PG 8 dataset has over 600k confident regions that are a single base long and another 340k that are only two bases long. Why are they so short? How can you be confident at a single base and yet not be confident at the bases on either side? These locations seem like they would be vulnerable to pre-filtering FN (case b above).

Similarly, this dataset has 3.6 million cases where the gap between consecutive confident regions is only a single base long (and another 400k where the gap is 2 bases long). This suggests the position is being excised due to call inconsistency during truth set construction. These cases seem very vulnerable to generating FPs at comparison time for any caller that doesn't happen to use the same variant representation that went into the PG construction (whereas a caller using the same positional representation would get UNK instead; case c above). This holds regardless of whether region filtering is pre or post comparison, with the exception that PG 8 turns out to have 200k "silver" variants outside the confident regions (which will alleviate things for alternate representations that do match these decoys, but it seems like a drop in the bucket).
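
For reference, here is a minimal sketch of one way to tally those length and gap distributions (assuming a position-sorted, 0-based half-open BED file; the path is just a placeholder):

```python
# Tally confident-region lengths and the gaps between consecutive regions
# from a BED file (0-based, half-open intervals, sorted by chrom + start).
from collections import Counter

length_counts = Counter()
gap_counts = Counter()

prev_chrom, prev_end = None, None
with open("ConfidentRegions.bed") as bed:           # placeholder path
    for line in bed:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        length_counts[end - start] += 1
        if chrom == prev_chrom:
            gap_counts[start - prev_end] += 1
        prev_chrom, prev_end = chrom, end

print("regions of length 1:", length_counts[1])
print("regions of length 2:", length_counts[2])
print("gaps of length 1:   ", gap_counts[1])
print("gaps of length 2:   ", gap_counts[2])
```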

It would be interesting to look further at the magnitude of these effects. Just going by the numbers above, case c should be four times larger than case b. Case b is the easier to evaluate: just compare pre- and post-filtering results, although as mentioned above I'm quite sceptical of those confident regions that are only one or two bases long in the first place. Case c effects are harder to evaluate. It would also be interesting to give truth variants a score which is the distance from the variant to its nearest confident region boundary, and plot a "ROC curve" with this score - I would assume that those with a larger distance are more likely to be TP in a typical query set.
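
That distance score is cheap to compute; a minimal sketch, assuming 1-based VCF POS values and 0-based half-open BED intervals (file names are placeholders):

```python
# For each truth variant inside a confident region, record the distance to
# the closer edge of that region; this can serve as the pseudo-ROC score.
import bisect
from collections import defaultdict

regions = defaultdict(list)                     # chrom -> sorted (start, end)
with open("ConfidentRegions.bed") as bed:       # placeholder path
    for line in bed:
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
for ivs in regions.values():
    ivs.sort()

def edge_distance(chrom, pos1):
    """Distance (bp) from a 1-based position to the nearest boundary of the
    confident region containing it, or None if it is outside all regions."""
    pos0 = pos1 - 1
    ivs = regions.get(chrom, [])
    i = bisect.bisect_right(ivs, (pos0, float("inf"))) - 1
    if i >= 0 and ivs[i][0] <= pos0 < ivs[i][1]:
        start, end = ivs[i]
        return min(pos0 - start, end - 1 - pos0)
    return None

distances = []
with open("truth.vcf") as vcf:                  # placeholder path
    for line in vcf:
        if line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        d = edge_distance(chrom, int(pos))
        if d is not None:
            distances.append(d)

# e.g. how many truth variants sit within 50bp of a region boundary:
print(sum(d < 50 for d in distances), "of", len(distances), "within 50bp of an edge")
```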

pkrusche commented 8 years ago

The reason that there are very short confident blocks in PG is that hom-ref calls and calls with an alt allele are validated in different ways:

- hom-ref calls come from an intersection of hom-ref call regions across the pedigree -- everything that is called 0/0 across the pedigree in at least 2/3 aligners becomes "confident hom-ref"

- variant calls are validated against the pedigree phasing information. If we can phase them without getting any GT conflicts with our inferred haplotypes in the pedigree, they become confident variant calls

The very short confident regions will mostly come from places where we were able to make a confident variant call, but where we failed to call the surrounding region hom-ref, potentially due to low depth in one sample, alignment noise, etc. Since the input into PG comes from unfiltered variant calls, we are likely to see this type of scenario quite frequently. Ideally, we should also validate the reference bases around this variant call (we are working on this).

I think the majority of these cases will be SNPs or small indels which end up with a clear TP/FP/FN classification just based on the genotype. If the genotype is wrong here (0/1 instead of 1/1, ...), then the call should be counted as a kind of FP, not an UNK.

Things get trickier in places where we can only make part of a complex call in a pedigree-consistent fashion, and have surrounding calls as truth-rejects -- here, I think a stronger argument could be made for leaving such query calls as UNK. Maybe one way to address this would be to have "uncertain" regions to flag places where we have rejected truth variants, but weren't quite sure. However, this type of region can make the comparison problem much harder.

RebeccaTruty commented 8 years ago

Aren't these not-high-confidence regions exactly what we define as a no-call? And in the sense that @bioinformed is treating TRUTH and QUERY as just VCF1 and VCF2, either of which can contain no-calls, he already has some thoughts on how to address this. Our comparison methods also have some definitions for this -- no-call-reference and no-call-variant. These are focused on QUERY no-calls, so perhaps what we should do is extend the comparison method definitions to include TRUTH no-calls. CGI tools have the concept of "consistent" calls, meaning that, due to no-calls, a variant call isn't explicitly concordant or discordant with the truth, but it's possible. So, referring back to #4, we would have missing > loose match > consistent > almatch > gtmatch

I'm not sure if consistent goes before or after loose, and I'd need to think through the other TP/FP/FN/no-call definitions (and define some new summary metrics, e.g. a consistent percentage?). Thoughts?

jzook commented 8 years ago

My approach for the NISTv2.19 high-confidence calls was to exclude 50bp on either side of variants that could not be confidently called (e.g., because our methods couldn't determine the reason for discordant genotypes/variants between datasets). This generally works, but there are a small number of cases where excluding 50bp isn't sufficient, so we end up calling only part of a complex variant. This sometimes happens when part of a complex variant is within 50bp and part is outside.
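
To illustrate the exclusion step, here is a sketch of the idea on a single chromosome (a toy helper, not the actual NIST pipeline; intervals are 0-based half-open, and in practice the equivalent can be done with bedtools):

```python
PAD = 50  # bases excluded on either side of each uncertain variant

def exclude_padded(confident, uncertain_sites):
    """confident: sorted (start, end) regions on one chromosome;
    uncertain_sites: sorted 0-based positions of uncertain variants.
    Returns the confident regions with pos +/- PAD windows removed."""
    windows = [(max(0, p - PAD), p + PAD + 1) for p in uncertain_sites]
    out = []
    for start, end in confident:
        for w_start, w_end in windows:
            if w_end <= start or w_start >= end:
                continue                      # window does not touch this region
            if w_start > start:
                out.append((start, w_start))  # keep the piece before the window
            start = w_end                     # resume after the window
            if start >= end:
                break
        if start < end:
            out.append((start, end))
    return out

# e.g. exclude_padded([(0, 1000)], [100]) -> [(0, 50), (151, 1000)]
```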

My current thinking for our next version of calls is that we should still exclude 50bp on either side of uncertain variants from our bed file. However, we should keep the high-confidence calls that are within these 50bp regions while doing variant comparisons, and only exclude FP/FN calls in these regions after doing the comparisons. I think this will avoid most of the edge cases, and I suspect Platinum Genomes calls would benefit from doing something similar.

The one issue with only excluding calls outside the bed files after comparison is targeted sequencing, where the run time is likely to be much longer if calls are excluded afterwards. If the run time is too long, we could potentially just keep variants near the targeted regions.

Rebecca, I think you made some good points, but I didn't see them until after writing this, so I'll have to think about your ideas.

RebeccaTruty commented 8 years ago

I just had a quick look at the PG 8.0.1 ConfidentRegions and tried to apply something like Justin's approach (although I didn't account for where variants are with respect to the confident regions). I removed any regions that were less than 100bp and for longer regions trimmed both ends by 50bp. This reduced the total number of covered bases in the ConfidentRegions from 2.76B to 2.47B, about a 10% reduction. It also reduced the number of regions from 5.8M to 2.1M, confirming @Lenbok's observation that there are many small regions. So, assuming that many of these small regions are close to variants, you could potentially be losing a lot of information with this sort of method. That is maybe something we can live with, but I'd love a better solution.
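
For reference, a rough sketch of that trimming step (placeholder file names; standard 0-based half-open BED intervals):

```python
# Drop confident regions shorter than 100bp and trim 50bp off each end of
# the rest, then report how many regions/bases survive.
MIN_LEN, TRIM = 100, 50

kept_regions = kept_bases = 0
with open("ConfidentRegions.bed") as bed, open("trimmed.bed", "w") as out:
    for line in bed:
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if end - start < MIN_LEN:
            continue                                  # drop short regions
        start, end = start + TRIM, end - TRIM
        if end > start:                               # 100bp regions trim to nothing
            out.write(f"{chrom}\t{start}\t{end}\n")
            kept_regions += 1
            kept_bases += end - start

print(kept_regions, "regions and", kept_bases, "bases remain")
```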

jzook commented 8 years ago

I think you're right, Rebecca, that an approach like the one Kevin has discussed could definitely help with this issue, since we do lose high-confidence variants using my approach. My understanding is that Kevin is still working on how to handle uncertain variants in the comparison process, and I don't think hap.py or vcfeval can handle them at this point, though it would be good for them to confirm. vcfeval does generate ROC curves, so I wonder if there might be a way to adapt this to accept 2 levels of confidence in the benchmark calls as well. I haven't really thought through this sufficiently to know how it would work. In some ways, I still think it might be better to exclude benchmark variants that are close to low-confidence regions or variants, because they may not really be as high-confidence as we thought (e.g., a variant might actually be at the breakpoint of an SV or inside a duplicated region). It depends on why the variant or region is low confidence, though, and I'm sure there are some cases where a high-confidence variant could be near a low-confidence variant or region, and we might sometimes want to include these in benchmarking.

@RebeccaTruty, did you look at how many variants were lost in addition to how many regions were lost from PG 8.0.1?

RebeccaTruty commented 8 years ago

Looks like 864079/4346824 = 20% of variants are within 50bp of the edge of a confident region.

pkrusche commented 8 years ago

Perhaps one thing to point out here is that most of these variants are not necessarily uncertain. The reason that they don't have surrounding hom-ref regions can simply be that the site has low depth in even one sample across the Platinum Genomes pedigree (i.e. the calls are probably fine in most of the other samples). This means that the call (and also the surrounding +/- N bases) is probably a confident/good variant call. We are working on an automated procedure to improve the callset by adding more confident regions around such variants in the samples where we are certain that these are good calls (which should probably cover a very large part of these cases).

In some way, this is also why I think that breaking out genotyping FPs into FPs where the genotype is wrong ("overcalls" / "undercalls") and FPs where we make a call in a place that should be hom-ref is important at least in the case of Platinum Genomes as a truth set: the way that genotypes for variant calls are validated is different from the way that we validate homref regions.

Of course, being able to incorporate the remaining cases as uncertain variants would be nice, but I am not sure this is necessary in a first version of the comparison tool standard.

jzook commented 8 years ago

Yeah, I think the decision of how to handle these comes down partly to how conservative you want to be in developing high-confidence calls for benchmarking, and I think we can leave this decision to the benchmark call set developers.

Do we all agree that a good initial way to address this problem is to apply the bed files after the vcf comparison rather than before, as Peter depicts in the benchmarking tool architecture diagram (https://github.com/ga4gh/benchmarking-tools/tree/master/doc/ref-impl)? If so, I propose adding an explanatory note below the diagram like this: Note that "Confident Call Regions" from Truth and/or Query are applied only after the VCF comparison, to enable better comparison of variants whose different representations can place them either inside or outside the confident call regions. In contrast, if only variants inside the confident call regions are included during the VCF comparison, then some TP calls are incorrectly classified as FPs and/or FNs.
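
To make that note concrete, here is an illustrative sketch of the post-comparison decision (hypothetical per-call fields, not hap.py's or vcfeval's actual logic):

```python
def finalize_status(match_status, inside_confident):
    """match_status: 'TP', 'FP' or 'FN' from the representation-aware VCF
    comparison; inside_confident: whether the call/variant lies within the
    confident call regions. Returns the benchmarking category to report."""
    if match_status == "TP":
        return "TP"          # matched calls count regardless of region
    if not inside_confident:
        return "UNK"         # unmatched call in non-confident territory
    return match_status      # FPs/FNs inside the confident regions stand
```

Pre-filtering to the regions before comparison would instead drop calls whose representation places them just outside the regions, which is what turns some TPs into FPs/FNs.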

Does this sound good to everyone?

pkrusche commented 8 years ago

+1 sounds good to me.

pkrusche commented 8 years ago

I made a pull request to include the note above.

Following our discussion yesterday, another thing to add would be a section / md file / ... on our truth set findings from above.

We could create a separate subfolder where we also provide download links, a Markdown description, and maybe some machine-readable (TSV?) file which gives download links for each of the truth set files? I can make a template for Platinum Genomes. This would address #10 also.

jzook commented 8 years ago

If you could do that for Platinum Genomes, I can add similar information for GIAB. Thanks!
