broadinstitute / palantir-workflows

Utility workflows for the DSP hydro.gen team (formerly palantir)
BSD 3-Clause "New" or "Revised" License
19 stars 8 forks source link

SimpleBenchmark -- a new WDL for benchmarking VCFs #115

Closed rickymagner closed 1 year ago

rickymagner commented 2 years ago

Simple Benchmark

This PR introduces a rewrite of the BenchmarkVCFs WDL used for comparing a VCF (callset) to a baseline (base) VCF. The main goals for this project were the following:

Details on the changes in design, usage instructions, and more can be found in the full README files included in this PR. I would recommend looking through that now if you're interested in learning more about how the proposed WDL works. After reading, if you're more curious about comparing with the old BenchmarkVCFs WDL from a developers perspective, keep reading for some more details.


FAQ

In order to address some design choices that might seem a bit strange at first, I'll include this FAQ here for current and future readers to help design & debugging in the future. (Perhaps I'll be the only one who "frequently asked" myself these questions when looking back at my old code, and had to keep reminding myself why I did things this way as opposed to something seemingly simpler...)

Q: Why is there a weird NONE_FILE input into the WDL that isn't supposed to be overwritten?

A: This is used in a hack to create a null File type to continue supporting the use of having an empty interval list in the main scatter to compute stats over the "Whole Genome". This is a workaround because WDL 1.0 doesn't handle null data very well.

Q: Why use gatk SelectVariants for the interval subsetting tool when bcftools is used later? Shouldn't the latter work just fine? And can't vcfeval itself allow subsetting?

A: There are a few reasons why I chose the former over the other options:

  1. SelectVariants allows you to use a variety of different types of interval files, including Picard interval_lists, which isn't supported by bcftools.
  2. SelectVariants allows you to toggle padding on your interval lists. When the VCFs are subset, it's possible that variants right on the boundary get excised which can affect downstream statistics if your intervals are dense but short. Exposing this as an option to the WDL allows users to tweak this as appropriate.
  3. Although vcfeval allows for subsetting via bed files (and not general interval_lists as well) to generate the desired ROC statistics, this option won't output any counts of FNs. This might be desirable for the user worried of large numbers of variants on the boundaries, in which case any stats involving FNs can be ignored, but the current paradigm allows you to at least get all the numbers after restricting to consider.

Q: Why use bcftools to subset variant types by expression? Doesn't vcfeval allow for this, and SelectVariants as well?

A: The other tools also allow for subsetting via an expressive language. In particular, vcfeval has the --roc-expr flag, which seems perfect for the task of computing granular stats over multiple subsets, and allows for multiple instances of the flag. Unfortunately, the same problem as above arises in not counting FNs. (It also won't intersect the combination of expressions and regions to subset over, rather outputting a file for each different option.) For SelectVariants the problem is worse -- you can't subset by variant type until everything has been labeled by vcfeval! Otherwise you might filter out some variants that help resolve identical haplotypes, leading to incorrect FNs/FPs. It could have been run after to do the filtering, but bcftools stats is very fast and computes most of the relevant stats you would want to get out of the VCFs produced by vcfeval.

Q: Why isn't there a CrosscheckFingerprints step like in the old WDL?

A: Although I agree it can be dangerous to run without this step, I'm not sure it belongs to this particular WDL. Currently, both BenchmarkVCFs and FindSamplesAndBenchmark run the tool. The former uses it as a safety net, whereas the latter uses it to match a bunch of samples across two lists. I think if the analogue to FSAB wrapper of this WDL gets developed, that should run the tool and be responsible for both steps at once. Of course, adding an optional task to run in this WDL wouldn't hurt if someone wants to add it. At the moment I wanted this tool to be closer to BenchmarkVCFs than to FSAB, but would imagine a wrapper that allows for a list of samples on either side should be developed and replace this as the de facto benchmarking workflow.


More Things to Do

There are a few more things that might be useful to finish off at some point in the future. These include: