This PR introduces a rewrite of the BenchmarkVCFs WDL used for comparing a VCF (callset) to a baseline (base) VCF. The main goals for this project were the following:
Speed up the runtime in Terra. Typical runs of BenchmarkVCFs could take anywhere from 1 to 6 hours, whereas the core comparison done by vcfeval takes a small fraction of that time. The original impetus for this rewrite was to pare the workflow down to the tasks strictly required to automate the comparison, without sacrificing too much functionality.
Fix bugs and add extra functionality. There was a bug in the previous code where arbitrary selection via JEXL selectors did not work as advertised. This should be corrected now, using bcftools selectors instead.
Expose more granular data at the top level. The comparison engine vcfeval can produce many in-depth statistics that are useful in benchmarking analysis. Previously, accessing some of this data required either running the tool manually yourself or digging through various Terra directories.
Add visualization tools for processing the outputs. This tool now produces a lot of data, and organizing it all into convenient plots takes some careful thought. Although the raw data should always be available to users for their own pipelines, there is now a BenchmarkBoard visualizer tool, which can be run by installing a few Python packages and running a script in the same directory as the output files from the WDL. It comes in two flavors: single-sample and multi-sample versions. Both produce similar plots. More details can be found in their documentation.
Details on the design changes, usage instructions, and more can be found in the full README files included in this PR. I would recommend looking through those now if you're interested in learning more about how the proposed WDL works. After reading, if you're curious about how it compares with the old BenchmarkVCFs WDL from a developer's perspective, keep reading for some more details.
FAQ
To address some design choices that might seem a bit strange at first, I'll include this FAQ here for current and future readers to help with design and debugging down the line. (Perhaps I'll be the only one who "frequently asked" myself these questions when looking back at my old code, and had to keep reminding myself why I did things this way as opposed to something seemingly simpler...)
Q: Why is there a weird NONE_FILE input into the WDL that isn't supposed to be overwritten?
A: This is a hack to create a null File value so we can keep supporting an empty interval entry in the main scatter, which is used to compute stats over the "Whole Genome". It's a workaround because WDL 1.0 doesn't handle null data very well.
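For illustration, here is a minimal sketch of the pattern (the workflow and task names below are hypothetical, not the actual code in this PR): an undefined optional File input stands in for null, and each scatter shard checks whether an interval list was actually provided.

```wdl
version 1.0

workflow NullFileSketch {
    input {
        # Leave this unset; an undefined File? acts as a stand-in for null.
        File? NONE_FILE
        Array[File] strat_intervals = []
    }

    # Prepend the null entry so the first scatter shard runs with no interval
    # list, i.e. it computes stats over the "Whole Genome".
    Array[File?] all_intervals = flatten([[NONE_FILE], strat_intervals])

    scatter (intervals in all_intervals) {
        call CountVariants { input: intervals = intervals }
    }

    output {
        Array[String] commands = CountVariants.cmd
    }
}

task CountVariants {
    input {
        File? intervals
    }

    command <<<
        # Only pass -L when an interval list was actually provided.
        echo "would run: gatk CountVariants ~{if defined(intervals) then "-L " + select_first([intervals]) else ""}"
    >>>

    runtime {
        docker: "ubuntu:20.04"
    }

    output {
        String cmd = read_string(stdout())
    }
}
```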
Q: Why use gatk SelectVariants for the interval subsetting tool when bcftools is used later? Shouldn't the latter work just fine? And can't vcfeval itself allow subsetting?
A: There are a few reasons why I chose the former over the other options:
SelectVariants allows you to use a variety of different interval file types, including Picard interval_lists, which aren't supported by bcftools.
SelectVariants allows you to add padding to your interval lists. When the VCFs are subset, variants right on the boundary can get excised, which can affect downstream statistics if your intervals are dense but short. Exposing this as an option in the WDL lets users tweak it as appropriate (a sketch of such a subsetting task follows this list).
Although vcfeval allows for subsetting via BED files (but not general interval_lists) when generating the desired ROC statistics, that option won't output any counts of FNs. This might actually be desirable for a user worried about large numbers of variants on the boundaries, in which case any stats involving FNs can be ignored, but the current paradigm at least gives you all the counts after restricting to the regions of interest.
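For concreteness, a subsetting task along these lines might look roughly like the following sketch (the input names, docker tag, and output naming are assumptions, not the actual task in this PR):

```wdl
version 1.0

task SubsetVcfToIntervals {
    input {
        File vcf
        File vcf_index
        File ref_fasta
        File ref_fasta_index
        File ref_dict
        File intervals        # BED or Picard interval_list; both work with -L
        Int padding = 0       # interval padding exposed to the user
        String output_name = "subset"
    }

    command <<<
        gatk SelectVariants \
            -R ~{ref_fasta} \
            -V ~{vcf} \
            -L ~{intervals} \
            --interval-padding ~{padding} \
            -O ~{output_name}.vcf.gz
    >>>

    runtime {
        docker: "broadinstitute/gatk:4.2.6.1"
    }

    output {
        File subset_vcf = "~{output_name}.vcf.gz"
        File subset_vcf_index = "~{output_name}.vcf.gz.tbi"
    }
}
```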
Q: Why use bcftools to subset variant types by expression? Doesn't vcfeval allow for this, and SelectVariants as well?
A: The other tools do allow subsetting via an expression language. In particular, vcfeval has the --roc-expr flag, which seems perfect for computing granular stats over multiple subsets and can be given multiple times. Unfortunately, the same problem as above arises: it doesn't count FNs. (It also won't intersect the combination of expressions and regions to subset over, instead outputting a separate file for each option.) For SelectVariants the problem is worse: you can't subset by variant type until everything has been labeled by vcfeval, otherwise you might filter out variants that help resolve identical haplotypes, leading to incorrect FNs/FPs. SelectVariants could have been run afterward to do the filtering, but bcftools stats is very fast and computes most of the relevant stats you would want from the VCFs produced by vcfeval (see the sketch below).
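As a rough sketch of this step (task and input names are hypothetical, and the docker tag is an assumption): subset one of the labeled VCFs produced by vcfeval with a bcftools expression, then summarize it with bcftools stats.

```wdl
version 1.0

task SubsetAndSummarize {
    input {
        File labeled_vcf        # e.g. one of the VCFs output by vcfeval
        File labeled_vcf_index
        String expression = 'TYPE="snp"'   # any bcftools -i expression
        String label = "snp"
    }

    command <<<
        set -e
        # Keep only records matching the expression, then compute summary stats.
        bcftools view -i '~{expression}' ~{labeled_vcf} -O z -o ~{label}.vcf.gz
        bcftools index -t ~{label}.vcf.gz
        bcftools stats ~{label}.vcf.gz > ~{label}.stats.txt
    >>>

    runtime {
        docker: "staphb/bcftools:1.16"
    }

    output {
        File subset_vcf = "~{label}.vcf.gz"
        File subset_vcf_index = "~{label}.vcf.gz.tbi"
        File stats = "~{label}.stats.txt"
    }
}
```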
Q: Why isn't there a CrosscheckFingerprints step like in the old WDL?
A: Although I agree it can be dangerous to run without this step, I'm not sure it belongs in this particular WDL. Currently, both BenchmarkVCFs and FindSamplesAndBenchmark run the tool. The former uses it as a safety net, whereas the latter uses it to match a bunch of samples across two lists. I think if an FSAB-style wrapper around this WDL gets developed, that wrapper should run the tool and be responsible for both steps at once. Of course, adding an optional task to this WDL wouldn't hurt if someone wants to add it (a sketch of what that might look like is below). At the moment I wanted this tool to be closer to BenchmarkVCFs than to FSAB, but I imagine a wrapper that accepts a list of samples on either side should eventually be developed and replace this as the de facto benchmarking workflow.
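For reference, an optional safety-net task could look something like this sketch (argument names follow the Picard CrosscheckFingerprints tool bundled with GATK; the task name, input names, and docker tag are assumptions, not code from the old BenchmarkVCFs WDL):

```wdl
version 1.0

task CrosscheckFingerprints {
    input {
        File query_vcf
        File query_vcf_index
        File base_vcf
        File base_vcf_index
        File haplotype_map
    }

    command <<<
        # Compare fingerprints between the query and base VCFs as a sanity
        # check that they come from the same sample.
        gatk CrosscheckFingerprints \
            --INPUT ~{query_vcf} \
            --SECOND_INPUT ~{base_vcf} \
            --HAPLOTYPE_MAP ~{haplotype_map} \
            --OUTPUT crosscheck_metrics.txt
    >>>

    runtime {
        docker: "broadinstitute/gatk:4.2.6.1"
    }

    output {
        File metrics = "crosscheck_metrics.txt"
    }
}
```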
More Things to Do
There are a few more things that might be useful to finish off at some point in the future. These include:
Add template inputs.json files for common use cases (e.g. benchmarking against GIAB).
Add fingerprinting check on inputs as an option.
Add optional IGV Session generation.
Add ROC data plot support to the multi-sample BenchmarkBoard. Other improvements will come with future releases of Quickboard; in particular, combining the single-sample and multi-sample visualizers will probably happen once more features are readily accessible in Quickboard.