samuelklee opened 2 years ago
ExtractVariantAnnotations:
This tool extracts annotations, labels, and other relevant metadata from variants (or alleles, in allele-specific mode) that do or do not overlap with specified resources. The former are considered labeled, and each variant/allele can have multiple labels. The latter are considered unlabeled and can be randomly downsampled using reservoir sampling; extraction of these is optional.
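The unlabeled downsampling step can be sketched with standard reservoir sampling (Algorithm R). This is a minimal illustration, not the tool's actual implementation; the variant stream here is just a stand-in iterable:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            # item i survives with probability k / (i + 1)
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Since each variant is seen only once, memory stays bounded at k items regardless of how many unlabeled sites the input VCF contains.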
The outputs of the tool are HDF5 files containing the extracted data for the labeled and (optional) unlabeled variant sets, as well as a sites-only VCF containing the labeled variants. This VCF can in turn be used in ScoreVariantAnnotations to specify an additional "extracted" label, which can be useful for indicating the sites that were actually extracted from the provided resources (since we may only extract over a subset of the genome).
TODOs:
[x] Integration tests. Putting together tests using chr1:1-10M snippets of 1) the 1kgp-50-exomes sites-only VCF for the input (since this has both non-AS and AS annotations; EDIT: Scratch that, it only has AS_InbreedingCoeff and AS_QD), 2) the Omni SNP training/truth VCF (yielding ~3.5k training sites), and 3) the Mills training/truth VCF (yielding ~500 training sites). Incidentally, VariantRecalibrator SNP and INDEL runs both fail to converge on these small training sets without the #7709 fix, but do converge with it. I still need to check if enough multiallelics are included here; if not, I'll choose a different snippet. EDIT 2: Now using gs://broad-gotc-test-storage/joint_genotyping/exome/scientific/truth/master/gather_vcfs_high_memory/small_callset_low_threshold.vcf.gz provided by @ldgauthier, which does have AS annotations.
We'll use expected outputs here as inputs to downstream steps, but rather than provide the expected outputs directly, we'll create copies of them and provide those as inputs. This will make the tests better encapsulated. However, it should be relatively easy to update the whole chain of test files, should one choose to do so. EDIT: Let's just provide the expected outputs directly. That makes it even easier to update the whole chain: just set the flags for all three tools to overwrite the expected results.
We test the Cartesian product of the following options: 1) non-allele-specific vs. allele-specific, 2) SNP vs. indel vs. both, and 3) positive vs. positive-unlabeled. Downstream, we'll restrict to a subset of these options, since training/scoring functionality shouldn't really change across some of them.
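The test matrix itself can be generated mechanically; a sketch using itertools, where the axis values are illustrative stand-ins rather than the tools' actual argument values:

```python
from itertools import product

# illustrative axis values, not actual tool arguments
MODES = ["non-allele-specific", "allele-specific"]
VARIANT_TYPES = ["snp", "indel", "snp-and-indel"]
TRAINING = ["positive", "positive-unlabeled"]

# full Cartesian product: 2 x 3 x 2 = 12 test cases
test_cases = list(product(MODES, VARIANT_TYPES, TRAINING))
```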
I'm currently just shelling out to the diff and h5diff system commands to compare the VCFs and HDF5s, respectively. I think the latter command should be available in the GATK Conda environment. This will be a bit awkward, in the sense that the tests for this tool will require the Conda environment, but the tool itself will not. But I think this is probably preferable to writing test code to compare HDF5s, minimal though that might be, since the schema might change in the future.
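That comparison strategy amounts to shelling out and checking the exit status; a hedged sketch (`assert_outputs_match` is a hypothetical helper, and `h5diff` is only present when the GATK Conda environment is active):

```python
import subprocess

def assert_outputs_match(expected, actual, cmd="diff"):
    """Compare two files by shelling out to diff (or h5diff for HDF5 files)."""
    result = subprocess.run([cmd, str(expected), str(actual)],
                            capture_output=True, text=True)
    # both diff and h5diff exit nonzero when the files differ
    assert result.returncode == 0, f"{cmd} reported differences:\n{result.stdout}"
```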
Minor TODOs:
- [ ] `resources` parameter once the required labels are settled.

Future work:
TrainVariantAnnotationsModel:
Trains a model for scoring variant calls based on site-level annotations.
TODOs:
Minor TODOs:
Future work:
ScoreVariantAnnotations:
Scores variant calls in a VCF file based on site-level annotations using a previously trained model.
TODOs:
Minor TODOs:
Future work:
The `score_samples` method of the sklearn IsolationForest is single-threaded. See the (possibly stalled) PR at https://github.com/scikit-learn/scikit-learn/pull/14001 and some workarounds using e.g. multiprocessing.
ibid.

Bayesian GMM:
This is essentially an exact port of the sklearn implementation, but only allowing for full covariance matrices. I think it might be good for those in the Bishop reading group to take a look during review.
I decided to split this off into its own branch (just updated the existing branch https://github.com/broadinstitute/gatk/tree/sl_sklearn_bgmm_port) and only include stubs for the BGMM backend in the above tools. This is so we can prioritize merging the IsolationForest implementation for @meganshand. We can easily add this module back when it's been reviewed separately.
TODOs:
Future work:
@samuelklee Some of our collaborators are currently working on updating CNNScoreVariants
to use PyTorch -- is that project relevant to this ticket?
Thanks for the question @droazen. No, these tools are meant to be an update to VQSR; i.e., they do not assume that the BAM/reads will be available, and they use only the annotations.
I think such tools will remain useful going forward, especially for joint genotyping. We can probably eventually push CNN/etc.-based generation of additional features/annotations from the BAM/reads upstream of filtering, so that they’re generated at the same time as our traditional “handcrafted” annotations, after which we can throw everything through the annotation-based filtering tools here.
Rebasing, squashing, and reorganizing files into new commits to prep for the PR, but here's a copy of the commit messages for posterity: commits-before-rebase.txt
PR Punts:
- `VariantType` class for classifying sites as SNP or INDEL. We mostly retained the VQSR code and logic to make head-to-head comparisons easier. Note also that we converted some switch statements to conditionals, etc. (which I think was done properly, but maybe I missed an edge case). See https://github.com/broadinstitute/gatk/pull/7954#discussion_r934776584.

Next steps:
A few minor issues:
- `--resource <blah>` to `--resource:<blah>` in tool-level documentation. EDIT: Added to the sl_lite_overlap branch mentioned below.
This is a meta issue to track remaining and future work for the new tools for annotation-based filtering, which will hopefully replace VQSR. Internal developers may want to see further discussion at https://github.com/broadinstitute/dsp-methods-model-prototyping/discussions/9.