samuelklee opened 2 years ago
ExtractVariantAnnotations:
This tool extracts annotations, labels, and other relevant metadata from variants (or alleles, in allele-specific mode) that do or do not overlap with specified resources. The former are considered labeled, and each variant/allele can have multiple labels. The latter are considered unlabeled and can be randomly downsampled using reservoir sampling; extraction of these is optional.
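The unlabeled downsampling step can be sketched with standard reservoir sampling (Algorithm R). This is a minimal illustration, not the tool's actual implementation; the variant stream here is just a stand-in iterable:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            # item i survives with probability k / (i + 1)
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Since each variant is seen only once, memory stays bounded at k items regardless of how many unlabeled sites the input VCF contains.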
The outputs of the tool are HDF5 files containing the extracted data for the labeled and (optional) unlabeled variant sets, as well as a sites-only VCF containing the labeled variants. This VCF can in turn be used in ScoreVariantAnnotations to specify an additional "extracted" label, which can be useful for indicating the sites that were actually extracted from the provided resources (since we may only extract over a subset of the genome).
TODOs:
[x] Integration tests. Putting together tests using chr1:1-10M snippets of 1) the 1kgp-50-exomes sites-only VCF for the input (since this has both non-AS and AS annotations; EDIT: Scratch that, it only has AS_InbreedingCoeff and AS_QD), 2) the Omni SNP training/truth VCF (yielding ~3.5k training sites), and 3) the Mills training/truth VCF (yielding ~500 training sites). Incidentally, VariantRecalibrator SNP and INDEL runs both fail to converge on these small training sets without the #7709 fix, but do converge with it. I still need to check if enough multiallelics are included here; if not, I'll choose a different snippet. EDIT 2: Now using gs://broad-gotc-test-storage/joint_genotyping/exome/scientific/truth/master/gather_vcfs_high_memory/small_callset_low_threshold.vcf.gz provided by @ldgauthier, which does have AS annotations.
We'll use expected outputs here as inputs to downstream steps, but rather than provide the expected outputs directly, we'll create copies of them and provide those as inputs. This will make the tests better encapsulated. However, it should be relatively easy to update the whole chain of test files, should one choose to do so. EDIT: Let's just provide the expected outputs directly. That makes it even easier to update the whole chain: just set the flags for all three tools to overwrite the expected results.
We test the Cartesian product of the following options: 1) non-allele-specific vs. allele-specific, 2) SNP vs. indel vs. both, and 3) positive vs. positive-unlabeled. Downstream, we'll restrict to a subset of these options, since training/scoring functionality shouldn't really change across some of them.
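The test matrix itself can be generated mechanically; a sketch using itertools, where the axis values are illustrative stand-ins rather than the tools' actual argument values:

```python
from itertools import product

# illustrative axis values, not actual tool arguments
MODES = ["non-allele-specific", "allele-specific"]
VARIANT_TYPES = ["snp", "indel", "snp-and-indel"]
TRAINING = ["positive", "positive-unlabeled"]

# full Cartesian product: 2 x 3 x 2 = 12 test cases
test_cases = list(product(MODES, VARIANT_TYPES, TRAINING))
```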
I'm currently just shelling out to the diff and h5diff system commands to compare the VCFs and HDF5s, respectively. I think the latter command should be available in the GATK Conda environment. This will be a bit awkward, in the sense that the tests for this tool will require the Conda environment, but the tool itself will not. But I think this is probably preferable to writing test code to compare HDF5s, minimal though that might be, since the schema might change in the future.
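That comparison strategy amounts to shelling out and checking the exit status; a hedged sketch (`assert_outputs_match` is a hypothetical helper, and `h5diff` is only present when the GATK Conda environment is active):

```python
import subprocess

def assert_outputs_match(expected, actual, cmd="diff"):
    """Compare two files by shelling out to diff (or h5diff for HDF5 files)."""
    result = subprocess.run([cmd, str(expected), str(actual)],
                            capture_output=True, text=True)
    # both diff and h5diff exit nonzero when the files differ
    assert result.returncode == 0, f"{cmd} reported differences:\n{result.stdout}"
```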
Minor TODOs:
- [ ] `resources` parameter once the required labels are settled.

Future work:
TrainVariantAnnotationsModel:
Trains a model for scoring variant calls based on site-level annotations.
TODOs:
Minor TODOs:
Future work:
ScoreVariantAnnotations:
Scores variant calls in a VCF file based on site-level annotations using a previously trained model.
TODOs:
Minor TODOs:
Future work:
The `score_samples` method of the sklearn IsolationForest is single-threaded. See the (possibly stalled) PR at https://github.com/scikit-learn/scikit-learn/pull/14001 and some workarounds using e.g. multiprocessing.
ibid.

Bayesian GMM:
This is essentially an exact port of the sklearn implementation, but only allowing for full covariance matrices. I think it might be good for those in the Bishop reading group to take a look during review.
I decided to split this off into its own branch (just updated the existing branch https://github.com/broadinstitute/gatk/tree/sl_sklearn_bgmm_port) and only include stubs for the BGMM backend in the above tools. This is so we can prioritize merging the IsolationForest implementation for @meganshand. We can easily add this module back when it's been reviewed separately.
TODOs:
Future work:
@samuelklee Some of our collaborators are currently working on updating CNNScoreVariants
to use PyTorch -- is that project relevant to this ticket?
Thanks for the question @droazen. No, these tools are meant to be an update to VQSR; i.e., they do not assume that the BAM/reads will be available, and they use only the annotations.
I think such tools will remain useful going forward, especially for joint genotyping. We can probably eventually push CNN/etc.-based generation of additional features/annotations from the BAM/reads upstream of filtering, so that they’re generated at the same time as our traditional “handcrafted” annotations, after which we can throw everything through the annotation-based filtering tools here.
Rebasing, squashing, and reorganizing files into new commits to prep for the PR, but here's a copy of the commit messages for posterity: commits-before-rebase.txt
PR Punts:
- `VariantType` class for classifying sites as SNP or INDEL. We mostly retained the VQSR code and logic to make head-to-head comparisons easier. Note also that we converted some switch statements to conditionals, etc. (which I think was done properly, but maybe I missed an edge case). See https://github.com/broadinstitute/gatk/pull/7954#discussion_r934776584.

Next steps:
A few minor issues:
- `--resource <blah>` to `--resource:<blah>` in tool-level documentation. EDIT: Added to the sl_lite_overlap branch mentioned below.
This is a meta issue to track remaining and future work for the new tools for annotation-based filtering, which will hopefully replace VQSR. Internal developers may want to see further discussion at https://github.com/broadinstitute/dsp-methods-model-prototyping/discussions/9.