broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

New tools for annotation-based filtering. #7724

Open samuelklee opened 2 years ago

samuelklee commented 2 years ago

This is a meta issue to track remaining and future work for the new tools for annotation-based filtering, which will hopefully replace VQSR. Internal developers may want to see further discussion at https://github.com/broadinstitute/dsp-methods-model-prototyping/discussions/9.

samuelklee commented 2 years ago

ExtractVariantAnnotations:

This tool extracts annotations, labels, and other relevant metadata from variants (or alleles, in allele-specific mode) that do or do not overlap with specified resources. Variants that overlap are considered labeled, and each variant/allele can have multiple labels. Variants that do not overlap are considered unlabeled and can be randomly downsampled using reservoir sampling; extraction of these is optional.
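For readers unfamiliar with reservoir sampling, here is a minimal Python sketch of the classic single-pass approach (Algorithm R). It is an illustration only, not the tool's actual code; the function name `reservoir_sample` and the variant ID strings are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration; not part of GATK.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces a kept item
            # only if the slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream of unlabeled variant IDs, downsampled to 5.
unlabeled = (f"variant_{n}" for n in range(1000))
sample = reservoir_sample(unlabeled, k=5, seed=42)
```

Because each item is visited once and only `k` items are held in memory, this kind of downsampling works without knowing in advance how many unlabeled variants will be encountered.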

The outputs of the tool are HDF5 files containing the extracted data for the labeled and (optionally) unlabeled variant sets, as well as a sites-only VCF containing the labeled variants. This VCF can in turn be passed to ScoreVariantAnnotations to specify an additional "extracted" label, which is useful for indicating the sites that were actually extracted from the provided resources (since we may only extract over a subset of the genome).

TODOs:

Minor TODOs:

Future work:

samuelklee commented 2 years ago

TrainVariantAnnotationsModel:

Trains a model for scoring variant calls based on site-level annotations.

TODOs:

Minor TODOs:

Future work:

samuelklee commented 2 years ago

ScoreVariantAnnotations:

Scores variant calls in a VCF file based on site-level annotations using a previously trained model.
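As a rough illustration of the train-then-score flow, the sketch below fits scikit-learn's `IsolationForest` on synthetic "annotation" vectors and then scores new ones. This is an assumption-laden stand-in, not the tools' implementation: the helper names `train_model` and `score_annotations`, the three-column annotation matrix, and all data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def train_model(training_annotations: np.ndarray, seed: int = 0) -> IsolationForest:
    """Fit an isolation forest on annotations of labeled training sites (hypothetical helper)."""
    model = IsolationForest(n_estimators=100, random_state=seed)
    model.fit(training_annotations)
    return model


def score_annotations(model: IsolationForest, annotations: np.ndarray) -> np.ndarray:
    """Return one score per row; higher means more similar to the training data."""
    return model.score_samples(annotations)


# Synthetic stand-in for site-level annotations (rows = sites, columns = annotations).
rng = np.random.default_rng(0)
training = rng.normal(loc=0.0, scale=1.0, size=(500, 3))

model = train_model(training)
typical_site = np.zeros((1, 3))       # resembles the training sites
unusual_site = np.full((1, 3), 8.0)   # far from anything seen in training
typical_score = score_annotations(model, typical_site)[0]
unusual_score = score_annotations(model, unusual_site)[0]
```

A filtering threshold would then be applied to the scores; how that threshold is chosen is outside the scope of this sketch.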

TODOs:

Minor TODOs:

Future work:

samuelklee commented 2 years ago

Bayesian GMM:

This is essentially an exact port of the sklearn implementation, but only allowing for full covariance matrices. I think it might be good for those in the Bishop reading group to take a look during review.
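For reference, the sklearn model being ported can be exercised as follows. This is a minimal sketch on synthetic data that shows only the sklearn side with `covariance_type="full"`; the component count and the data are arbitrary assumptions, and nothing here reflects the port's own interface.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic 2-D data with two well-separated clusters (illustrative only).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2))
X = np.vstack([cluster_a, cluster_b])

# Full covariance matrices: one d x d matrix per mixture component.
bgmm = BayesianGaussianMixture(
    n_components=4,          # upper bound; unused components get small weights
    covariance_type="full",
    max_iter=500,
    random_state=0,
)
bgmm.fit(X)

# Per-sample log-likelihood under the fitted mixture.
log_likelihoods = bgmm.score_samples(X)
```

With full covariances, `bgmm.covariances_` has shape `(n_components, n_features, n_features)`, which is the only covariance structure the port is described as supporting.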

I decided to split this off into its own branch (just updated the existing branch https://github.com/broadinstitute/gatk/tree/sl_sklearn_bgmm_port) and only include stubs for the BGMM backend in the above tools. This is so we can prioritize merging the IsolationForest implementation for @meganshand. We can easily add this module back when it's been reviewed separately.

TODOs:

Future work:

droazen commented 2 years ago

@samuelklee Some of our collaborators are currently working on updating CNNScoreVariants to use PyTorch -- is that project relevant to this ticket?

samuelklee commented 2 years ago

Thanks for the question @droazen. No, these tools are meant more as an update to VQSR, i.e., they use only the annotations and do not assume that the BAM/reads will be available.

I think such tools will remain useful going forward, especially for joint genotyping. We can probably eventually push CNN/etc.-based generation of additional features/annotations from the BAM/reads upstream of filtering, so that they’re generated at the same time as our traditional “handcrafted” annotations, after which we can throw everything through the annotation-based filtering tools here.

samuelklee commented 2 years ago

Rebasing, squashing, and reorganizing files into new commits to prep for the PR, but here's a copy of the commit messages for posterity: commits-before-rebase.txt

samuelklee commented 2 years ago

PR Punts:

Next steps:

samuelklee commented 2 years ago

A few minor issues: