Tune hyperparameters and evaluate performance of GATK gCNV

asmirnov239 commented 6 years ago

@samuelklee Here is the breakdown of the plan that came out of discussion with @mbabadi

Hyperparameter selection

We need a fully automated pipeline that runs gCNV and measures performance across different hyperparameter combinations and outputs the most optimal one. TODOs:

[ ] Write a WDL script that produces a gCNV "performance score" (such as area under ROC curve) given hyperparameter values and a "truth" VCF on a matching cohort of samples. This includes:
- [ ] Wrapping gCNV cohort mode WDL into workflow that takes in a hyperparameter value file (in JSON format) and calling evaluation script. Issue #2
- [ ] Adding and streamlining evaluation script that can take a list of VCFs and compare it to a truth VCF. Issue #3
[ ] Write WDL for hyperparameter search. Since the space of hyperparameter combinations is too big (~10000 options) we cannot run a simple grid search to optimize hyperparameters. To explore this space in a more tractable way, for the first attempt, we can uniformly sample the hyperparameter space and produce gCNV results with randomly sampled hyperparameter values from which we can infer the optimal hyperparameters. Issue #5
[ ] Because we need to produce a concordance with TCGA matched normals we cannot use it as a validation set to choose hyperparameters. Thus we agree on a hyperparameter selection procedure - for example using GPC2 or subset of chromosomes as a validation set or subset of TCGA samples.
[ ] (Optional) If time allows develop a more sophisticated hyperparameter selection procedure using a Bayesian optimization techniques.
[ ] Find values for optimal hyperparameters to be used in evaluations.

Evaluating concordance with Genome STRiP calls on TCGA WGS samples

[ ] Curate Genome STRiP calls. Issue #6
[x] Build docker for CODEX. Issue #7
[x] Write WDL for CODEX. Issue #7
[ ] Add plotting scripts for concordance. Issue #8
[ ] Run gCNV with optimal hyperparameters, and compare its concordance with CODEX and XHMM results

Evaluation using Taiwanese WES trios

[ ] Write concordance scripts. Compute copy-number transmission and de novo rates in a cohort of WES trios. Issue #9
[ ] Run gCNV and compute concordance metrics

Evaluation using CHM-HGSV WGS samples

[ ] Run gCNV with optimal hyperparameters, and use calls on PacBio long reads as truth

samuelklee commented 6 years ago

Thanks again for putting this together! This looks like a very reasonable plan.

Just to be clear, the primary goal for these preliminary evaluations is to automate the demonstration of baseline performance that is better than XHMM and CODEX for both rare and common events.

Thus, regarding the hyperparameter search, it is OK if this is not "exhaustive" at first. We should cut down the dimension of the search space by quickly settling on reasonable values for those parameters that will not strongly affect results, even if these may not be "optimal". For the other parameters, we can start by sampling corners of the hypercube. As long as we hit a point that looks sufficiently good, we're set.

We should be especially aggressive in cutting down the space for WGS, which may get expensive. We'll have to strike a good balance between number of truth events available at our chosen resolution and the cost of running at that resolution, as well.

Putting together a WDL/script that performs a more intelligent and exhaustive search is certainly important, but we will not want to really run it until more major issues like coverage collection, etc. are settled. Furthermore, work towards a more intelligent strategy will definitely be quite exploratory, so it should probably wait until after the naive search WDL/script is complete and the poster is done. One (perhaps harebrained) option I mentioned to @mbabadi would be to use a Latin hypercube to sample the parameter space and then regress the performance metric using a neural network, as is done in https://arxiv.org/abs/1708.00011 (incidentally, it looks like the arXiv is down...I don't believe I've ever seen that!)

Finally, let's touch base with Dan and Tim to see if they're willing to share their trio concordance scripts. We should still write our own, but there's no reason we can't base ours off of theirs to start.

samuelklee commented 6 years ago

Also, I think I'd slightly prefer hyperparameter selection based on a subset of chromosomes/samples, rather than an orthogonal cohort. It would be good to limit the total number of BAMs used across all evaluations as much as we can. This may have some limitations, but hopefully it'll work out.

samuelklee commented 6 years ago

@asmirnov239 I've finished the CODEX tasks and have only plotting left assigned to me. @mbabadi suggested that I throw together some prototypes for bait-based coverage tools until you guys are ready for me to do something else. But if you need help putting together the trio scripts (which is probably higher priority for the poster), I can do that instead.

broadinstitute / gatk-evaluation