broadinstitute/gatk-evaluation


Tune hyperparameters and evaluate performance of GATK gCNV #4

Open asmirnov239 opened 6 years ago

asmirnov239 commented 6 years ago

@samuelklee Here is the breakdown of the plan that came out of the discussion with @mbabadi.

Hyperparameter selection

We need a fully automated pipeline that runs gCNV across different hyperparameter combinations, measures performance, and outputs the best-performing combination (see the sketch after the list below). TODOs:

- Evaluate concordance with Genome STRiP calls on TCGA WGS samples
- Evaluate using Taiwanese WES trios
- Evaluate using CHM-HGSV WGS samples
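
As a starting point, here is a minimal sketch of what the naive sweep driver could look like. The hyperparameter names and the `run_gcnv_and_score` hook are placeholders, not actual gCNV arguments, and would need to be wired to the real gCNV WDL and whichever concordance metric we settle on:

```python
import itertools

# Hypothetical grid: the names echo gCNV-style priors, but the exact argument
# names and reasonable ranges need to be checked against GermlineCNVCaller.
PARAM_GRID = {
    "p_alt": [1e-6, 1e-4],
    "interval_psi_scale": [1e-3, 1e-2],
    "cnv_coherence_length": [1e3, 1e4],
}

def run_gcnv_and_score(params):
    """Placeholder: launch the gCNV WDL with `params` and return a single
    performance metric (e.g. F1 against the chosen truth call set)."""
    raise NotImplementedError("wire this up to the evaluation WDL")

def naive_sweep(grid):
    """Enumerate all combinations, score each, and return the best one."""
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = run_gcnv_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The driver is deliberately dumb (enumerate, run, score, pick the best) so the per-combination runs stay independent and can be parallelized later.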

samuelklee commented 6 years ago

Thanks again for putting this together! This looks like a very reasonable plan.

Just to be clear, the primary goal of these preliminary evaluations is to automate the demonstration of baseline performance that is better than XHMM and CODEX for both rare and common events.

Thus, regarding the hyperparameter search, it is OK if it is not "exhaustive" at first. We should reduce the dimensionality of the search space by quickly settling on reasonable values for the parameters that will not strongly affect results, even if these values may not be "optimal". For the remaining parameters, we can start by sampling the corners of the hypercube. As long as we hit a point that looks sufficiently good, we're set.
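
To make the corner-sampling idea concrete, something like this would enumerate the 2^d corners of the free parameters (the parameter names and bounds below are illustrative only, not actual gCNV arguments):

```python
import itertools

def hypercube_corners(bounds):
    """Enumerate the 2**d corner settings of the free-parameter hypercube.

    `bounds` maps each free hyperparameter to its (low, high) extreme;
    everything else stays fixed at the "reasonable" values we settle on.
    """
    names = list(bounds)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(bounds[n] for n in names))]

# Two free parameters (names illustrative only) -> 4 corner runs:
corners = hypercube_corners({"p_alt": (1e-6, 1e-3),
                             "cnv_coherence_length": (1e3, 1e5)})
```

With three or four free parameters that is only 8 or 16 runs per evaluation.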

We should be especially aggressive about cutting down the space for WGS, which may get expensive. We'll also have to strike a good balance between the number of truth events available at our chosen resolution and the cost of running at that resolution.

Putting together a WDL/script that performs a more intelligent and exhaustive search is certainly important, but we won't want to actually run it until more major issues (coverage collection, etc.) are settled. Furthermore, work towards a more intelligent strategy will definitely be quite exploratory, so it should probably wait until after the naive search WDL/script is complete and the poster is done. One (perhaps harebrained) option I mentioned to @mbabadi would be to use a Latin hypercube to sample the parameter space and then regress the performance metric using a neural network, as is done in https://arxiv.org/abs/1708.00011. (Incidentally, it looks like the arXiv is down...I don't believe I've ever seen that!)
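
For reference, here is a rough sketch of the Latin-hypercube-plus-regression idea, assuming scipy's `qmc` module and scikit-learn's `MLPRegressor`; the bounds and the toy `evaluate_gcnv` stand-in are purely illustrative:

```python
import numpy as np
from scipy.stats import qmc
from sklearn.neural_network import MLPRegressor

# Illustrative log10-scale bounds for three free hyperparameters.
l_bounds, u_bounds = [-6.0, -4.0, 3.0], [-3.0, -1.0, 5.0]

def evaluate_gcnv(x):
    """Toy stand-in for the real metric: replace with a gCNV run plus the
    concordance score from the evaluation WDL."""
    return -float(np.sum((x - np.array([-4.5, -2.5, 4.0])) ** 2))

# 1. Latin hypercube sample of the search space.
sampler = qmc.LatinHypercube(d=3, seed=0)
X = qmc.scale(sampler.random(n=40), l_bounds, u_bounds)

# 2. Run gCNV at each sampled point and record the metric.
y = np.array([evaluate_gcnv(x) for x in X])

# 3. Regress the metric with a small neural network, then query a dense
#    sample of the space for a promising region to refine.
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000,
                     random_state=0).fit(X, y)
dense = qmc.scale(qmc.LatinHypercube(d=3, seed=1).random(n=10000),
                  l_bounds, u_bounds)
best_guess = dense[np.argmax(model.predict(dense))]
```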

Finally, let's touch base with Dan and Tim to see if they're willing to share their trio concordance scripts. We should still write our own, but there's no reason we can't base ours on theirs to start.
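
In the meantime, here is a bare-bones sketch of the kind of check I have in mind (not necessarily their approach): score each child call by whether a same-type parental call matches it at some reciprocal-overlap threshold, with the unmatched fraction serving as a crude de novo / false-positive proxy. The `Call` record and the threshold are just placeholders:

```python
from typing import List, NamedTuple

class Call(NamedTuple):
    contig: str
    start: int
    end: int
    svtype: str  # "DEL" or "DUP"

def reciprocal_overlap(a: Call, b: Call) -> float:
    """Reciprocal overlap of two same-type calls (0 if disjoint or on different contigs)."""
    if a.contig != b.contig or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))

def transmission_rate(child: List[Call], mother: List[Call], father: List[Call],
                      min_ro: float = 0.5) -> float:
    """Fraction of child calls matched by a same-type parental call at >= min_ro
    reciprocal overlap; the complement is a crude de novo / false-positive proxy."""
    if not child:
        return float("nan")
    parents = mother + father
    matched = sum(any(reciprocal_overlap(c, p) >= min_ro for p in parents)
                  for c in child)
    return matched / len(child)
```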

samuelklee commented 6 years ago

Also, I think I'd slightly prefer hyperparameter selection based on a subset of chromosomes/samples, rather than an orthogonal cohort. It would be good to limit the total number of BAMs used across all evaluations as much as we can. This may have some limitations, but hopefully it'll work out.

samuelklee commented 6 years ago

@asmirnov239 I've finished the CODEX tasks and have only plotting left assigned to me. @mbabadi suggested that I throw together some prototypes for bait-based coverage tools until you guys are ready for me to do something else. But if you need help putting together the trio scripts (which is probably higher priority for the poster), I can do that instead.