Build automatic evaluation of gCNV pipeline and establish best practices.

samuelklee commented 6 years ago

@mbabadi has some python scripts that we can expand upon. To start, this should include:

[ ] WES evaluations using matched WGS as GT
[ ] WGS evaluations using consensus calls from SV callers as GT
[ ] Trio evaluations
[ ] Comparisons against ModelSegments, XHMM, CODEX2, CLAMMS, GenomeSTRiP, perhaps others (CANVAS?)

See also #2881.

ldgauthier commented 6 years ago

The Talkowski lab also runs cn.mops and CNVnator for read depth calling. No idea how easy those are to run or if they run on exomes.

asmirnov239 commented 6 years ago

Here is a recap of what we discussed today during the CNV meeting:

For the first round of evaluations, we decided to run the Germline CNV pipeline on TCGA exomes over a range of key hyperparameters (namely psi-t-scale and p-alt) and establish baseline performance metrics using the output of GenomeSTRiP on matched WGS samples as ground truth.

@mbabadi could you come up with a good range of hyperparameter values that you think should be cross-validated?

In particular we need to:

Dockerize the tools we will be evaluating against (XHMM, CODEX2, CLAMMS, GenomeSTRiP)
Write a WDL that runs Germline CNV, scatters across a range of key hyperparameters, and outputs an array of VCFs
Write VCF processors for output of CLAMMS and CODEX2
Write WDLs for running XHMM, CODEX2, CLAMMS and GenomeSTRiP that output VCFs
Write a WDL that takes the results of the above and uses @mbabadi's gCNV evaluation Python modules (located at /dsde/working/mehrtash/gCNV_theano_eval) to output performance metrics
Decide on an automatic evaluation framework
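
As a rough illustration of the hyperparameter scatter described above, here is a minimal Python sketch that enumerates a grid of settings which a WDL could then scatter over. The function name `build_grid`, the key spellings `psi_t_scale` and `p_alt`, and all numeric values are placeholders chosen for the example; the actual ranges are still to be decided.

```python
import itertools
import json


def build_grid(param_values):
    """Return one dict per combination of the supplied hyperparameter values."""
    names = sorted(param_values)
    combos = itertools.product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Placeholder values only; the real ranges have not been chosen yet.
grid = build_grid({
    "psi_t_scale": [1e-3, 1e-2, 1e-1],
    "p_alt": [1e-6, 1e-4],
})

# One JSON object per scatter shard, e.g. to feed into a workflow inputs file.
grid_json = json.dumps(grid, indent=2)
```

Each element of `grid` corresponds to one run of the pipeline, so the number of output VCFs per sample equals `len(grid)`.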

For the next round of evaluations we need to:

Decide on appropriate metrics for evaluating performance on trios and write scripts that implement them
Expand the range of hyperparameters in the search space (possibly including different bin sizes, GC vs. no GC correction, and fragment midpoint coverage collection vs. largest fragment overlap coverage collection)
Use a gnomAD subset of matched WES/WGS pairs for validation

samuelklee commented 6 years ago

Don't forget cn.mops (which will also be included in the somatic evaluations) and ModelSegments. We can also throw CNVnator in the mix. We'll run each tool on WES, WGS, or both as appropriate.

samuelklee commented 6 years ago

Also, let's use WDLs and Dockers from other groups, where available. In particular, I would hope that these are available for XHMM, GenomeSTRiP, and the tools that the Talkowski lab runs. We can discuss with @cwhelan today, but @asmirnov239 and @ldgauthier could you do some digging to see what the MacArthur lab has?

ldgauthier commented 6 years ago

I'm pretty sure the MacArthur lab doesn't have any WDLs or Dockers. Somewhere I saw some shell scripts Menachem used when he ran XHMM on ExAC, but they're probably so old they're for LSF.

samuelklee commented 5 years ago

@asmirnov239 and Jack Fu are currently developing tests using Talkowski-SV truth that will ultimately cover #5633. Should be adapted to fit into whatever framework arises from #4630.

samuelklee commented 5 years ago

@asmirnov239 it might be worth summarizing the current status, for future reference. TODOs might also be useful.

asmirnov239 commented 5 years ago

The following work has been done:

We performed a round of evaluations against XHMM and cn.mops on a cohort of 160 samples from the SFARI project (which is described in our ASHG poster). For ground truth we used a callset generated by the Talkowski lab SV pipeline on matched whole genome samples. Unfortunately, the SFARI cohort is not public and cannot be used for public-facing evaluations.
Some hyperparameter tweaking was necessary to achieve good performance. The changes were mostly confined to the psi_t parameter.
We developed a clustering procedure based on the coverage profile at a set of targets that are highly variable across different capture kits.
We found that filtering the final callset on a QS metric significantly boosted specificity while lowering sensitivity only marginally.
We developed a prototype hyperparameter optimization framework that could be used in the future for general optimization of cost/performance parameters for all GATK pipelines.
We resolved several memory issues that came up during validations.
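
The QS filtering trade-off mentioned above can be sketched as follows. This is a toy illustration, not the evaluation code we used: the function name `filter_metrics`, the tuple layout, and the example numbers are all made up, each matching call is assumed to hit a distinct truth event, and precision is reported in place of specificity because true negatives are not counted here.

```python
def filter_metrics(calls, n_truth, qs_threshold):
    """Sensitivity and precision after keeping calls with QS >= qs_threshold.

    calls: list of (qs, matches_truth) tuples, one per call.
    n_truth: total number of events in the truth set.
    """
    kept = [matches for qs, matches in calls if qs >= qs_threshold]
    true_positives = sum(kept)
    sensitivity = true_positives / n_truth if n_truth else float("nan")
    precision = true_positives / len(kept) if kept else float("nan")
    return sensitivity, precision


# Made-up calls: (QS, whether the call matches a truth event).
toy_calls = [(90, True), (80, True), (40, False), (30, True), (10, False)]

unfiltered = filter_metrics(toy_calls, n_truth=4, qs_threshold=0)
filtered = filter_metrics(toy_calls, n_truth=4, qs_threshold=50)
```

Sweeping `qs_threshold` over a range of values gives the curve from which a filtering threshold can be picked.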

A few issues were encountered along the way:

Sensitivity and specificity on multiallelic (common) sites were significantly lower than on rare events.
Single target calling sensitivity was lower than 20%.
The pipeline WDL required optimization in order to handle whole genome data; however, these changes were not consolidated into the official WDL.

Ongoing work is currently focused on the following:

Improving sensitivity/specificity of calls in common regions. One solution being tested involves setting a prior for common regions derived from a high quality callset. A second solution is to set a different filtering threshold for common regions.
Consolidating validation scripts that process gCNV output and the outputs of competing tools and measure their performance against ground truth.
Analyzing 1000 Genomes exomes, which could potentially be used for public-facing automatic evaluations.

The following items are necessary for automatic evaluation:

Dataset + truth. We need access to a high quality public cohort with matched whole genomes. These genomes must have a corresponding high quality truth set generated from split-read/read-pair methods. From that cohort we need to find 50-200 relatively homogeneous samples.
An established validation workflow that outputs a set of predetermined metrics that are unlikely to change in the future, such as sensitivity/specificity stratified by event size and allele frequency.
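
To make the stratified metrics concrete, here is a minimal sketch of sensitivity binned by truth event size. Everything in it is an assumption for illustration: events are plain `(chrom, start, end)` tuples, a truth event counts as detected if any call reaches 50% reciprocal overlap, and the size bin edges are arbitrary. Stratifying by allele frequency would follow the same pattern with a frequency value attached to each truth event.

```python
from collections import defaultdict


def reciprocal_overlap(a, b):
    """Overlap length divided by the longer of the two intervals (0.0 if disjoint)."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return overlap / max(a[2] - a[1], b[2] - b[1])


def size_bin(length, edges=(5_000, 50_000)):
    """Label an event length by the first bin edge it falls below."""
    for edge in edges:
        if length < edge:
            return f"<{edge}"
    return f">={edges[-1]}"


def stratified_sensitivity(truth, calls, min_ro=0.5):
    """Fraction of truth events detected, per size bin of the truth event."""
    found = defaultdict(int)
    total = defaultdict(int)
    for event in truth:
        label = size_bin(event[2] - event[1])
        total[label] += 1
        if any(reciprocal_overlap(event, call) >= min_ro for call in calls):
            found[label] += 1
    return {label: found[label] / total[label] for label in total}


# Made-up intervals, purely to exercise the functions.
toy_truth = [("1", 0, 1_000), ("1", 10_000, 30_000), ("2", 0, 100_000)]
toy_calls = [("1", 100, 1_000), ("2", 0, 40_000)]
per_bin = stratified_sensitivity(toy_truth, toy_calls)
```

The pairwise scan is quadratic in the number of events, which is fine for a sketch; a real workflow would likely use an interval index.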

broadinstitute / gatk

Build automatic evaluation of gCNV pipeline and establish best practices. #4123