Build automatic evaluation of somatic CNV pipeline and establish best practices.

samuelklee commented 6 years ago

@LeeTL1220 has already made substantial progress on this front, but I think we can improve ground truth sets and expand the number and type of evaluation metrics calculated. To start, this will include:

[ ] TCGA WGS
[ ] HCC1143 WES purity series
[ ] HCC1143 reproducibility

We should decide on a small set of tools to compare against as well, including GATK CNV/ACNV.

We should look for other gold-standard somatic callsets and build the evaluation infrastructure so that these can be easily added as sources of GT.

We should also evaluate performance on germline and compare with gCNV (see #4123).

@MartonKN will need this evaluation (and may need to help design some aspects of it) for #4115.

See also #2881.

LeeTL1220 commented 6 years ago

@samuelklee You have a "see #4122" above, but that is this issue. Does the citation above need to be fixed?

samuelklee commented 6 years ago

Oops, thanks, fixed.

samuelklee commented 5 years ago

Evaluation of THCA/STAD/LUAD TCGA WGS/WES CR concordance with SNP arrays was implemented on FC last summer and showed good performance. For WES, comparisons against GATK CNV and CODEX showed comparable to highly improved performance, respectively, with minimal parameter tuning. WGS comparisons were unavailable due to limitations of competing tools.

This evaluation will be expanded to include CR/MAF concordance against PanCanAtlas ABSOLUTE results. Some curation of the samples could be performed, since batch effects were observed in some LC WGS LUAD samples. Comparisons to other tools will probably be removed for ease of maintenance. The evaluation will be adapted to fit into whatever framework arises from #4630; the same goes for the HCC1143 and CRSP validations.

samuelklee commented 5 years ago

Note that no segmentation or resolution parameters have been tuned yet for either performance or runtime. Some of these are very easy wins. For example, kernel-approximation-dimension is set to a default of 100, and the time for segmentation scales roughly linearly with this (the documentation erroneously states that the scaling is quadratic; this should be fixed, my bad). In practice, setting this to as little as 2 seems to work OK for some cases, so we should evaluate this more rigorously. This can cut WGS segmentation down from ~10 minutes (out of the total ~60 minutes for 250 bp bins, typically) to ~1 minute.
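
A more rigorous evaluation of kernel-approximation-dimension could start from a simple sweep. The sketch below only builds the ModelSegments command lines for a few values; the input file name, output directory, prefix, and the particular values swept are hypothetical, and the flag names should be checked against the ModelSegments documentation for the GATK version in use. Timing and scoring each run are left out.

```python
import shlex


def model_segments_command(denoised_cr, output_dir, prefix, kernel_dim):
    """Build one `gatk ModelSegments` command line for a given
    kernel-approximation-dimension value (paths are placeholders)."""
    return [
        "gatk", "ModelSegments",
        "--denoised-copy-ratios", denoised_cr,
        "--kernel-approximation-dimension", str(kernel_dim),
        "--output", output_dir,
        # Tag the output prefix so results from different runs do not collide.
        "--output-prefix", f"{prefix}.kad{kernel_dim}",
    ]


def sweep_commands(denoised_cr, output_dir, prefix, dims=(2, 10, 50, 100)):
    """Return one command per kernel dimension; `dims` is an arbitrary choice."""
    return [model_segments_command(denoised_cr, output_dir, prefix, d) for d in dims]


if __name__ == "__main__":
    # Hypothetical inputs; print the commands rather than running them.
    for cmd in sweep_commands("tumor.denoisedCR.tsv", "segments", "tumor"):
        print(shlex.join(cmd))
```

Each printed command could then be run under a timer and its segments compared against the default (100) run, which would put numbers on the "seems to work OK" observation.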

samuelklee commented 5 years ago

Revamping the existing somatic validation pipeline needs to be done before development of the TH prototype can continue.

[ ] Identify test bed of TCGA samples from various tumor types. We can mix tumor-normal samples (as I've done at the counts/allelic-counts level in preliminary evaluations of the TH prototype) to expand the effective number of samples.
[ ] Determine minimal version of current CGA ABSOLUTE pipeline (to be used as a baseline for comparison).
[ ] Generate and manually curate ABSOLUTE results and narrow samples down to those with relatively robust solutions.
[ ] Construct ModelSegments/M2 -> ABSOLUTE pipeline (will at least require minor development/tuning of ModelSegments output -> ABSOLUTE input conversion script, may also require germline tagging, see related #5804) and evaluate.
[ ] Construct ModelSegments/M2 -> TH pipeline and evaluate.
[ ] Remove unsupported code/tools. See https://github.com/broadinstitute/gatk/pull/5450#issuecomment-461431199 for a summary. We should make sure that any users who would be affected by this are notified and can prepare accordingly.
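
For the first item in the list above, mixing at the counts level might look roughly like the following. This is a minimal sketch under stated assumptions, not the procedure used in the preliminary evaluations: it assumes per-bin read counts for a matched tumor and normal over the same bins, and it uses binomial thinning so that a chosen fraction of reads comes from the tumor. The function names and the `tumor_fraction` parameter are hypothetical, and a read fraction is not the same thing as tumor purity (depth and ploidy both matter).

```python
import random


def thin(count, fraction, rng):
    """Binomial thinning: keep each of `count` reads independently
    with probability `fraction`."""
    return sum(1 for _ in range(count) if rng.random() < fraction)


def mix_counts(tumor_counts, normal_counts, tumor_fraction, seed=0):
    """Simulate a mixed sample from per-bin tumor and normal read counts.

    `tumor_fraction` is the fraction of reads drawn from the tumor sample;
    the remainder is drawn from the normal. Both inputs must cover the
    same bins in the same order.
    """
    if len(tumor_counts) != len(normal_counts):
        raise ValueError("tumor and normal must have the same number of bins")
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the mixture is reproducible
    return [
        thin(t, tumor_fraction, rng) + thin(n, 1.0 - tumor_fraction, rng)
        for t, n in zip(tumor_counts, normal_counts)
    ]


if __name__ == "__main__":
    tumor = [120, 240, 60, 130]    # toy counts, not real data
    normal = [100, 110, 95, 105]
    for frac in (0.0, 0.5, 1.0):
        print(frac, mix_counts(tumor, normal, frac))
```

The pure-Python loop is only practical for toy inputs; at real coverage a vectorized binomial sampler would be the natural replacement. The same thinning could be applied to ref/alt allelic counts to produce mixed allelic-counts files.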

Build automatic evaluation of somatic CNV pipeline and establish best practices. #4122