broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Build automatic evaluation of somatic CNV pipeline and establish best practices. #4122

Open samuelklee opened 6 years ago

samuelklee commented 6 years ago

@LeeTL1220 has already made substantial progress on this front, but I think we can improve ground truth sets and expand the number and type of evaluation metrics calculated. To start, this will include:

We should decide on a small set of tools to compare against as well, including GATK CNV/ACNV.

We should look for other gold-standard somatic callsets and build the evaluation infrastructure so that these can be easily added as sources of GT.

We should also evaluate performance on germline and compare with gCNV (see #4123).

@MartonKN will need this evaluation (and may need to help design some aspects of it) for #4115.

See also #2881.

LeeTL1220 commented 6 years ago

@samuelklee You have a "see #4122" above, but that is this issue. Does the citation above need to be fixed?

samuelklee commented 6 years ago

Oops, thanks, fixed.

samuelklee commented 5 years ago

Evaluation of THCA/STAD/LUAD TCGA WGS/WES CR concordance with SNP arrays was implemented on FC last summer and showed good performance. For WES, performance was comparable to GATK CNV and highly improved over CODEX, with minimal parameter tuning. WGS comparisons were unavailable due to limitations of the competing tools.

This evaluation will be expanded to include CR/MAF concordance against PanCanAtlas ABSOLUTE results. Some curation of the samples could be performed; batch effects were observed in some LC WGS LUAD samples. Comparisons to other tools will probably be removed for ease of maintenance. The evaluation will be adapted to fit into whatever framework arises from #4630; the same goes for the HCC1143 and CRSP validations.
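For readers unfamiliar with this kind of comparison, here is a minimal sketch of one way a CR concordance score between a callset and an array-derived truth set could be computed. It is not the metric used in the evaluation described above: the segment representation, the `concordance` function, and the `tolerance` threshold are all hypothetical choices made for illustration.

```python
import math
from typing import Iterator, List, Tuple

# Hypothetical representation: (start, end, log2 copy ratio) on one contig,
# half-open intervals, sorted by start and non-overlapping within a list.
Segment = Tuple[int, int, float]


def overlap_pieces(a: List[Segment], b: List[Segment]) -> Iterator[Tuple[int, float, float]]:
    """Yield (length, cr_a, cr_b) for each stretch covered by both segment lists."""
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            yield end - start, a[i][2], b[j][2]
        # Advance whichever segment ends first.
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1


def concordance(calls: List[Segment], truth: List[Segment], tolerance: float = 0.1) -> float:
    """Fraction of jointly covered bases whose log2 CR values differ by at most `tolerance`."""
    total = 0
    agree = 0
    for length, cr_call, cr_truth in overlap_pieces(calls, truth):
        total += length
        if abs(cr_call - cr_truth) <= tolerance:
            agree += length
    return agree / total if total else math.nan


# Toy data, invented for this example only.
calls = [(0, 100, 0.0), (100, 200, 0.58)]
truth = [(0, 150, 0.0), (150, 200, 0.6)]
score = concordance(calls, truth)
print(f"length-weighted concordance: {score:.2f}")
```

Weighting by overlap length keeps a handful of short, noisy segments from dominating the score; a real evaluation would also need to handle multiple contigs and regions covered by only one of the two sources.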

samuelklee commented 5 years ago

Note that no segmentation or resolution parameters have been tuned yet for either performance or runtime. Some of these are very easy wins. For example, kernel-approximation-dimension is set to a default of 100, and the time for segmentation scales roughly linearly with this (the documentation erroneously states that the scaling is quadratic; this should be fixed, my bad). In practice, setting this to as little as 2 seems to work OK for some cases, so we should evaluate this more rigorously. Doing so can cut WGS segmentation down from ~10 minutes (out of the total ~60 minutes for 250bp bins, typically) to ~1 minute.
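As a rough back-of-the-envelope check on the numbers above, the sketch below fits a straight line with a fixed overhead through the two quoted points (~10 minutes at dimension 100, ~1 minute at dimension 2) and reports the implied total runtime. The affine model and the function names are assumptions for illustration; intermediate values are interpolations, not measurements.

```python
# Assumed model: segmentation time is affine in kernel-approximation-dimension,
#   t(d) = overhead + slope * d
# fitted to the two approximate figures quoted in the comment above.

def fit_affine(d1: float, t1: float, d2: float, t2: float) -> tuple:
    """Return (overhead, slope) of the line through (d1, t1) and (d2, t2)."""
    slope = (t1 - t2) / (d1 - d2)
    overhead = t1 - slope * d1
    return overhead, slope


def segmentation_minutes(d: float, overhead: float, slope: float) -> float:
    """Estimated segmentation time in minutes at dimension d."""
    return overhead + slope * d


TOTAL_MINUTES = 60.0  # quoted typical total for WGS at 250bp bins
overhead, slope = fit_affine(100, 10.0, 2, 1.0)

# Everything in the pipeline other than segmentation, assumed unaffected.
other_minutes = TOTAL_MINUTES - segmentation_minutes(100, overhead, slope)

for d in (100, 50, 10, 2):
    seg = segmentation_minutes(d, overhead, slope)
    print(f"dimension={d:>3}: segmentation ~{seg:.1f} min, total ~{other_minutes + seg:.1f} min")
```

Under these assumptions the total drops from roughly 60 to roughly 51 minutes, so the saving is real but modest relative to the rest of the pipeline; whether segmentation quality holds up at low dimensions is the part that needs rigorous evaluation.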

samuelklee commented 5 years ago

Revamping the existing somatic validation pipeline needs to be done before development of the TH prototype can continue.