broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.64k stars 580 forks source link

Refactor het genotyping code in ModelSegments. #3915

Open samuelklee opened 6 years ago

samuelklee commented 6 years ago

@davidbenjamin We might be able to share some code with contamination calculation, etc. Tangentially related, we also should unify the pileup-based tools at some point. Low priority, we can discuss after release.

davidbenjamin commented 6 years ago

@samuelklee These are good goals.

samuelklee commented 6 years ago

Making a note of #4001 here.

samuelklee commented 6 years ago

Perhaps add a mode to ModelSegments that takes in a VCF of genotyped hets for the case (rather than allelic counts for the case and optionally for the matched normal). This would remove the responsibility for genotyping hets to an external tool, which could be better suited for handling cases like high purity LOH with no matched normal (in which case the naive genotyping done by ModelSegments is inappropriate). This is somewhat related to @sooheelee's concerns in https://github.com/broadinstitute/gatk/issues/4717.

samuelklee commented 5 years ago

After looking more at the data from an hg38 NovaSeq run by @kcibul, I think a better strategy would be to use the normal allelic counts as a prior on whether a site is homozygous, rather than hard filtering on these sites (and pulling down corresponding counts from the tumor---this strategy was held over from GetHetCoverage/AllelicCapSeg). The main reason is that the normal will typically be sequenced at lower coverage (~30x), so this strategy will cause us to miss obvious hets in the tumor (~80x).

This is now relevant for two reasons: 1) it seems that we will want to run the filter with more stringent parameters, as higher base error rates are causing homs to leak past the filter, which in turn affects the fit of the allele-fraction model (which only attempts to model hets) by biasing normal segments towards unbalanced, and 2) we now want to run ModelSegments separately on the normal to allow for the filtering of germline events. So we want to be more stringent with low-coverage normals without affecting our high-coverage tumors.

For example, here's some hg38 NovaSeq FFPE WGS data from a ~40x normal:

download

Compare to an hg19 TCGA WGS ~40x normal:

download 1

The hom-ref tail in the first plot is much fatter and clearly leaks into the het cloud. Also curious is that the het cloud is far less binomial (or even beta-binomial---note also the absence of the tail extending to the origin).

I am still not sure why the incoming data looks different. There are several confounding factors: NovaSeq vs. HiSeq, hg38 vs. hg19, AF > 2% gnomAD sites vs. AF > 10% 1000G sites, FFPE vs. frozen, etc. I have not seen enough examples/combinations to be able to say which are the most important factors. Changing the genotyping/filtering strategy can get around this change in the data without a corresponding change in the allele-fraction model for now, but getting the data to look as good as possible upstream would be even better.

Another thought: would be nice if the strategy was easily compatible with an eventual implementation of multi-sample segmentation, which would require that the same sites are used in both the tumor and the normal. We would want to strike a balance between maximizing the number of sites and including questionable sites from the normal.

Will add more details later. @davidbenjamin @LeeTL1220 @eitanbanks @sooheelee may be interested.

LeeTL1220 commented 5 years ago

@samuelklee Are the bins in the hist2D logarithmic? Could you post an updated plot with a colorbar?

FYI... @yfarjoun

samuelklee commented 5 years ago

Sure, here you go:

kristian tcga

samuelklee commented 4 years ago

@bhanugandham @fleharty this issue touches upon our discussion of https://gatkforums.broadinstitute.org/gatk/discussion/24335/loh-detection-using-gatk4s-somatic-cnv-workflow. We might consider just a simple modification of the genotyping step (e.g., keeping all ROHs longer than a hard threshold) to start, which would probably cover the most common use cases with minimal effort. Can use 100% HCC1143 in tumor-only mode as an initial test, but it would be good to collect other examples.