broadinstitute / gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
BSD 3-Clause "New" or "Revised" License

Poisson regression is not robust to outliers and leads to wrong inferences in TargetCoverageSexGenotyper #1097

mbabadi closed 7 years ago

mbabadi commented 7 years ago

Devin McCabe discovered a bug (read: bad model behavior) in TargetCoverageSexGenotyper. It surfaced when the tool was fed coverage data on autosomes and the X chromosome only (no Y chromosome). Since the X chromosome has twice the ploidy in XX samples as in XY samples, one expects the tool to still make the correct inference from X coverage alone. However, the tool genotyped all samples as XX (see the attached figure; left: autosome+X+Y, right: autosome+X).

[Figure: genotyping results with Y coverage (left) and without Y coverage (right)]

A naive calculation of the relative X ploidy, i.e. calculating X_pcov = (X_total_read_count / autosome_total_read_count) for all samples, performing a 2-means clustering, and dividing X_pcov by the mean of the lower-ploidy cluster, reveals that the X contig does indeed have twice the coverage on average in XX samples:

[Figure: relative X ploidy per sample]
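The naive calculation above can be sketched in a few lines. This is only an illustration, not the code used for the figure: the function names, the min/max initialisation of the 2-means step, and the per-sample total read counts passed in are all assumptions.

```python
from statistics import mean


def two_means_1d(values, iterations=100):
    """Plain 1-D 2-means (Lloyd's algorithm), initialised at the min and max."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        low_cluster = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high_cluster = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high_cluster:  # all values identical, nothing to split
            break
        new_lo, new_hi = mean(low_cluster), mean(high_cluster)
        if (new_lo, new_hi) == (lo, hi):  # converged
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def relative_x_ploidy(x_total_counts, autosome_total_counts):
    """Per-sample X_pcov divided by the mean of the lower-ploidy cluster.

    Both arguments are hypothetical per-sample total read counts.
    """
    x_pcov = [x / a for x, a in zip(x_total_counts, autosome_total_counts)]
    low_mean, _ = two_means_1d(x_pcov)
    return [p / low_mean for p in x_pcov]
```

With mixed XY and XX samples, the returned values should cluster near 1 and 2 respectively. If the cohort contains only one sex, the 2-means step has nothing meaningful to separate, so the normalisation is not reliable in that case.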

Further investigation shows that the wrong behavior of TargetCoverageSexGenotyper stems from the lack of robustness of Poisson regression to outliers: there are a number of targets in the X contig with anomalously high coverage (200x median!). In the absence of Y coverage data (and bias adjustment), higher ploidy genotypes are always favored (in this case, XX).
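A toy calculation shows how a single such target can flip the inference. The model here is an assumption made for illustration (each target count is Poisson with mean ploidy × depth per copy); it is not the likelihood actually implemented in TargetCoverageSexGenotyper, and all names are hypothetical.

```python
import math


def poisson_log_likelihood(counts, ploidy, depth_per_copy):
    """Sum of Poisson log-pmfs with mean ploidy * depth_per_copy (toy model)."""
    lam = ploidy * depth_per_copy
    return sum(n * math.log(lam) - lam - math.lgamma(n + 1) for n in counts)


def best_ploidy(counts, depth_per_copy, candidates=(1, 2)):
    """Candidate ploidy with the highest toy log likelihood."""
    return max(
        candidates,
        key=lambda p: poisson_log_likelihood(counts, p, depth_per_copy),
    )


# 100 well-behaved X targets from a ploidy-1 (XY) sample at 50 reads per copy.
clean = [50] * 100
# The same targets plus one target at 200x the median coverage.
with_outlier = clean + [50 * 200]

print(best_ploidy(clean, 50.0))         # prints 1
print(best_ploidy(with_outlier, 50.0))  # prints 2
```

Each well-behaved target penalises the ploidy-2 hypothesis by about 15 log units, while the single outlier rewards it by several thousand, so one anomalous target outweighs a hundred normal ones.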

Solution: either filter read counts for outliers before calculating Poisson log likelihoods, or simply use the naive median-based ploidy estimates and perform genotyping on the estimated ploidies (rather than target-resolved read counts). The latter is proven to be robust to outliers.
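Both options can be sketched briefly. The threshold `max_fold`, the function names, and the assumption that X and autosomal targets capture with comparable efficiency are all hypothetical choices for illustration, not part of the tool.

```python
from statistics import median


def filter_outlier_targets(counts, max_fold=10.0):
    """Option 1: drop targets whose count exceeds max_fold x the median."""
    med = median(counts)
    return [n for n in counts if n <= max_fold * med]


def median_x_ploidy(x_target_counts, autosome_target_counts, autosome_ploidy=2):
    """Option 2: ploidy estimate from per-target medians instead of raw counts."""
    return autosome_ploidy * median(x_target_counts) / median(autosome_target_counts)


def genotype_from_ploidy(x_ploidy):
    """Call XX if the estimate is closer to 2 than to 1, otherwise XY."""
    return "XX" if abs(x_ploidy - 2) < abs(x_ploidy - 1) else "XY"
```

For example, an XY sample with 100 X targets at 50 reads, one X target at 10000 reads, and autosomal targets at 100 reads gives a median-based estimate of 1.0 and is called XY, because the median ignores the single extreme target.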

Update: it turns out that the issue can be fixed by simply taking into account bait count as a multiplicative bias. Otherwise, the distribution of raw read counts is multimodal and far from Poisson:

[Figure: distribution of raw read counts]

Correcting for bait count yields a neat over-dispersed Poisson:

[Figure: distribution of read counts after bait count correction]
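One way to read "bait count as a multiplicative bias" is to scale each target's Poisson mean by its number of baits. The sketch below assumes exactly that form (mean = ploidy × depth per copy per bait × bait count); the names and the parameterisation are hypothetical and may differ from the fix that was eventually implemented.

```python
import math


def poisson_log_likelihood_with_baits(counts, bait_counts, ploidy, depth_per_copy_per_bait):
    """Poisson log likelihood with the bait count as a per-target multiplicative bias."""
    total = 0.0
    for n, b in zip(counts, bait_counts):
        lam = ploidy * depth_per_copy_per_bait * b
        total += n * math.log(lam) - lam - math.lgamma(n + 1)
    return total


def bait_normalized(counts, bait_counts):
    """Read counts per bait; the raw counts mix several modes, these should not."""
    return [n / b for n, b in zip(counts, bait_counts)]
```

As a small example, targets with 1, 2, and 4 baits and raw counts of 50, 100, and 200 look multimodal, but dividing by bait count collapses them to 50 reads per bait. Ignoring the bait counts in that example makes the 2- and 4-bait targets look like high-ploidy evidence, which is the same failure mode as the outlier targets.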

Todo:

droazen commented 7 years ago

Issue moved to broadinstitute/gatk #3015 via ZenHub