UW-GAC / analysis_pipeline

TOPMed analysis pipeline

Modify thinPoints to handle cases when extreme p-values exist #27

Open amstilp opened 4 years ago

amstilp commented 4 years ago

When there are variants with very extreme p-values, variants with more modest but still significant p-values can get thinned. I saw this in an actual analysis, and here's a reproducible example showing what happens:

library(dplyr)
library(TopmedPipeline)

set.seed(123)
# Random p-values representing the majority of variants.
dat <- data.frame(
  pval = runif(10000)
) %>%
  # A "hit" region with modest p-values.
  bind_rows(
    data.frame(
      pval = 10^(-1 * runif(5, min = 6, max = 8))
    )
  ) %>%
  # A really strong hit, possibly an artifact but not necessarily.
  bind_rows(
    data.frame(
      pval = 10^(-1 * runif(5, min = 100, max = 120))
    )
  ) %>%
  mutate(
    obs = pval
  ) %>%
  arrange(obs) %>%
  mutate(
    x = row_number(),
    exp = x / n(),
    upper = qbeta(0.025, x, rev(x)),
    lower = qbeta(0.975, x, rev(x)),
    logp = -log10(pval)
  )

# Thin points using standard function.
thinned <- thinPoints(dat, "logp", n = 100, nbins = 10)
dat$selected <- dat$x %in% thinned$x
dat$bonferroni <- dat$pval < 0.05 / nrow(dat)

# Look at the set of variants that were not selected by thinPoints but are bonferroni-significant.
# It's the hit region with modest p-values.
dat %>% as_tibble() %>% filter(!selected & bonferroni)
#> # A tibble: 5 x 9
#>         pval        obs     x      exp   upper   lower  logp selected bonferroni
#>        <dbl>      <dbl> <int>    <dbl>   <dbl>   <dbl> <dbl> <lgl>    <lgl>     
#> 1    1.82e-8    1.82e-8     6 0.000599 2.20e-4 0.00117  7.74 FALSE    TRUE      
#> 2    2.20e-7    2.20e-7     7 0.000699 2.81e-4 0.00130  6.66 FALSE    TRUE      
#> 3    2.24e-7    2.24e-7     8 0.000799 3.45e-4 0.00144  6.65 FALSE    TRUE      
#> 4    2.39e-7    2.39e-7     9 0.000899 4.11e-4 0.00157  6.62 FALSE    TRUE      
#> 5    5.61e-7    5.61e-7    10 0.000999 4.79e-4 0.00171  6.25 FALSE    TRUE

These variants are important, but they aren't selected because they fall into a bin with a lot of other variants with less significant p-values.

One suggestion for how to fix this is to modify thinPoints to keep all variants with p < p_threshold, and then apply the thinning only to those with p > p_threshold. I'm not sure what value p_threshold should be, though -- 5e-8? 5e-9? Bonferroni significance?
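As a rough sketch of that idea applied to the example data above (p_threshold = 1e-4 is just a placeholder value here, not a proposal):

```r
# Sketch of the suggested fix: keep everything below p_threshold,
# and only thin the remaining (less significant) points.
p_threshold <- 1e-4  # placeholder value for illustration
thinned2 <- bind_rows(
  filter(dat, pval < p_threshold),
  thinPoints(filter(dat, pval >= p_threshold), "logp", n = 100, nbins = 10)
)
# The modest hit region should now be retained:
all(dat$x[dat$bonferroni] %in% thinned2$x)
```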

mconomos commented 4 years ago

I like the idea of keeping all variants with p < p_threshold, but I think the threshold would need to be bigger than 5e-8 to avoid the weird "gaps" in the plots. I think we could actually use a value like 1e-4 and be OK. Under the null, the p-values are uniformly distributed on (0,1), so even with 40,000,000 variants we'd only expect about 4,000 to have p < 1e-4. Of course, it will be more if there is actual signal, so maybe we shouldn't go quite that high, but I don't think it should be genome-wide significance level.
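As a quick check of that expectation (uniform null p-values), the expected count below a threshold is just n_variants * threshold:

```r
# Expected number of p-values below the threshold under a uniform null.
n_variants <- 40e6
threshold <- 1e-4
n_variants * threshold
#> [1] 4000
```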

Another option would just be to increase thin_nbins from 10 to something like 50. I've seen that work OK in other analyses, but your suggestion may be the better solution.

amstilp commented 4 years ago

I remember thinking about this for OLGA analyses. It looks like we used p = 1e-2 as a threshold there, which seems large to me. We could also think about a dynamic threshold, something like 0.05 / n_variants * factor, and choose the default factor such that the threshold is ~1e-4 for a genome-wide (single variant?) analysis. A dynamic threshold might work better when running a single chromosome, for example.

mconomos commented 4 years ago

A dynamic approach seems reasonable to me. Based on your formula, if we assume 5e-8 comes from 1,000,000 variants, then the factor is 2000. Or we could just say 100/n_variants.

- With 1,000,000 variants the threshold would be 1e-4
- With 10,000,000 variants it would be 1e-5
- With 100,000,000 variants (I think this is roughly TOPMed scale) it would be 1e-6

Does this seem reasonable? Probably? This would always correspond to an expectation of 100 variants below the threshold under the null hypothesis. Of course there will be more variants below it when there are real signals, but that makes sense. We should probably test this on a couple of examples.
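A small sketch of the dynamic threshold described in the last two comments (function name is just for illustration):

```r
# Dynamic threshold: 0.05 / n_variants * 2000, equivalent to 100 / n_variants.
# Under a uniform null, about 100 variants are expected below this threshold
# regardless of analysis size.
dynamic_threshold <- function(n_variants, factor = 2000) {
  0.05 / n_variants * factor
}
sapply(c(1e6, 1e7, 1e8), dynamic_threshold)
#> [1] 1e-04 1e-05 1e-06
```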

DanielEWeeks commented 2 years ago

We encountered this thinPoints issue too. Have any steps been taken to resolve it?

Instead of a dynamic approach, I'd favor a controllable one, where we can set our own -log10(p) threshold in the config file, above which no thinning of the most significant results is done at all. That is, I want to be sure that the resulting plot shows all signals above the chosen threshold.
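For example, a user-controllable floor could look something like this (the wrapper name, its arguments, and the idea of a config parameter such as thin_logp_threshold are hypothetical, not part of the current pipeline):

```r
# Sketch: never thin points at or above a user-chosen -log10(p) threshold,
# e.g. one read from the config file; only the points below it are thinned.
thinPointsWithFloor <- function(dat, value, n, nbins,
                                logp_threshold = 7.3) {  # 7.3 ~ -log10(5e-8)
  keep <- dat[dat[[value]] >= logp_threshold, ]
  rest <- dat[dat[[value]] < logp_threshold, ]
  rbind(keep, thinPoints(rest, value, n = n, nbins = nbins))
}
```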