StoreyLab / qvalue

R package to estimate q-values and false discovery rate quantities.
109 stars 36 forks source link

Dealing with the error messages of "missing or infinite values" or "missing values and NaN's" from calling qvalue() or pi0est() #38

Open audreyqyfu opened 1 year ago

audreyqyfu commented 1 year ago

Like other users, we came across these errors with our data and have done some debugging (qvalue version 2.32.0; R version 4.3.1). I think that these errors are due to unexpected behavior of findInterval() and tabulate(), the two functions used in pi0est(), which is in turn called by the qvalue() function.

Here is the input p-value vector we used:

> tmp.p
  [1] 0.000000e+00 0.000000e+00 9.115477e-01 0.000000e+00 0.000000e+00
  [6] 3.336338e-01 0.000000e+00 2.726045e-08 3.108624e-15 0.000000e+00
 [11] 0.000000e+00 0.000000e+00 0.000000e+00 1.357958e-11 0.000000e+00
 [16] 3.856843e-03 0.000000e+00 2.925816e-04 0.000000e+00 6.495249e-12
 [21] 3.713274e-11 4.321365e-11 3.597123e-14 3.204471e-03 0.000000e+00
 [26] 9.569107e-03 4.440892e-15 2.224478e-02 0.000000e+00 0.000000e+00
 [31] 1.339865e-03 0.000000e+00 5.534483e-05 0.000000e+00 0.000000e+00
 [36] 9.760413e-05 5.216207e-06 0.000000e+00 0.000000e+00 4.793859e-01
 [41] 0.000000e+00 0.000000e+00 2.014265e-05 0.000000e+00 0.000000e+00
 [46] 1.775049e-09 2.462234e-03 0.000000e+00 0.000000e+00 6.318696e-01
 [51] 0.000000e+00 8.194909e-03 0.000000e+00 0.000000e+00 0.000000e+00
 [56] 5.049079e-10 0.000000e+00 5.062230e-01 5.554316e-05 0.000000e+00
 [61] 7.600417e-05 1.635782e-02 0.000000e+00 1.093950e-01 1.281121e-04
 [66] 0.000000e+00 6.194023e-01 0.000000e+00 1.001962e-06 6.646150e-11
 [71] 9.172546e-01 2.849969e-01 3.437511e-01 3.351566e-01 6.483649e-01
 [76] 1.255449e-01 7.011413e-01 1.962199e-01 5.366918e-02 5.416075e-01
 [81] 7.638786e-01 3.642310e-01 1.424749e-02 5.753318e-02 3.731248e-01
 [86] 9.191109e-01 9.034551e-01 6.005921e-01 2.480674e-02 4.445593e-01
 [91] 6.079983e-01 2.083319e-01 4.481868e-01 5.910949e-01 2.630329e-01
 [96] 2.480999e-01 8.275802e-01 3.632738e-01 9.944623e-02 8.778394e-01
[101] 4.098542e-01

> summary (tmp.p)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000000 0.0000000 0.0000555 0.1704693 0.3336338 0.9191109 

> sum (tmp.p < 0.05)
[1] 65
> qvalue (tmp.p)
Error in smooth.spline(lambda, pi0, df = smooth.df) : 
  missing or infinite values in inputs are not allowed
> ?qvalue
> qvalue (tmp.p, pi0.method = "bootstrap")
Error in quantile.default(pi0, prob = 0.1) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE
> qvalue (tmp.p, fdr.level=0.05, pi0.method="bootstrap", adj=1.2)
Error in quantile.default(pi0, prob = 0.1) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE

The qvalue() calls pi0est() to estimate pi0. The latter uses findInterval() and tabulate() to count the number of p-values in a series of nonoverlapping intervals given by lambda:

> lambda <- seq(0.05, 0.95, 0.05)
> lambda
 [1] 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75
[16] 0.80 0.85 0.90 0.95
> length (lambda)
[1] 19
> findInterval(tmp.p, vec = lambda)
  [1]  0  0 18  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [25]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  9  0  0  0  0  0  0  0  0
 [49]  0 12  0  0  0  0  0  0  0 10  0  0  0  0  0  2  0  0 12  0  0  0 18  5
 [73]  6  6 12  2 14  3  1 10 15  7  0  1  7 18 18 12  0  8 12  4  8 11  5  4
 [97] 16  7  1 17  8
# This output means that the intervals generated by lambda are as follows:
# 0: <0.05; 1: [0.05, 0.1); 2: [0.1, 0.15), ..., 18: [0.90, 0.95), 19: >=0.95.
# Note that the above output contains 65 0s and no 19.
> tabulate(findInterval(tmp.p, vec = lambda))
 [1] 3 2 1 2 2 3 3 3 1 2 1 5 0 1 1 1 1 4
# Note that 65 is not in the output above
> length (tabulate(findInterval(tmp.p, vec = lambda)))
[1] 18

The problem here is that "tabulate(findInterval(tmp.p, vec = lambda))" ignores the intervals at the two ends: <0.05 and >=0.95. It ignores the left interval of <0.05 because the tabulate() function ignores zeros in the input by default (see its documentation), even though there are 65 values in this interval. It ignores the right interval of >=0.95 because tmp.p contains has no value above 0.95 and therefore "findInterval(tmp.p, vec = lambda)" returns no 19.

In fact, in the R package vignette, the 605 hedenfalk p-values that are <0.05 are also ignored in the pi0 estimation. The output from debugging the pi0est() function is as follows:

> debug (pi0est)
> length (hedenfalk$p)
[1] 3170
> summary (hedenfalk$p)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000032 0.0845647 0.2998155 0.3718702 0.6316112 0.9998517 
> qobj <- qvalue(p = hedenfalk$p)

Browse[2]> length (p)
[1] 3170
Browse[2]> summary (p)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000032 0.0845647 0.2998155 0.3718702 0.6316112 0.9998517 
Browse[2]> lambda
 [1] 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
[15] 0.75 0.80 0.85 0.90 0.95
Browse[2]> test.interval <- findInterval(p, vec = lambda)
Browse[2]> table (test.interval)
test.interval
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
605 263 194 190 180 154 128 130 134 120 108 101 113  83 115 118 109 122 
 18  19 
 94 109 
Browse[2]> tabulate(findInterval(p, vec = lambda))
 [1] 263 194 190 180 154 128 130 134 120 108 101 113  83 115 118 109 122
[18]  94 109

It is possible that the leftmost interval (<0.05) may not be necessary for later steps in estimating pi0; this part is not entirely clear to me. But to get around the programming problem that generates the error messages, we can at least change how lambda is specified:

> lambda <- seq(0, max(tmp.p), 0.05)
> cbind (lambda, count = tabulate(findInterval(tmp.p, vec = lambda)))
      lambda count
 [1,]   0.00    65
 [2,]   0.05     3
 [3,]   0.10     2
 [4,]   0.15     1
 [5,]   0.20     2
 [6,]   0.25     2
 [7,]   0.30     3
 [8,]   0.35     3
 [9,]   0.40     3
[10,]   0.45     1
[11,]   0.50     2
[12,]   0.55     1
[13,]   0.60     5
[14,]   0.65     0
[15,]   0.70     1
[16,]   0.75     1
[17,]   0.80     1
[18,]   0.85     1
[19,]   0.90     4

In principle, there is also an interval #0, which is <0 and of course has no count. We can therefore ignore this interval safely.

@ajbass suggested an upper bound of "max(tmp.p)-0.05" in issue #20:

> lambda <- seq(0, max(tmp.p)-0.05, 0.05)
> cbind (lambda, count = tabulate(findInterval(tmp.p, vec = lambda)))
      lambda count
 [1,]   0.00    65
 [2,]   0.05     3
 [3,]   0.10     2
 [4,]   0.15     1
 [5,]   0.20     2
 [6,]   0.25     2
 [7,]   0.30     3
 [8,]   0.35     3
 [9,]   0.40     3
[10,]   0.45     1
[11,]   0.50     2
[12,]   0.55     1
[13,]   0.60     5
[14,]   0.65     0
[15,]   0.70     1
[16,]   0.75     1
[17,]   0.80     1
[18,]   0.85     5

This upper bound results in a wider last interval. We can double check the counts in the last intervals:

> sum (tmp.p>=0.95)
[1] 0
> sum (tmp.p>=0.9)
[1] 4
> sum (tmp.p>=0.85)
[1] 5

These experiments suggest several ways to deal with the error messages, with a higher estimated pi0 corresponding to fewer positives:

> result <- qvalue (tmp.p, fdr.level=0.05, lambda = seq (0, max(tmp.p), 0.05))
> result$pi0
[1] 0.3198683
> sum (result$significant)
[1] 68

> result <- qvalue (tmp.p, fdr.level=0.05, lambda = seq (0, max(tmp.p)-0.05, 0.05))
> result$pi0
[1] 0.2835713
> sum (result$significant)
[1] 69

> result <- qvalue (tmp.p, fdr.level=0.05, lambda = seq (0, max(tmp.p), 0.05), pi0.method = "bootstrap")
> result$pi0
[1] 0.3060306
> sum (result$significant)
[1] 69

> result <- qvalue (tmp.p, fdr.level=0.05, pi0.method="bootstrap", adj=1.2, lambda = seq (0, max(tmp.p), 0.05))
> result$pi0
[1] 0.3060306
> sum (result$significant)
[1] 69

# this option uses a fixed lambda instead of a sequence of intervals
> result <- qvalue (tmp.p, fdr.level=0.05, lambda = 0.5)
> result$pi0
[1] 0.3168317
> sum (result$significant)
[1] 68
> result$lambda
[1] 0.5

We can also rerun the function on the hedenfalk p-values and compare the results:

Using the default lambda:

> qobj <- qvalue(p = hedenfalk$p, fdr.level=0.05)
> qobj$pi0
[1] 0.669926
> sum (qobj$significant)
[1] 162

Using a revised lambda:

> qobj <- qvalue(p = hedenfalk$p, fdr.level=0.05, lambda = seq (0, max(hedenfalk$p), 0.05))
> qobj$pi0
[1] 0.6681177
> sum (qobj$significant)
[1] 162

On this dataset, there is no qualitative difference. Again, perhaps the leftmost interval (<0.05) is not necessary in the estimation of pi0.

jdstorey commented 1 year ago

My concern is that the range of lamba values should be nontrivially less than the largest p-vaue. If the largest p-value is not close to 1, then I become concerned about whether the null hypothesis p-values are indeed Uniform(0,1), which is the assumption of the FDR methodology. If the number of p-values is so small that the largest p-value has a nontrivial probability of being much less than 1, then perhaps the user should set lambda=0 due to a small number of p-values.

The leftmost interval does not matter because when lambda=0, then pi0.est=1.

Are you suggesting a specific change to the code that will make the error avoidable and let the user know what is happening?

audreyqyfu commented 1 year ago

Thanks for the quick response. I agree that using max(p) to set up lambda is more of a quick hack to get rid of the error messages, and indeed carries the risk of violating the assumptions. We do check the histogram of the p-values every now and then, but may not be able to do that for all the scenarios where qvalue is called, especially when qvalue is embedded in a pipeline. Setting lambda=0 could be a safe alternative if the sample size is no more than just a couple hundred.

I've also tried setting nbins in tabulate() so that it includes 0 for the rightmost interval (>=0.95):

Browse[2]> summary (p)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000000 0.0000000 0.0000555 0.1704693 0.3336338 0.9191109 
Browse[2]> cbind (lambda, count = tabulate(findInterval(p, vec = lambda), nbins = length (lambda)))
      lambda count
 [1,]   0.05     3
 [2,]   0.10     2
 [3,]   0.15     1
 [4,]   0.20     2
 [5,]   0.25     2
 [6,]   0.30     3
 [7,]   0.35     3
 [8,]   0.40     3
 [9,]   0.45     1
[10,]   0.50     2
[11,]   0.55     1
[12,]   0.60     5
[13,]   0.65     0
[14,]   0.70     1
[15,]   0.75     1
[16,]   0.80     1
[17,]   0.85     1
[18,]   0.90     4
[19,]   0.95     0
Browse[2]> pi0 <- cumsum(tabulate(findInterval(p, vec = lambda), nbins = length (lambda))[ind])/(length(p) * 
+  (1 - lambda[ind]))
Browse[2]> pi0
 [1] 0.0000000 0.3960396 0.3300330 0.2970297 0.2772277 0.2640264 0.2263083 0.3217822
 [9] 0.3080308 0.3168317 0.3060306 0.3300330 0.3503427 0.3677511 0.3696370 0.3712871
[17] 0.3610949 0.3630363 0.3751954   

This also helps get rid of the error message. Perhaps including the counts in the qvalue output could give the user a bit more idea what the distribution of the input p-values looks like?