audreyqyfu opened this issue 1 year ago
My concern is that the range of lambda values should be nontrivially less than the largest p-value. If the largest p-value is not close to 1, then I become concerned about whether the null hypothesis p-values are indeed Uniform(0,1), which is the assumption of the FDR methodology. If the number of p-values is so small that the largest p-value has a nontrivial probability of being much less than 1, then perhaps the user should set lambda=0.
The leftmost interval does not matter because when lambda=0, then pi0.est=1.
Are you suggesting a specific change to the code that will make the error avoidable and let the user know what is happening?
Thanks for the quick response. I agree that using max(p) to set up lambda is more of a quick hack to get rid of the error messages, and indeed carries the risk of violating the assumptions. We do check the histogram of the p-values every now and then, but may not be able to do that for all the scenarios where qvalue is called, especially when qvalue is embedded in a pipeline. Setting lambda=0 could be a safe alternative if the sample size is no more than just a couple hundred.
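For what it's worth, here is a minimal sketch of that fallback, assuming the standard qvalue() interface (extra arguments such as lambda are forwarded to pi0est()); run_qvalue() and the small_n cutoff are hypothetical names for illustration only:

library(qvalue)

# With a single lambda = 0, pi0 is estimated as 1, so the q-values reduce
# to the Benjamini-Hochberg adjustment (conservative but safe).
run_qvalue <- function(p, small_n = 200) {
  if (length(p) <= small_n) {
    qvalue(p, lambda = 0)  # fallback for small numbers of p-values
  } else {
    qvalue(p)              # default lambda = seq(0.05, 0.95, 0.05)
  }
}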
I've also tried setting nbins in tabulate() so that it reports a count of 0 for the rightmost interval (>=0.95):
Browse[2]> summary (p)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000000 0.0000000 0.0000555 0.1704693 0.3336338 0.9191109
Browse[2]> cbind (lambda, count = tabulate(findInterval(p, vec = lambda), nbins = length (lambda)))
lambda count
[1,] 0.05 3
[2,] 0.10 2
[3,] 0.15 1
[4,] 0.20 2
[5,] 0.25 2
[6,] 0.30 3
[7,] 0.35 3
[8,] 0.40 3
[9,] 0.45 1
[10,] 0.50 2
[11,] 0.55 1
[12,] 0.60 5
[13,] 0.65 0
[14,] 0.70 1
[15,] 0.75 1
[16,] 0.80 1
[17,] 0.85 1
[18,] 0.90 4
[19,] 0.95 0
Browse[2]> pi0 <- cumsum(tabulate(findInterval(p, vec = lambda), nbins = length (lambda))[ind])/(length(p) *
+ (1 - lambda[ind]))
Browse[2]> pi0
[1] 0.0000000 0.3960396 0.3300330 0.2970297 0.2772277 0.2640264 0.2263083 0.3217822
[9] 0.3080308 0.3168317 0.3060306 0.3300330 0.3503427 0.3677511 0.3696370 0.3712871
[17] 0.3610949 0.3630363 0.3751954
This also helps get rid of the error message. Perhaps including these counts in the qvalue output would give the user a better idea of what the distribution of the input p-values looks like?
Hi. I also recently ran into this problem, and did some additional debugging since the issue is still open. As audreyqyfu points out, the error occurs in line 111 of pi0est, because tabulate() returns a short vector (< length(ind)) if there are no values in p that fall in the upper intervals. The extraction [ind] then generates missing values for the absent intervals, which cause cumsum() to return a vector of NA values.
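A minimal reproduction of that mechanism (the toy p-values are my own; lambda and ind are as in pi0est()):

lambda <- seq(0.05, 0.95, 0.05)  # default grid in pi0est()
ind <- length(lambda):1
p <- c(0.01, 0.12, 0.33, 0.40)   # toy values; none reach the upper intervals

counts <- tabulate(findInterval(p, vec = lambda))
length(counts)       # 8, shorter than length(lambda) = 19
counts[ind]          # NA for each absent upper interval
cumsum(counts[ind])  # the leading NAs propagate through every cumulative sum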
I was able to easily generate the original error using vectors of p-values drawn from runif(n, 0, 1), so that they fit the null model for FDR. Not surprisingly, the frequency of the error is higher with smaller n, but the error occurs even with n = 50, which is as far as I went and is big enough to be a plausible use case.
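A sketch of that reproduction (the failure rate will vary by seed and package version; with the default grid, the chance that none of n = 50 uniform p-values exceeds 0.95 is 0.95^50, about 8%):

library(qvalue)
set.seed(1)

# How often does qvalue() fail on purely null (uniform) p-values?
n_fail <- sum(replicate(1000, {
  inherits(try(qvalue(runif(50)), silent = TRUE), "try-error")
}))
n_fail  # nonzero: the error fires even under the null model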
For what it's worth, I think that setting nbins = length(lambda) in tabulate() seems sensible, given that the error here is a direct consequence of tabulate()'s failure to report zero counts at the upper end of the interval range, and not a result of the p-values being drawn from a non-uniform distribution.
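A sketch of that one-argument change, applied to the counting step quoted earlier in this thread (not a patch to the package itself):

lambda <- seq(0.05, 0.95, 0.05)
ind <- length(lambda):1
p <- c(0.01, 0.12, 0.33, 0.40)  # same toy values as above

# nbins forces tabulate() to pad zero counts for the empty upper intervals
counts <- tabulate(findInterval(p, vec = lambda), nbins = length(lambda))
counts[ind]                                             # zeros instead of NAs
cumsum(counts[ind]) / (length(p) * (1 - lambda[ind]))   # finite pi0 estimates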
Like other users, we came across these errors with our data and have done some debugging (qvalue version 2.32.0; R version 4.3.1). I think that these errors are due to unexpected behavior of findInterval() and tabulate(), the two functions used in pi0est(), which is in turn called by the qvalue() function. Here is the input p-value vector we used:
qvalue() calls pi0est() to estimate pi0. The latter uses findInterval() and tabulate() to count the number of p-values in a series of nonoverlapping intervals given by lambda:
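The relevant lines (lightly paraphrased from the pi0est() source; exact line numbers may differ across versions) look roughly like this:

lambda <- seq(0.05, 0.95, 0.05)  # the default lambda sequence
ind <- length(lambda):1
# count p-values per interval, cumulate from the right, and scale by
# the width of the right tail (1 - lambda)
pi0 <- cumsum(tabulate(findInterval(p, vec = lambda))[ind]) /
  (length(p) * (1 - lambda[ind]))
pi0 <- rev(pi0)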
The problem here is that tabulate(findInterval(tmp.p, vec = lambda)) ignores the intervals at the two ends: <0.05 and >=0.95. It ignores the left interval (<0.05) because tabulate() ignores zeros in its input by default (see its documentation), even though there are 65 values in this interval. It ignores the right interval (>=0.95) because tmp.p has no value above 0.95, and therefore findInterval(tmp.p, vec = lambda) never returns 19. In fact, in the R package vignette, the 605 hedenfalk p-values that are <0.05 are also ignored in the pi0 estimation (this can be seen by debugging the pi0est() function).
It is possible that the leftmost interval (<0.05) may not be necessary for later steps in estimating pi0; this part is not entirely clear to me. But to get around the programming problem that generates the error messages, we can at least change how lambda is specified:
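One possible respecification (my own sketch, consistent with the next point: start lambda at 0 so that no real p-value can land in interval 0):

p <- runif(100)               # stand-in for the real p-value vector
lambda <- seq(0, 0.95, 0.05)  # 20 breakpoints: 0 is now the leftmost
# p-values in [0, 0.05) now fall into interval 1 and get counted;
# only p < 0 would map to interval 0, which cannot occur
tabulate(findInterval(p, vec = lambda), nbins = length(lambda))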
In principle, there is also an interval #0, which is <0 and of course has no count. We can therefore ignore this interval safely.
@ajbass suggested an upper bound of max(tmp.p) - 0.05 in issue #20. This upper bound results in a wider last interval. We can double-check the counts in the last intervals:
These experiments suggest several ways to deal with the error messages: starting lambda at 0 so the leftmost interval is counted, passing nbins = length(lambda) to tabulate(), capping lambda at max(p) - 0.05, or setting lambda = 0 outright for small numbers of p-values. A higher estimated pi0 corresponds to fewer positives.
We can also rerun the function on the hedenfalk p-values and compare the results, using both the default lambda and a revised lambda:
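A sketch of that comparison (the revised lambda here is the seq(0, 0.95, 0.05) variant from above; the hedenfalk data ship with the qvalue package):

library(qvalue)
data(hedenfalk)
p <- hedenfalk$p

pi0est(p)$pi0                               # default lambda = seq(0.05, 0.95, 0.05)
pi0est(p, lambda = seq(0, 0.95, 0.05))$pi0  # revised lambda that starts at 0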
On this dataset, there is no qualitative difference. Again, perhaps the leftmost interval (<0.05) is not necessary in the estimation of pi0.