audreyqyfu opened 1 year ago
My concern is that the range of lambda values should be nontrivially less than the largest p-value. If the largest p-value is not close to 1, then I become concerned about whether the null-hypothesis p-values are indeed Uniform(0,1), which is the assumption of the FDR methodology. If the number of p-values is so small that the largest p-value has a nontrivial probability of being much less than 1, then perhaps the user should set lambda=0 because of the small number of p-values.
The leftmost interval does not matter because when lambda=0, then pi0.est=1.
Are you suggesting a specific change to the code that will make the error avoidable and let the user know what is happening?
Thanks for the quick response. I agree that using max(p) to set up lambda is more of a quick hack to get rid of the error messages, and indeed carries the risk of violating the assumptions. We do check the histogram of the p-values every now and then, but may not be able to do that in every scenario where qvalue is called, especially when qvalue is embedded in a pipeline. Setting lambda=0 could be a safe alternative when the sample size is no more than a couple hundred.
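As a sanity check on the lambda=0 alternative, here is a minimal base-R sketch (toy p-values and my own code, not the package source) showing that a single lambda of 0 forces Storey's estimator to pi0 = 1, i.e. plain Benjamini-Hochberg behavior:

```r
# Storey's pi0 estimator at a single lambda (my sketch, not the
# package source): mean(p >= lambda) / (1 - lambda).
pi0_single <- function(p, lambda) mean(p >= lambda) / (1 - lambda)

set.seed(1)
p <- runif(200)            # toy p-values, "a couple hundred"
pi0_single(p, lambda = 0)  # always exactly 1, since every p >= 0
```

With pi0 fixed at 1, the resulting q-values are conservative but do not depend on the shape of the p-value histogram near 1.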
I've also tried setting nbins in tabulate() so that it includes 0 for the rightmost interval (>=0.95):
Browse[2]> summary (p)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000000 0.0000000 0.0000555 0.1704693 0.3336338 0.9191109
Browse[2]> cbind (lambda, count = tabulate(findInterval(p, vec = lambda), nbins = length (lambda)))
lambda count
[1,] 0.05 3
[2,] 0.10 2
[3,] 0.15 1
[4,] 0.20 2
[5,] 0.25 2
[6,] 0.30 3
[7,] 0.35 3
[8,] 0.40 3
[9,] 0.45 1
[10,] 0.50 2
[11,] 0.55 1
[12,] 0.60 5
[13,] 0.65 0
[14,] 0.70 1
[15,] 0.75 1
[16,] 0.80 1
[17,] 0.85 1
[18,] 0.90 4
[19,] 0.95 0
Browse[2]> pi0 <- cumsum(tabulate(findInterval(p, vec = lambda), nbins = length (lambda))[ind])/(length(p) *
+ (1 - lambda[ind]))
Browse[2]> pi0
[1] 0.0000000 0.3960396 0.3300330 0.2970297 0.2772277 0.2640264 0.2263083 0.3217822
[9] 0.3080308 0.3168317 0.3060306 0.3300330 0.3503427 0.3677511 0.3696370 0.3712871
[17] 0.3610949 0.3630363 0.3751954
This also helps get rid of the error message. Perhaps including the counts in the qvalue output could give the user a bit more idea what the distribution of the input p-values looks like?
Like other users, we came across these errors with our data and have done some debugging (qvalue version 2.32.0; R version 4.3.1). I think that these errors are due to unexpected behavior of findInterval() and tabulate(), the two functions used in pi0est(), which is in turn called by the qvalue() function.

Here is the input p-value vector we used:
qvalue() calls pi0est() to estimate pi0. The latter uses findInterval() and tabulate() to count the number of p-values in a series of nonoverlapping intervals defined by lambda:
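Since the counting snippet is not reproduced here, it can be sketched in base R as follows (toy p-values; a simplified reading of the package code, not its exact source):

```r
# Simplified sketch of the counting step inside pi0est()
# (assumption: this mirrors, but is not, the package's exact code).
lambda <- seq(0.05, 0.95, 0.05)        # 19 cutpoints
p <- c(0.01, 0.03, 0.20, 0.55, 0.93)   # toy p-values
bins <- findInterval(p, vec = lambda)  # 0 means p < 0.05
bins                                   # 0 0 4 11 18
counts <- tabulate(bins)               # bin 0 is silently dropped
length(counts)                         # 18, not 19: a length mismatch downstream
```

Because no toy p-value reaches 0.95, bin 19 is never produced, and the two p-values below 0.05 land in bin 0, which tabulate() does not count.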
The problem here is that tabulate(findInterval(tmp.p, vec = lambda)) ignores the intervals at the two ends: <0.05 and >=0.95. It ignores the left interval (<0.05) because tabulate() ignores zeros in its input by default (see its documentation), even though there are 65 values in this interval. It ignores the right interval (>=0.95) because tmp.p has no value above 0.95, so findInterval(tmp.p, vec = lambda) never returns 19. In fact, in the R package vignette, the 605 hedenfalk p-values that are <0.05 are also ignored in the pi0 estimation. The output from debugging the pi0est() function is as follows:
It is possible that the leftmost interval (<0.05) may not be necessary for later steps in estimating pi0; this part is not entirely clear to me. But to get around the programming problem that generates the error messages, we can at least change how lambda is specified:
In principle, there is also an interval #0, which is <0 and of course has no count, so we can safely ignore it.
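One possible revision of lambda (my sketch, not the package's actual patch) is to start the grid at 0, so the leftmost data interval [0, 0.05) gets bin index 1, which tabulate() does count, while bin 0 now corresponds to the impossible p < 0:

```r
# Sketch of a revised lambda (an assumption, not the package's code):
# starting the grid at 0 moves [0, 0.05) to bin 1, so tabulate() no
# longer drops it; bin 0 would mean p < 0, which cannot occur.
lambda2 <- seq(0, 0.95, 0.05)            # 20 cutpoints
p <- c(0.01, 0.03, 0.20, 0.55, 0.93)     # toy p-values
bins2 <- findInterval(p, vec = lambda2)  # now all >= 1
counts2 <- tabulate(bins2, nbins = length(lambda2))
counts2[1]                               # 2: the two p-values below 0.05
```

Setting nbins = length(lambda2) also guarantees the count vector has one entry per interval even when the last interval is empty.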
@ajbass suggested an upper bound of max(tmp.p) - 0.05 in issue #20:

This upper bound results in a wider last interval. We can double-check the counts in the last intervals:
These experiments suggest several ways to deal with the error messages, with a higher estimated pi0 corresponding to fewer positives:
We can also rerun the function on the hedenfalk p-values and compare the results:
Using the default lambda:
Using a revised lambda:
On this dataset, there is no qualitative difference. Again, perhaps the leftmost interval (<0.05) is not necessary in the estimation of pi0.