Lakens / TOSTER

TOST equivalence test and power functions

Unclear why `power_twoprop()` gives different sample size than seems to be required #78

Closed: wviechtb closed this issue 5 months ago

wviechtb commented 1 year ago

I was interested in using power_twoprop() for a sample size calculation for an equivalence test. Say the true proportions are .28 and .33, my equivalence margin is .10, and I want 80% power with alpha = .05. With:

TOSTER::power_twoprop(
   p1    = .28,
   p2    = .33,
   null  = .10,
   alpha = .05,
   power = .80,
   alternative = "equivalence"
)

the required sample size is calculated to be 1447.962. This did not match the value (n=1048) in Table 2 in Walker and Nowacki (2011) (https://doi.org/10.1007/s11606-010-1513-8), which was calculated based on the commercial PASS software package (which costs a measly $2,395 for a perpetual license -- academic pricing!).

So I did a quick simulation to see how often both bounds of the 90% CI for the difference in proportions fall within (-margin, margin):

# Simulate the power of the TOST / 90% CI-inclusion procedure:
# how often do both bounds of the 90% Wald CI for p1 - p2 fall
# within (-margin, margin)?
simpower <- function(p1, p2, n, margin, alpha=.05) {
   iters <- 10000
   sig   <- rep(NA, iters)
   crit  <- qnorm(1 - alpha)  # critical value for the 100*(1-2*alpha)% CI
   for (i in 1:iters) {
      x1     <- rbinom(n, 1, p1)
      x2     <- rbinom(n, 1, p2)
      pr1    <- mean(x1)
      pr2    <- mean(x2)
      se     <- sqrt(pr1*(1-pr1)/n + pr2*(1-pr2)/n)  # unpooled Wald SE
      ci.lb  <- (pr1-pr2) - crit * se
      ci.ub  <- (pr1-pr2) + crit * se
      sig[i] <- ci.lb > -margin && ci.ub < margin
   }
   round(mean(sig), 2)
}

simpower(p1=.28, p2=.33, n=1448, margin=.10)

This yields a power of around 91%, so quite a bit above the requested 80%. When running this with n=1048:

simpower(p1=.28, p2=.33, n=1048, margin=.10)

then we do indeed get about 80% power. So something is off here.

It seems that power_twoprop() is based on Chow et al. (2008), which is also cited, for example, here:

http://powerandsamplesize.com/Calculators/Compare-2-Proportions/2-Sample-Equivalence

which does give the same results as power_twoprop(). But as far as I can tell, the simulation shows that this is not the power of the TOST procedure. Any insights into what the issue is?
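For reference, here is my reading of the Chow-style approximation (a sketch with a hypothetical helper chow_n(); this is my reconstruction, not TOSTER's actual code), which lands on the same ~1448 figure:

```r
# My reading of the Chow et al. (2008) sample-size approximation for a
# two-proportion equivalence test (a sketch; TOSTER's internals may differ).
chow_n <- function(p1, p2, margin, alpha = .05, power = .80) {
   z_a <- qnorm(1 - alpha)            # one-sided alpha
   z_b <- qnorm(1 - (1 - power) / 2)  # note beta/2, not beta
   num <- (z_a + z_b)^2 * (p1 * (1 - p1) + p2 * (1 - p2))
   den <- (margin - abs(p1 - p2))^2
   num / den
}

chow_n(p1 = .28, p2 = .33, margin = .10)
#> ~1448, matching power_twoprop() up to rounding
```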

arcaldwell49 commented 1 year ago

Thank you for the very detailed issue. I did base the function on Chow. I am a little shocked that the approximation performs so poorly. I will work on an update to the function, but no promises. I am in the midst of a job change and move, so my time will be limited for the rest of 2023. However, I will have full access to PASS at my new job, so I should be able to troubleshoot better in a few months...

I think PASS assumes, by default, that users are using a different test statistic (see pages 213-214). My approach assumes users are using the twoprop_test function, which doesn't utilize the Farrington-Manning or Gart-Nam statistics (PASS assumes you would). So the inaccuracy here may be based on the assumed test statistic (maybe).

Regardless, this is on my radar, and I will try to address it.

wviechtb commented 1 year ago

This might be related to how the CI inclusion approach can differ from the testing approach. There is some interesting discussion in section 3.1 of Wellek (2010) that might be related to this, but I am also pressed for time at the moment and can't delve into this more deeply. But it would be good to figure out where this discrepancy is coming from.
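One sanity check on the CI-inclusion vs. testing question: with a symmetric Wald interval (the same unpooled SE as in my simulation above; twoprop_test may use something slightly different), the 90% CI-inclusion rule and TOST at alpha = .05 are algebraically the same decision rule on any given data set, so the discrepancy should lie in the power/sample-size approximation rather than in the decision itself. A sketch:

```r
# 90% CI inclusion vs. TOST at alpha = .05 on one simulated data set.
# Both use the same unpooled Wald SE; they are the same rule rearranged.
set.seed(42)
n      <- 1048
margin <- .10
x1  <- rbinom(n, 1, .28)
x2  <- rbinom(n, 1, .33)
pr1 <- mean(x1); pr2 <- mean(x2)
d   <- pr1 - pr2
se  <- sqrt(pr1 * (1 - pr1) / n + pr2 * (1 - pr2) / n)
crit <- qnorm(.95)

# Decision 1: 90% CI falls inside (-margin, margin)
ci_decision <- (d - crit * se > -margin) && (d + crit * se < margin)

# Decision 2: TOST, two one-sided z tests at alpha = .05
z_lo <- (d + margin) / se   # H0: d <= -margin
z_hi <- (d - margin) / se   # H0: d >=  margin
tost_decision <- (z_lo > crit) && (z_hi < -crit)

ci_decision == tost_decision
#> TRUE
```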

arcaldwell49 commented 5 months ago

Hi again,

Looking into this further today. I finally have PASS set up at my new job and was able to test this out more.

The power analysis TOSTER provides matches the output of PASS for the "Z-Test (Pooled)" option (which should also match the method TOSTER uses for the analysis itself). In fact, under the parameters you provided, the required sample size for TOST equivalence is between 1042 and 1083 across all combinations I ran (about 14). Most were between 1045 and 1048, with the outliers being those that included a continuity correction.

I have a copy of Wellek somewhere, so I'm going to read into these issues further.

I think your diagnosis is probably correct: the CI-inclusion approach (sometimes called "assurance", I think) yields a different required sample size than the statistical power (testing) approach.

Barring further comments, I think I will close this issue. If other issues are identified, I'd be happy to re-open it.