This commit introduces the possibility to specify more flexible hypothesis tests
for each of the power utilities. The calculations used to default to the following:
H0: epsilon = 0 vs H1: epsilon != 0, with epsilon the difference in means between
treatment and control. Now, via the introduction of the alternative and mu
parameters, you can specify hypothesis tests such as:
(mu=delta) => H0: epsilon = delta vs H1: epsilon != delta
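For example (a hypothetical call; argument names other than alternative and mu
are illustrative, not necessarily the utilities' actual signature):

    delta <- 0.01
    # H0: epsilon = delta vs H1: epsilon != delta
    solveforpower_Gtest(sample_size = 10000, effect_size = 0.02,
                        alpha = 0.05, mu = delta, alternative = "two.sided")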
I also introduced several simplifications for consistency:
Across all utilities I defaulted to using a normal approximation for
the test statistics (instead of the exact t-distribution). Indeed, `solveforsample_{T,G}test`
was already making this assumption (by fixing the degrees of freedom to a large number),
so the simplification should make the utilities more consistent.
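Under the normal approximation, the power of a two-sided test of H0: epsilon = mu
at level alpha has a closed form (a sketch of the approximation itself, not the
utilities' exact internals):

    # Power of a two-sided z-test for H0: epsilon = mu, given the true
    # difference `effect` and its standard error `se`
    z_power <- function(effect, mu, se, alpha = 0.05) {
      z <- (effect - mu) / se
      crit <- qnorm(1 - alpha / 2)
      pnorm(z - crit) + pnorm(-z - crit)
    }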
In solveforpower_Gtest I opted for a test with unequal variances and didn't
fall back to the equal-variance assumption when mu=0 (for simplicity).
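Concretely, the standard error is always the unequal-variance (Welch-style) one,
even when mu=0 where a pooled estimate would also be defensible (sketch; variable
names are illustrative):

    # Unequal-variance standard error, used unconditionally
    se_welch <- function(sd1, n1, sd2, n2) {
      sqrt(sd1^2 / n1 + sd2^2 / n2)
    }
    # The pooled (equal-variance) alternative would instead be:
    # sp <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
    # se_pooled <- sp * sqrt(1 / n1 + 1 / n2)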
There was also a mistake in solveforeffectsize_Ttest where we would return the square
of the correct value.
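For reference, inverting the standard normal-approximation sample-size formula
gives the effect size on the scale of the mean difference; the bug returned its
square (a sketch under the normal approximation, not the utility's exact code):

    # Solving n = 2 * sigma^2 * (z_alpha + z_beta)^2 / epsilon^2 for epsilon:
    effectsize_z <- function(n, sigma, alpha = 0.05, power = 0.8) {
      z <- qnorm(1 - alpha / 2) + qnorm(power)
      sqrt(2 * sigma^2 * z^2 / n)  # the bug amounted to omitting this square root
    }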
The commit also introduces a tentative test suite which is far from exhaustive
but which should get us started. Specifically, I test each utility by using its
output as parameters for simulated experiments. I then compare the observed power
with the theoretical power using an equivalence test with a 0.01 margin
(two one-sided tests). The margin is chosen so that the equivalence test has
adequate power (0.999) and false positive rate (0.001) without being computationally
prohibitive: 30000 experiments, as given in R with the TOSTER library by:

    ceiling(powerTOSTone.raw(alpha = 0.001, statistical_power = 0.999,
                             sd = sqrt(0.8*0.2),
                             low_eqbound = -0.015, high_eqbound = 0.015))
I test most of the utilities by simulating Bernoulli-distributed metrics (except
the ones which behave radically differently for binary vs continuous metrics)
instead of normally distributed ones, for speed: one binomial draw accounts for
a whole experiment.
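That is, the sample mean of a Bernoulli metric is a scaled binomial, so each
simulated experiment is two rbinom draws rather than 2*n Bernoulli draws
(sketch; n and the conversion rates are illustrative):

    # One simulated experiment = two binomial draws
    simulate_experiment <- function(n, p_control, p_treatment) {
      x_c <- rbinom(1, n, p_control) / n     # control conversion rate
      x_t <- rbinom(1, n, p_treatment) / n   # treatment conversion rate
      se <- sqrt(x_c * (1 - x_c) / n + x_t * (1 - x_t) / n)
      abs((x_t - x_c) / se) > qnorm(0.975)   # reject H0 at alpha = 0.05?
    }
    # Observed power over 30000 experiments:
    # mean(replicate(30000, simulate_experiment(10000, 0.2, 0.21)))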