This commit introduces the possibility to specify more flexible hypothesis tests
for each of the power utilities. The calculations used to default to the following:
H0: epsilon = 0 vs H1: epsilon != 0, with epsilon the difference in means between
treatment and control. Now, via the introduction of the alternative and mu
parameters, you can specify hypothesis tests such as:
(mu=delta) => H0: epsilon = delta vs H1: epsilon != delta
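For example (a hypothetical call; argument names other than alternative and mu
are illustrative, not necessarily the utilities' actual signature):

    delta <- 0.01
    # H0: epsilon = delta vs H1: epsilon != delta
    solveforpower_Gtest(sample_size = 10000, effect_size = 0.02,
                        alpha = 0.05, mu = delta, alternative = "two.sided")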
I also introduced several simplifications for consistency:
Across all utilities I defaulted to using a normal approximation for
the test statistics (instead of the exact t-distribution). Indeed, `solveforsample_{T,G}test`
was already making this assumption (by fixing the degrees of freedom to a large number),
so the simplification should make the utilities more consistent.
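Under the normal approximation, the power of a two-sided test of H0: epsilon = mu
at level alpha has a closed form (a sketch of the approximation itself, not the
utilities' exact internals):

    # Power of a two-sided z-test for H0: epsilon = mu, given the true
    # difference `effect` and its standard error `se`
    z_power <- function(effect, mu, se, alpha = 0.05) {
      z <- (effect - mu) / se
      crit <- qnorm(1 - alpha / 2)
      pnorm(z - crit) + pnorm(-z - crit)
    }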
In solveforpower_Gtest I opted for a test with unequal variances and didn't
fall back to the equal-variance assumption when mu=0 (for simplicity).
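Concretely, the standard error is always the unequal-variance (Welch-style) one,
even when mu=0 where a pooled estimate would also be defensible (sketch; variable
names are illustrative):

    # Unequal-variance standard error, used unconditionally
    se_welch <- function(sd1, n1, sd2, n2) {
      sqrt(sd1^2 / n1 + sd2^2 / n2)
    }
    # The pooled (equal-variance) alternative would instead be:
    # sp <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
    # se_pooled <- sp * sqrt(1 / n1 + 1 / n2)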
There was also a mistake in solveforeffectsize_Ttest where we would return the square
of the correct value.
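For reference, inverting the standard normal-approximation sample-size formula
gives the effect size on the scale of the mean difference; the bug returned its
square (a sketch under the normal approximation, not the utility's exact code):

    # Solving n = 2 * sigma^2 * (z_alpha + z_beta)^2 / epsilon^2 for epsilon:
    effectsize_z <- function(n, sigma, alpha = 0.05, power = 0.8) {
      z <- qnorm(1 - alpha / 2) + qnorm(power)
      sqrt(2 * sigma^2 * z^2 / n)  # the bug amounted to omitting this square root
    }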
The commit also introduces a tentative test suite which is far from exhaustive
but which should get us started. Specifically, I test each utility by using its
output as parameters for simulated experiments. I then compare the observed power
with the theoretical power using an equivalence test with a 0.01 margin
(two one-sided tests). The margin is chosen so that the equivalence test has
adequate power (0.999) and false positive rate (0.001) without being computationally
prohibitive: 30000 experiments, as given in R with the TOSTER library by:

    ceiling(powerTOSTone.raw(alpha = 0.001, statistical_power = 0.999,
                             sd = sqrt(0.8*0.2),
                             low_eqbound = -0.015, high_eqbound = 0.015))
I test most of the utilities by simulating Bernoulli-distributed metrics (except
the ones which behave radically differently for binary vs continuous metrics)
instead of normally distributed ones, for speed: one binomial draw accounts for
a whole experiment.
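That is, the sample mean of a Bernoulli metric is a scaled binomial, so each
simulated experiment is two rbinom draws rather than 2*n Bernoulli draws
(sketch; n and the conversion rates are illustrative):

    # One simulated experiment = two binomial draws
    simulate_experiment <- function(n, p_control, p_treatment) {
      x_c <- rbinom(1, n, p_control) / n     # control conversion rate
      x_t <- rbinom(1, n, p_treatment) / n   # treatment conversion rate
      se <- sqrt(x_c * (1 - x_c) / n + x_t * (1 - x_t) / n)
      abs((x_t - x_c) / se) > qnorm(0.975)   # reject H0 at alpha = 0.05?
    }
    # Observed power over 30000 experiments:
    # mean(replicate(30000, simulate_experiment(10000, 0.2, 0.21)))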