Looks like SciPy itself doesn't have a one-tailed t-test implementation with a "difference threshold" value, so I turned to a different library called statsmodels. I tried to comment heavily to explain my reasoning behind the two tests.
Here's what the output currently looks like for the data set I currently have:
I'm printing the mean frame-rate difference to help guide intuition; it shows, for example, that there is a nontrivial difference for shadow_map.
The test I'm currently running checks whether the two means are equivalent within a margin of 1.0 fps. In TOST, H0 is that the means differ by at least 1 fps, so rejecting it (a significant p-value) means the difference is smaller than that.
The asterisk next to the TOST p-value marks statistical significance at a threshold of α = 0.05. With this setup, equivalence is established for everything but shadow_map, where the fps difference is clearly bigger than 1.
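For reference, a minimal sketch of what the statsmodels call looks like, using `ttost_ind` from `statsmodels.stats.weightstats` (the data here is synthetic, not our benchmark output):

```python
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(0)
# Synthetic fps samples: two runs whose true means differ by ~0.1 fps
baseline = rng.normal(60.0, 0.5, size=100)
candidate = rng.normal(60.1, 0.5, size=100)

# TOST with margin ±1.0 fps: H0 is |mean difference| >= 1.0.
# The reported p-value is the larger of the two one-sided t-test p-values.
p, lower, upper = ttost_ind(baseline, candidate, -1.0, 1.0)
equivalent = p < 0.05  # significant => means are within 1 fps of each other
```

`lower` and `upper` also carry the individual one-sided t statistics and p-values if you want to print them alongside the combined result.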
This is for our new "two one-sided tests" equivalence check procedure. It uses t-tests for now; it may be possible to adapt it to Wilcoxon tests.
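One possible Wilcoxon-flavored adaptation (an untested sketch, my own naming): run the same two one-sided tests with `scipy.stats.mannwhitneyu`, the Mann-Whitney U form of the Wilcoxon rank-sum test, shifting one sample by the margin before each comparison:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def tost_mannwhitney(x1, x2, margin):
    """Hypothetical nonparametric TOST for two independent samples.

    Tests whether the location shift between x1 and x2 lies within
    (-margin, +margin); returns the larger of the two one-sided p-values.
    """
    # H0: shift <= -margin; reject when x1 + margin tends to exceed x2
    _, p_lower = mannwhitneyu(x1 + margin, x2, alternative="greater")
    # H0: shift >= +margin; reject when x1 - margin tends to fall below x2
    _, p_upper = mannwhitneyu(x1 - margin, x2, alternative="less")
    return max(p_lower, p_upper)

rng = np.random.default_rng(1)
p = tost_mannwhitney(rng.normal(60.0, 0.5, 100),
                     rng.normal(60.1, 0.5, 100),
                     margin=1.0)
```

Taking the max of the two one-sided p-values mirrors how the t-test TOST combines its halves, so the α = 0.05 threshold would apply the same way.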