option 1: calculate ALL p-values, even for early-stopped runs -> seems like the best option, but requires modifying the code logic and means longer compute times for large n
option 2: only calculate the confidence interval over non-early-stopped runs
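
Rough sketch of what option 2 could look like, to make the trade-off concrete. Everything here is an assumption, not the existing code: early-stopped runs are recorded as None, the interval is a normal-approximation interval on the mean p-value, and the names mean_ci and ci_completed_only are hypothetical.

```python
import math
import statistics
from typing import Optional, Sequence, Tuple


def mean_ci(values: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for the mean of `values`.

    Assumption: z = 1.96 gives roughly a 95% interval.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a spread")
    mean = statistics.fmean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(n)
    return mean - half_width, mean + half_width


def ci_completed_only(p_values: Sequence[Optional[float]]) -> Tuple[float, float]:
    """Option 2: ignore early-stopped runs (assumed to be stored as None)."""
    completed = [p for p in p_values if p is not None]
    return mean_ci(completed)


# Hypothetical example: runs 2 and 5 were stopped early.
p_values = [0.1, None, 0.3, 0.2, None]
low, high = ci_completed_only(p_values)
```

With option 1 there would be no None entries, so mean_ci could be called on the full list directly; the extra cost sits in computing the missing p-values, not in the interval itself.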
Not sure whether visualisation is important anyway