WillianFuks / tfcausalimpact

Python Causal Impact Implementation Based on Google's R Package. Built using TensorFlow Probability.
Apache License 2.0

p-value is always less than 0.5 #73

Open DirceuVio opened 1 year ago

DirceuVio commented 1 year ago

Hey, Willian. Thanks so much for publishing the package. I've been using it a lot.

Something that really bothers me is the fact that the model always gives us p-values less than 0.5, no matter what kind of time series we use.

See the code below, where I use synthetic data. Using a t-test, we get a p-value close to 1.

    import numpy as np
    import pandas as pd
    from causalimpact import CausalImpact

    rng = np.random.default_rng(1)

    # two identical synthetic series: a constant level of 3000 plus a linear trend
    time_series1 = np.full(60, 3000, dtype=float)
    time_series2 = np.full(60, 3000, dtype=float)
    time_series1 += 5 * np.arange(60)
    time_series2 += 5 * np.arange(60)

    ci = CausalImpact(pd.DataFrame({'x': time_series2, 'y': time_series1}), pre_period=[0, 29], post_period=[30, 59])
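
The exact t-test isn't shown in the thread, so the following is only a guess at one plausible comparison: a two-sample t-test between the two series over the post period, which returns a p-value of 1.0 because the series are identical.

    from scipy import stats  # assumed available; not part of the original snippet

    # the two series are identical, so the mean difference is exactly zero
    # and the two-sample t-test yields t = 0.0 and p-value = 1.0
    t_stat, p_val = stats.ttest_ind(time_series1[30:], time_series2[30:])
    print(t_stat, p_val)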


    Posterior Inference {Causal Impact}
                              Average              Cumulative
    Actual                    3222.5               96675.0
    Prediction (s.d.)         3206.98 (29.77)      96209.42 (893.14)
    95% CI                    [3148.09, 3264.79]   [94442.71, 97943.76]

    Absolute effect (s.d.)    15.52 (29.77)        465.58 (893.14)
    95% CI                    [-42.29, 74.41]      [-1268.76, 2232.29]

    Relative effect (s.d.)    0.48% (0.93%)        0.48% (0.93%)
    95% CI                    [-1.32%, 2.32%]      [-1.32%, 2.32%]

    Posterior tail-area probability p: 0.37
    Posterior prob. of a causal effect: 63.34%

    For more details run the command: print(impact.summary('report'))

WillianFuks commented 1 year ago

Hi @DirceuVio ,

I'm not quite sure how you performed the t-test, but either way the result you observed is indeed correct.

This question about the behavior of the p-value has been asked before, and it's something that troubles end users, but as far as the correctness of the system goes everything is running as expected.

After rereading the explanation I gave in that earlier issue, I realized it was quite superficial, so I'll try to improve on it here.

The reason the upper limit for the p-value is 0.5 is that it's computed by taking the minimum between the number of simulated points that surpass the summation of the observed y and the number that fall below it.

In other words, we try to decide whether there's an impact or not by drawing lots of simulated responses that represent what y would look like had no intervention taken place.

Then we take the summation of each of those simulated time series over the time dimension and are left with one total value representing each simulation.

After that, each total is compared with the summation of the observed response y itself. Some totals may fall above the sum of y and some may fall below.

The procedure here is to take the minimum between the two counts (which express either a positive or a negative effect):

    signal = min(
        np.sum(sim_sum > post_data_sum),  # simulations above the observed sum
        np.sum(sim_sum < post_data_sum)   # simulations below the observed sum
    )

If all simulated data falls below post_data_sum then the p-value is 0 (zero). If half the points fall above then the p-value is 0.5. Beyond that point, if more points end up above post_data_sum, the minimum operator switches to the smaller count, so the p-value starts dropping again.
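
Here's a small standalone sketch of that behavior. The numbers are made up to mimic the summary above; this is not the library's internal code.

    import numpy as np

    rng = np.random.default_rng(0)

    # made-up values mimicking the summary above: the observed post-period sum
    # and 1000 simulated post-period sums centered on the model's prediction
    post_data_sum = 96675.0
    sim_sum = rng.normal(loc=96209.42, scale=893.14, size=1000)

    # count the simulations on each side of the observed sum and keep the smaller count
    signal = min(
        np.sum(sim_sum > post_data_sum),
        np.sum(sim_sum < post_data_sum)
    )

    # the smaller count divided by the number of simulations can never exceed 0.5
    p_value = signal / len(sim_sum)
    print(p_value)

Shifting post_data_sum around only changes which side is smaller; at most half of the simulations can land on that side, hence the 0.5 cap.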

That's why 0.5 is the highest possible value. It's really how this computation is performed in this library that sets this behavior. It was taken from the original R package and resembles a Markovian sampling approach for finding the p-value, which also means the value is capped at 0.5 by definition.

As for deciding whether there's an impact or not, this shouldn't be a source of concern. The important values tend to be around the 0.05 area, which indicates that the observed response y is either higher or lower than what would be expected most of the time (or really, in most of the simulations) had no intervention taken place.

Let me know if this helps answer your question,

Best,