WillianFuks / tfcausalimpact

Python Causal Impact Implementation Based on Google's R Package. Built using TensorFlow Probability.
Apache License 2.0

comparison of output (impact$series$cum.effect) in Python and R packages #60

Open rj678 opened 1 year ago

rj678 commented 1 year ago

Thanks for the great effort in keeping this library updated.

I'm working on converting an R library to Python, and the R library has the following line of code:

preperiod <- subset(impact$series, cum.effect == 0)

where impact is the output object of the CausalImpact library.

From what I can tell:

impact$series$cum.effect in R is computed in impact.inferences.post_cum_effects_means in python.

I used the comparison example that you have provided in the README (with comparison_data.csv), but I'm getting different output. In the R library, the values of impact$series$cum.effect start with zero at the earlier dates, whereas they are NaN in the Python package; the values for the later dates differ as well.

I'd greatly appreciate some feedback on comparing the output so I can convert the following line of code to Python appropriately:

preperiod <- subset(impact$series, cum.effect == 0)

I tried both methods, hmc and vi, and the output of the other columns in impact$series differs from impact.inferences in Python as well.

thank you and looking forward to hearing back from you

WillianFuks commented 1 year ago

Hi @rj678 ,

The preperiod as given by your assignment would be computed in Python by something like:

preperiod = ci.inferences['post_cum_effects_means'][ci.inferences['post_cum_effects_means'].isna()]

which essentially retrieves the rows corresponding to the training (pre-period) data. In the R package those empty values are assigned zeroes, whereas in Python, as they simply don't exist, they remain NaN.

Notice also that if you want to work with pre-period data, it's also available in the ci object as ci.pre_data or ci.normed_pre_data (the latter is the same data with normalization applied).
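To make the mapping concrete, here is a minimal sketch of how the R subset could be ported, using a small mocked DataFrame in place of a fitted ci.inferences (the column name post_cum_effects_means is from the thread; the dates and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Mock of ci.inferences: pre-period rows carry NaN cumulative effects in
# tfcausalimpact, while post-period rows carry the estimated values.
inferences = pd.DataFrame(
    {"post_cum_effects_means": [np.nan, np.nan, np.nan, 1.2, 2.5, 3.9]},
    index=pd.date_range("2020-01-01", periods=6),
)

# R's `subset(impact$series, cum.effect == 0)` keeps the pre-period rows,
# where R stores zeroes; here those rows are the NaN ones instead.
preperiod = inferences[inferences["post_cum_effects_means"].isna()]

# To reproduce R's representation exactly, fill the NaNs with zero.
series_r_style = inferences["post_cum_effects_means"].fillna(0.0)

print(len(preperiod))          # number of pre-period rows
print(series_r_style.iloc[0])  # 0.0, matching the R convention
```

Filling with zero is only cosmetic; the NaN rows and the zero rows identify the same pre-period timestamps.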

As for varying results, did the results you observed differ much from the official README report? I just ran it here with the hmc method and got very close results. They will never be identical, as the underlying algorithm is not deterministic, but they should always converge to the same conclusions and be very close for the most part.

Results are expected to differ from the original R package as well, but again they should lead to the same conclusions and be similar overall. The cumulative field will differ more, as it sums all estimated points in the post-period.
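A quick way to see why the cumulative column diverges more than the point estimates: small independent per-point estimation differences accumulate under the running sum. A toy sketch (the 0.05 noise scale is a made-up stand-in for sampler variability, not a property of the library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same "true" per-point effect, estimated by two independent,
# non-deterministic runs with small independent noise.
true_point_effects = np.full(100, 1.0)
run_a = true_point_effects + rng.normal(0.0, 0.05, 100)
run_b = true_point_effects + rng.normal(0.0, 0.05, 100)

# Largest disagreement point-by-point vs. on the cumulative series.
point_gap = np.abs(run_a - run_b).max()
cum_gap = np.abs(run_a.cumsum() - run_b.cumsum()).max()

print(point_gap, cum_gap)  # the cumulative gap is typically much larger
```

The point-wise gaps stay bounded by the noise scale, while the cumulative gap behaves like a random walk and grows with the length of the post-period.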

Let me know if this helps you,

Best,

Will

rj678 commented 1 year ago

Thanks so much for confirming that the empty values are zero in R and NaN in Python. From what I remember, the difference between the non-zero values in impact$series$cum.effect and ci.inferences['post_cum_effects_means'] was not insignificant. I'll check again and get back; thanks so much for the detailed response.