WillianFuks / tfcausalimpact

Python Causal Impact Implementation Based on Google's R Package. Built using TensorFlow Probability.
Apache License 2.0

Understanding the results and improving the model #72

Open · kai-majerus opened this issue 1 year ago

kai-majerus commented 1 year ago

This post isn't about a particular problem with the package, but rather about how to understand the results and improve the model. I hope this is the right place to post.

I work for a company that helps online retailers group their inventory into Google Ads campaigns. I am using Causal Impact to determine whether the release of a new feature within our software had an impact on the total impressions that an online retailer received through Google Ads - an impression is counted each time the retailer's ad is shown on a Google search results page.

To begin with, I just have one X variable.

y - impressions over the past 365 days.

X - daily searches for the term ‘garden furniture’ using google trends

I expected searches for 'garden furniture' to correlate well with this particular retailer's impressions (the correlation was +0.58). Importantly, Google search volumes won't be influenced by the change we made to our software, so this satisfies the key requirement that the X variables are not affected by the intervention.

After standardising, the data looks like this.

[image: plot of the standardized y and X series]
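Roughly, the standardization is just a z-score per series; a minimal sketch (the column names below are placeholders):

```python
# Placeholder column names; `dated_data` is assumed to be a daily-indexed DataFrame
# holding the raw impressions (y) and 'garden furniture' searches (X).
standardized_data = (dated_data - dated_data.mean()) / dated_data.std()

# Correlation is unaffected by the z-scoring; this printed roughly +0.58.
print(standardized_data['impressions'].corr(standardized_data['garden_furniture']))
```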

And running Causal Impact shows that the intervention did not quite have a significant effect (p=0.07).

pre_period_start = '20220405'
pre_period_end = '20230207'
post_period_start = '20230208'
post_period_end = '20230329'

pre_period = [pre_period_start, pre_period_end]
post_period = [post_period_start, post_period_end]

ci = CausalImpact(dated_data, pre_period, post_period)
ci.plot()

[image: Causal Impact plot]

[image: Causal Impact output]
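For reference, the numbers behind these plots can also be printed with the package's summary methods:

```python
# Text summary with the estimated effect and the posterior p-value (0.07 here).
print(ci.summary())

# Longer natural-language report of the same results.
print(ci.summary(output='report'))
```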

Questions

  1. How can I verify that my X variables are doing a good job of predicting y?

tf.reduce_mean(
    ci.model.components_by_name['SparseLinearRegression/'].params_to_weights(
        ci.model_samples['SparseLinearRegression/_global_scale_variance'],
        ci.model_samples['SparseLinearRegression/_global_scale_noncentered'],
        ci.model_samples['SparseLinearRegression/_local_scale_variances'],
        ci.model_samples['SparseLinearRegression/_local_scales_noncentered'],
        ci.model_samples['SparseLinearRegression/_weights_noncentered'],
    ),
    axis=0,
)

<tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.06836722], dtype=float32)>

The value for beta.X = 0.06836722 seems quite low and suggests that the garden_furniture searches don't explain impressions very well. Is this the correct interpretation?

  2. When adding another X variable to the model, how can I determine whether adding that variable was useful or not?

  3. I’ve also attempted to backtest the model by selecting the first 90 data points and an imaginary intervention date. As shown below, we do not get a significant effect. However, I’m concerned that the predictions don’t seem to align that closely with the actual y. Does this look like a problem?

[image: backtest plot over the first 90 data points]

  4. General advice - Any suggestions on improving the analysis would be greatly appreciated, as this is the first time I’ve used Causal Impact. In particular, I'm struggling with finding good covariates and with deciding whether to change the frequency of the data.

WillianFuks commented 1 year ago

Hi @kai-majerus ,

Let's see if I can help at least a bit:

  1. The more rigorous approach here would probably be to run statistical hypothesis tests to see if the influence of your keyword trend is statistically significant. We don't have that available in TFP (which this library builds upon), so we are left with printing and plotting these values to see how the model is behaving overall.

The code that you used as an example is the one available in the getting started notebook; I should probably improve that, as we have some better features now.

In order to print and plot values associated with what the model converged to, I recommend using this notebook as a reference.

You could, for instance, print each parameter mean and standard deviation with something like:

import tensorflow as tf

# Posterior mean +- standard deviation for every model parameter.
for param in ci.model.parameters:
    print("{}: {} +- {}".format(
        param.name,
        tf.reduce_mean(ci.model_samples[param.name], axis=0),
        tf.math.reduce_std(ci.model_samples[param.name], axis=0)))

(This is just an example; I don't have access to my workstation right now so I can't confirm it runs.)

Another interesting approach would be to decompose the model and plot each component as well:

from tensorflow_probability import sts

# Decompose the fitted pre-period series into its components.
component_dists = sts.decompose_by_component(
    ci.model,
    observed_time_series=your_Y_data,  # your observed pre-period series
    parameter_samples=ci.model_samples)

# Same decomposition for the forecast over the post-period.
forecast_component_dists = sts.decompose_forecast_by_component(
    ci.model,
    forecast_dist=ci.model.forecast,  # the forecast distribution used by the model
    parameter_samples=ci.model_samples)

And use those "dists" as input for plotting functions.
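For example, a rough plotting sketch along those lines, assuming matplotlib and the `component_dists` returned above (each decomposed distribution exposes a mean and a standard deviation):

```python
import matplotlib.pyplot as plt
import numpy as np

# One panel per component: posterior mean with a +-2 std band.
n_components = len(component_dists)
fig, axes = plt.subplots(n_components, 1, figsize=(10, 3 * n_components),
                         sharex=True, squeeze=False)
for ax, (component, dist) in zip(axes[:, 0], component_dists.items()):
    mean = dist.mean().numpy()
    std = dist.stddev().numpy()
    ax.plot(mean, label=component.name)
    ax.fill_between(np.arange(len(mean)), mean - 2 * std, mean + 2 * std, alpha=0.3)
    ax.legend(loc='upper left')
plt.tight_layout()
plt.show()
```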

Funny thing (and quite coincidentally), the second example in this colab also finds a temperature effect (their "X") of about 0.06, just like yours:

Inferred parameters:
observation_noise_scale: 0.007361857686191797 +- 0.0015756225911900401
hour_of_day_effect_drift_scale: 0.002189198974519968 +- 0.0007748338975943625
day_of_week_effect_drift_scale: 0.012116787023842335 +- 0.018613167107105255
*temperature_effect_weights: [0.06205676] +- [0.00406885]*
autoregressive_coefficients: [0.9839601] +- [0.00560336]
autoregressive_level_scale: 0.14477737247943878 +- 0.0036965550389140844

Notice the temperature effect has a std of just 0.004, so the 95% interval wouldn't cross the 0 (zero) threshold by any margin (which hints at the weight being statistically significant in the frequentist interpretation).

If you check their component plots you'll see that the temperature component is also quite relevant compared to the others, so it's definitely helping out in the forecasting procedure (notice the day_of_week_effect varies between -0.1 and +0.1, which is much smaller than the temperature variation and impact, for instance).

So I'd recommend plotting and printing out all those values and comparing them and their overall impact. If you observe that your X variable is not adding much relative to the other components (like the "day_of_week_effect" in the example mentioned) then it may be a sign it's indeed not helping much.

Notice this is not an entirely rigorous approach, but at least it uses the posterior samples and values to guide you and give some idea of what is working or not.
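For instance, something along these lines (a sketch reusing the `params_to_weights` call from your snippet above) would show whether the regression weight's own 95% interval stays away from zero:

```python
import numpy as np

# Posterior samples of the regression weights, shape (num_samples, num_covariates).
weights = ci.model.components_by_name['SparseLinearRegression/'].params_to_weights(
    ci.model_samples['SparseLinearRegression/_global_scale_variance'],
    ci.model_samples['SparseLinearRegression/_global_scale_noncentered'],
    ci.model_samples['SparseLinearRegression/_local_scale_variances'],
    ci.model_samples['SparseLinearRegression/_local_scales_noncentered'],
    ci.model_samples['SparseLinearRegression/_weights_noncentered'],
).numpy()

lo, hi = np.percentile(weights, [2.5, 97.5], axis=0)
print('posterior mean:', weights.mean(axis=0))
print('95% interval  :', list(zip(lo, hi)))  # an interval crossing 0 suggests a weak covariate
```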

  2. I recommend the same approach as already discussed. If you add a new variable, print and plot everything; if not much has changed, that's an indication the new X was not helpful. If its std is also too wide (especially if the interval crosses the 0 threshold, i.e., contains both negative and positive values) that would also indicate the new X is not helpful.

  3. Back-testing is also a great way to confirm the model is working. What you observed is not necessarily a problem: most of the points fall within the 95% credible interval, which indicates the forecast is properly tracking the outcomes. The flat straight line for the mean is an indication that only the local level component is contributing to the forecast, though. If the X variable is not changing much, or is not helping much, this line does tend to look that way.
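Something like the following is usually enough for a quick back-test; the dates below are just placeholders inside your pre-period:

```python
# Pretend the intervention happened inside the pre-period and check that the
# model does NOT report a significant effect there (placeholder dates).
backtest_data = dated_data.loc['20220405':'20220801']
fake_pre_period = ['20220405', '20220703']
fake_post_period = ['20220704', '20220801']

ci_backtest = CausalImpact(backtest_data, fake_pre_period, fake_post_period)
print(ci_backtest.summary())  # the p-value should stay well above 0.05
```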

Further approaches you could take are to add seasonal components (maybe at the day, week, month or year level), test auto-regressive components to see if they help tighten the bounds of the residual variance, and play with adding other models (such as local linear trend models). I'd recommend using variational inference for that, as HMC will be quite slow.
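For example, a weekly seasonal component and the (faster) variational inference fit can both be requested through `model_args`; a sketch, with values you'd want to tune:

```python
# Add a weekly seasonality on top of the default local level + regression model
# and fit with variational inference instead of HMC.
ci = CausalImpact(
    dated_data, pre_period, post_period,
    model_args={'nseasons': 7, 'season_duration': 1, 'fit_method': 'vi'}
)
print(ci.summary())
```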

  4. The points you're struggling with are quite common and tend to be pain points for causal inference; finding valuable covariates is usually a challenge. One thing I noticed is that you said you did a software release and used keyword trends to investigate ad impressions. I don't know precisely what that means, but usually we'd expect marketing investment to be the variable that changes ad impressions. Maybe you mean a new website release and are observing organic traffic?

Also, there are moments in your data where there's some disconnect between impressions and keywords, such as around September and December. Maybe those are hot periods for this type of furniture; you could try adding a dummy regressor that is 0 everywhere except in those periods and see if it helps prediction quality as well.
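A sketch of that idea, assuming `dated_data` is a DataFrame with a DatetimeIndex and the response in its first column:

```python
# Dummy covariate that is 1 only in the months where impressions and searches
# seem to disconnect (September and December assumed here), 0 elsewhere.
dated_data = dated_data.copy()
dated_data['hot_period'] = dated_data.index.month.isin([9, 12]).astype(float)

ci = CausalImpact(dated_data, pre_period, post_period)
print(ci.summary())
```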

Either way, finding good covariates will remain a challenge. This phase tends to be quite empirical, and it's recommended to test a lot of ideas to see which works best (this library automatically removes covariates that don't offer much value to the final predictions, so you can add more and see what happens).

As for changing the frequency of the data, I'd suggest -- as usual -- giving it a try and seeing how it goes. By aggregating the data the algorithm may find more signal amid the noise, but finer-grained seasonal components will be lost, so a trade-off is usually expected.
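A sketch of the aggregation, assuming a daily DatetimeIndex; the period boundaries would need to be restated at the new frequency (the dates below are placeholders):

```python
# Aggregate the daily impressions/searches to weekly totals before modelling.
weekly_data = dated_data.resample('W').sum()

ci_weekly = CausalImpact(weekly_data,
                         ['2022-04-10', '2023-02-05'],   # placeholder weekly boundaries
                         ['2023-02-12', '2023-03-26'])
print(ci_weekly.summary())
```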

You can use plots, printed converged values and back-testing techniques to guide you toward the best model for you (other packages, such as statsmodels, offer goodness-of-fit metrics, but those are not available here).

If you can, send me an example of your standardized data with white noise added and I'll play around with it as well to see what I find.

Hope that helps. I'll try in the next week or two to improve the documentation in this repo, which may make it more helpful after all.

Let me know if this helps,

Best,

Will

kai-majerus commented 1 year ago

@WillianFuks Thanks for the great reply - really helpful. I'm currently working on something else, but will return to this work in the next few months and try all of your suggestions.

Here is the standardized data with some white noise added as `noise = np.random.normal(loc=0, scale=0.1, size=standardized_data.shape)`: Dataset 1 - Sheet1.csv

Another interesting idea suggested in the original paper is to use groups of Google Trends search terms as covariates to proxy industry verticals. So instead of just using searches for garden furniture as a covariate, you could use the sum of searches for garden furniture, outdoor furniture, garden table and garden chair as one industry vertical.

As another example, I have a standardized dataset for a retailer that sells cookware. Y is again their impressions on Google Ads, and I have 5 covariates, where each covariate is the sum of Google Trends searches for the terms in the group (roughly as sketched after the list below). Each group tries to capture a different industry vertical.

These are the groups:

group_1 = ['pots', 'pans', 'bakeware', 'baking tray', 'roasting tray', 'frying pan', 'wok']
group_2 = ['kitchen utensils', 'knives', 'chopping board', 'wooden spoon', 'whisk']
group_3 = ['food processor', 'coffee machine', 'kettle', 'blender', 'toaster', 'scale', 'microwave']
group_4 = ['tableware', 'silverware', 'plates', 'bowls', 'glasses', 'cutlery', 'mugs', 'tupperware']
group_5 = ['cookbooks', 'recipe book']
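Roughly, building the covariates looks like this (a sketch; `trends` and `impressions` are placeholder names for the per-term Google Trends series and the retailer's impressions):

```python
import pandas as pd

# `trends` is assumed to have one (standardized) column per search term,
# `impressions` is assumed to be the retailer's daily impressions Series.
groups = {
    'vertical_1_pots_pans': group_1,
    'vertical_2_utensils': group_2,
    'vertical_3_appliances': group_3,
    'vertical_4_tableware': group_4,
    'vertical_5_books': group_5,
}

# One covariate per vertical: the sum of the group's search series.
covariates = pd.DataFrame({name: trends[terms].sum(axis=1) for name, terms in groups.items()})
data = pd.concat([impressions.rename('y'), covariates], axis=1)
```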

Here is that dataset if you want to have a play around with it: Dataset 2 - Sheet1.csv

This is the extract from the paper - section 4, analysis 2.

An important characteristic of counterfactual-forecasting approaches is that they do not require a setting in which a set of controls, selected at random, was exempt from the campaign. We therefore repeated the preceding analysis in the following way: we discarded the data from all control regions and, instead, used searches for keywords related to the advertiser’s industry, grouped into a handful of verticals, as covariates. In the absence of a dedicated set of control regions, such industry related time series can be very powerful controls, as they capture not only seasonal variations but also market-specific trends and events (though not necessarily advertiser-specific trends). A major strength of the controls chosen here is that time series on web searches are publicly available through Google Trends (http://www.google.com/trends/). This makes the approach applicable to virtually any kind of intervention. At the same time, the industry as a whole is unlikely to be moved by a single actor’s activities. This precludes a positive bias in estimating the effect of the campaign that would arise if a covariate was negatively affected by the campaign.