WillianFuks / tfcausalimpact

Python Causal Impact Implementation Based on Google's R Package. Built using TensorFlow Probability.
Apache License 2.0

[Request] Would an A/A test feature be appropriate? #35

Open MasaAsami opened 2 years ago

MasaAsami commented 2 years ago

Thank you for a very nice package.

I've recently been trying to switch over from pycausalimpact, and I don't fully understand this module yet, so today I'll just share an idea.

[Idea] How about adding a feature for a so-called A/A test?

[Contents] Run placebo analyses on pseudo post-periods taken from the period before the true intervention:

[The following is just a rough sketch of the idea]

import numpy as np
import pandas as pd
import causalimpact
import matplotlib.pyplot as plt
from tqdm import tqdm

# `df`, `PRE_PERIOD` and `POST_PERIOD` are assumed to already exist: the
# exchange-rate example data and the true pre/post intervention periods.
impact_model = causalimpact.CausalImpact(
    df[["DEXUSUK", "DEXUSEU"]],
    PRE_PERIOD,
    POST_PERIOD,
    nseasons=[{"period": 7, "harmonics": 2}],
)

list_date = list(df[: PRE_PERIOD[1]].index)
post_n = 21
test_n = 50
test_result = []
minimum_n = 50
y_col = "DEXUSUK"

# True intervention data will also be included for comparison.
postterm_estimated = pd.concat([impact_model.inferences, df[y_col]], axis=1).dropna()
postterm_estimated[f"aatest_no"] = "real_test"
postterm_estimated[f"pseudo_intervention"] = POST_PERIOD[0]
postterm_estimated["p_value"] = impact_model.p_value
postterm_estimated["train_datasize"] = len(df[: PRE_PERIOD[1]])
test_result.append(postterm_estimated)

for i in tqdm(range(1, test_n + 1)):
    if post_n * i + minimum_n >= len(list_date):
        # Minimum sample size required for learning
        break

    temp_pre_period = [list_date[0], list_date[-post_n * i]]
    temp_post_period = [list_date[-post_n * i + 1], list_date[-post_n * i + post_n - 1]]

    temp_df = df[: temp_post_period[1]][["DEXUSUK", "DEXUSEU"]]

    # TODO: this refits the model from scratch on every iteration, which is very
    # inefficient (each previous fit is thrown away); I'm considering switching to
    # sequential learning instead.
    temp_impact_model = causalimpact.CausalImpact(
        temp_df,
        temp_pre_period,
        temp_post_period,
        nseasons=[{"period": 7, "harmonics": 2}],
    )

    postterm_estimated = pd.concat([temp_impact_model.inferences, temp_df[y_col]], axis=1).dropna()
    postterm_estimated[f"aatest_no"] = f"aatest_{i}"
    postterm_estimated[f"pseudo_intervention"] = temp_post_period[0]
    postterm_estimated["p_value"] = temp_impact_model.p_value
    postterm_estimated["train_datasize"] = len(df[: temp_pre_period[1]])

    test_result.append(postterm_estimated)

aatest_df = pd.concat(test_result).sort_index()
aatest_table = aatest_df.groupby("aatest_no").agg(
    {
        "point_effects": np.mean,
        "p_value": "max",
        "pseudo_intervention": "max",
        "preds": np.count_nonzero,
        "train_datasize": "max",
    }
)
aatest_table.columns = [
    "Absolute effect",
    "Posterior tail-area probability (p-value)",
    "pseudo_intervention",
    "simulation_termsize",
    "train_termsize",
]

[screenshot: resulting A/A test summary table]

# plot code
intervention_idx = aatest_df.index.get_loc(POST_PERIOD[0])

pseudo_idx = [
    aatest_df.index.get_loc(_date)
    for _date in aatest_table[aatest_table.index != "real_test"][
        "pseudo_intervention"
    ].values
]
# The earliest pseudo intervention sits at position 0 of the index; drop it so
# that `index[ind_ - 1]` below doesn't wrap around to the last date.
if 0 in pseudo_idx:
    pseudo_idx.remove(0)

fig = plt.figure(figsize=(15, 12))
ax = plt.subplot(3, 1, 1)
ax.plot(aatest_df[y_col], "k", label="y")
ax.plot(aatest_df["preds"], "b--", label="Predicted")
ax.axvline(aatest_df.index[intervention_idx - 1], c="r", linestyle="--")
for ind_ in pseudo_idx:
    ax.axvline(aatest_df.index[ind_ - 1], c="k", linestyle="--", alpha=0.25)
ax.fill_between(
    aatest_df.index,
    aatest_df["preds_lower"],
    aatest_df["preds_upper"],
    facecolor="blue",
    interpolate=True,
    alpha=0.25,
)
ax.grid(True, linestyle="--")
ax.legend()
plt.setp(ax.get_xticklabels(), visible=False)

ax = plt.subplot(3, 1, 2, sharex=ax)
ax.plot(aatest_df["point_effects"], "b--", label="Point Effects")
ax.axvline(aatest_df.index[intervention_idx - 1], c="r", linestyle="--")
for ind_ in pseudo_idx:
    ax.axvline(aatest_df.index[ind_ - 1], c="k", linestyle="--", alpha=0.25)
ax.fill_between(
    aatest_df["point_effects"].index,
    aatest_df["point_effects_lower"],
    aatest_df["point_effects_upper"],
    facecolor="blue",
    interpolate=True,
    alpha=0.25,
)
ax.axhline(y=0, color="k", linestyle="--")
ax.grid(True, linestyle="--")
ax.legend()
plt.setp(ax.get_xticklabels(), visible=False)

ax = plt.subplot(3, 1, 3, sharex=ax)
ax.plot(aatest_df["post_cum_effects"], "b--", label="Cumulative Effect")
ax.axvline(aatest_df.index[intervention_idx - 1], c="r", linestyle="--")
for ind_ in pseudo_idx:
    ax.axvline(aatest_df.index[ind_ - 1], c="k", linestyle="--", alpha=0.25)
ax.fill_between(
    aatest_df["post_cum_effects"].index,
    aatest_df["post_cum_effects_lower"],
    aatest_df["post_cum_effects_upper"],
    facecolor="blue",
    interpolate=True,
    alpha=0.25,
)
ax.grid(True, linestyle="--")
ax.axhline(y=0, color="k", linestyle="--")
ax.legend()
plt.show()

[screenshot: resulting A/A test plots]

WillianFuks commented 2 years ago

Hi @MasaAsami ,

If I understood you correctly, there's already a well-known technique in this field called "back-testing". The idea is similar to what you described, but instead of comparing impacts estimated on training data against the test data, the approach is to choose a model and run several evaluations on the training data to assess its performance (using the same windowing technique you described).

You can then compare several distinct models (adding various components such as local level, trend, seasonality and so on) and choose the best one to finally run on the test data, as in the rough sketch below.
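
A purely illustrative sketch of what such a back-test based comparison could look like, reusing the rolling placebo windows from your snippet; the candidate model_args values and the scoring rule (mean absolute placebo effect, lower is better) are assumptions on my part, not a proposed API:

import numpy as np
import causalimpact

# Placebo windows carved out of the pre-intervention period, mirroring the loop above.
list_date = list(df[: PRE_PERIOD[1]].index)
post_n = 21
windows = [
    ([list_date[0], list_date[-post_n * i]],
     [list_date[-post_n * i + 1], list_date[-post_n * i + post_n - 1]])
    for i in range(1, 4)
]

# Candidate model configurations; the exact keys accepted by model_args are an
# assumption here and may need adjusting.
candidates = {
    "local_level_only": {},
    "weekly_seasonality": {"nseasons": 7, "season_duration": 1},
}

def backtest_score(data, windows, model_args):
    """Mean absolute placebo effect across windows; lower means fewer spurious effects."""
    scores = []
    for pre_period, post_period in windows:
        ci = causalimpact.CausalImpact(
            data[: post_period[1]], pre_period, post_period, model_args=model_args
        )
        scores.append(ci.inferences["point_effects"].dropna().abs().mean())
    return float(np.mean(scores))

scores = {
    name: backtest_score(df[["DEXUSUK", "DEXUSEU"]], windows, args)
    for name, args in candidates.items()
}
best_model = min(scores, key=scores.get)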

I'd happily merge a PR adding a feature like that. The only requirements I'd ask for are that it be fully unit tested and, since there will probably be performance issues, that training be run in parallel as well, roughly along the lines of the sketch below (I'm also not quite sure how the various models would be chosen, so I'm open to ideas).
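
Purely as an illustration of the parallel part (fit_window and windows are hypothetical names, and TensorFlow's behaviour under fork-based multiprocessing would need checking in practice):

from concurrent.futures import ProcessPoolExecutor

import causalimpact

def fit_window(window):
    """Fit one placebo window and return its pseudo intervention date and p-value."""
    pre_period, post_period = window
    # Assumes a fork start method so workers inherit `df` from the parent process.
    ci = causalimpact.CausalImpact(
        df[: post_period[1]][["DEXUSUK", "DEXUSEU"]],
        pre_period,
        post_period,
    )
    return post_period[0], ci.p_value

if __name__ == "__main__":
    # `windows` is the same list of (pre_period, post_period) pairs as above;
    # processes are used because each fit is CPU-bound.
    with ProcessPoolExecutor(max_workers=4) as executor:
        p_values = dict(executor.map(fit_window, windows))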

As for pycausalimpact vs tfcausalimpact, notice they are essentially the same. I still don't know what happened to the former, as I no longer have contact with Dafiti's IT team, but you can use the latter interchangeably; basic usage like the snippet below should run unchanged (the seasonal-component arguments are the main thing that may need adapting).
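
For instance, assuming the same example data and periods as above:

from causalimpact import CausalImpact

# Same constructor and attributes under both packages.
ci = CausalImpact(df[["DEXUSUK", "DEXUSEU"]], PRE_PERIOD, POST_PERIOD)
print(ci.summary())
print(ci.p_value)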

Let me know if this answers your question.

Best,

Will

MasaAsami commented 2 years ago

I see, "back-testing" may indeed be a more appropriate expression.

The sensitivity to a causal effect under this scheme obviously depends on the model we set up, so I think it ends up being essentially the same thing as "A/A testing".

I'll organize my thoughts and give it a try over the holidays.

Thanks for the reply!

Masa