WillianFuks / tfcausalimpact

Python Causal Impact Implementation Based on Google's R Package. Built using TensorFlow Probability.
Apache License 2.0

[Request] Would an A/A test feature be appropriate? #35

Open MasaAsami opened 2 years ago

MasaAsami commented 2 years ago

Thank you for a very nice package.

I've recently been trying to switch over from pycausalimpact, and I don't fully understand this module yet, so today I'll just share an idea.

[Idea] How about adding a feature for a so-called A/A test?

[Contents] Run placebo analyses on pseudo post-periods taken from the period before the true intervention:

[The following is just a rough sketch of the idea]

import numpy as np
import pandas as pd
import causalimpact
import matplotlib.pyplot as plt
from tqdm import tqdm

# `df`, `PRE_PERIOD` and `POST_PERIOD` are assumed to already exist: the
# exchange-rate example data and the true pre/post intervention periods.
impact_model = causalimpact.CausalImpact(
    df[["DEXUSUK", "DEXUSEU"]],
    PRE_PERIOD,
    POST_PERIOD,
    nseasons=[{"period": 7, "harmonics": 2}],
)

list_date = list(df[: PRE_PERIOD[1]].index)
post_n = 21
test_n = 50
test_result = []
minimum_n = 50
y_col = "DEXUSUK"

# True intervention data will also be included for comparison.
postterm_estimated = pd.concat([impact_model.inferences, df[y_col]], axis=1).dropna()
postterm_estimated[f"aatest_no"] = "real_test"
postterm_estimated[f"pseudo_intervention"] = POST_PERIOD[0]
postterm_estimated["p_value"] = impact_model.p_value
postterm_estimated["train_datasize"] = len(df[: PRE_PERIOD[1]])
test_result.append(postterm_estimated)

for i in tqdm(range(1, test_n + 1)):
    if post_n * i + minimum_n >= len(list_date):
        # Minimum sample size required for learning
        break

    temp_pre_period = [list_date[0], list_date[-post_n * i]]
    temp_post_period = [list_date[-post_n * i + 1], list_date[-post_n * i + post_n - 1]]

    temp_df = df[: temp_post_period[1]][["DEXUSUK", "DEXUSEU"]]

    # TODO: this refits the model from scratch on every iteration, which is very
    # inefficient (each previous fit is thrown away); I'm considering switching to
    # sequential learning instead.
    temp_impact_model = causalimpact.CausalImpact(
        temp_df,
        temp_pre_period,
        temp_post_period,
        nseasons=[{"period": 7, "harmonics": 2}],
    )

    postterm_estimated = pd.concat([temp_impact_model.inferences, temp_df[y_col]], axis=1).dropna()
    postterm_estimated[f"aatest_no"] = f"aatest_{i}"
    postterm_estimated[f"pseudo_intervention"] = temp_post_period[0]
    postterm_estimated["p_value"] = temp_impact_model.p_value
    postterm_estimated["train_datasize"] = len(df[: temp_pre_period[1]])

    test_result.append(postterm_estimated)

aatest_df = pd.concat(test_result).sort_index()
aatest_table = aatest_df.groupby("aatest_no").agg(
    {
        "point_effects": np.mean,
        "p_value": "max",
        "pseudo_intervention": "max",
        "preds": np.count_nonzero,
        "train_datasize": "max",
    }
)
aatest_table.columns = [
    "Absolute effect",
    "Posterior tail-area probability (p-value)",
    "pseudo_intervention",
    "simulation_termsize",
    "train_termsize",
]

[screenshot: resulting A/A test summary table]

# plot code
intervention_idx = aatest_df.index.get_loc(POST_PERIOD[0])

pseudo_idx = [
    aatest_df.index.get_loc(_date)
    for _date in aatest_table[aatest_table.index != "real_test"][
        "pseudo_intervention"
    ].values
]
# The earliest pseudo intervention sits at position 0 of the index; drop it so
# that `index[ind_ - 1]` below doesn't wrap around to the last date.
if 0 in pseudo_idx:
    pseudo_idx.remove(0)

fig = plt.figure(figsize=(15, 12))
ax = plt.subplot(3, 1, 1)
ax.plot(aatest_df[y_col], "k", label="y")
ax.plot(aatest_df["preds"], "b--", label="Predicted")
ax.axvline(aatest_df.index[intervention_idx - 1], c="r", linestyle="--")
for ind_ in pseudo_idx:
    ax.axvline(aatest_df.index[ind_ - 1], c="k", linestyle="--", alpha=0.25)
ax.fill_between(
    aatest_df.index,
    aatest_df["preds_lower"],
    aatest_df["preds_upper"],
    facecolor="blue",
    interpolate=True,
    alpha=0.25,
)
ax.grid(True, linestyle="--")
ax.legend()
plt.setp(ax.get_xticklabels(), visible=False)

ax = plt.subplot(3, 1, 2, sharex=ax)
ax.plot(aatest_df["point_effects"], "b--", label="Point Effects")
ax.axvline(aatest_df.index[intervention_idx - 1], c="r", linestyle="--")
for ind_ in pseudo_idx:
    ax.axvline(aatest_df.index[ind_ - 1], c="k", linestyle="--", alpha=0.25)
ax.fill_between(
    aatest_df["point_effects"].index,
    aatest_df["point_effects_lower"],
    aatest_df["point_effects_upper"],
    facecolor="blue",
    interpolate=True,
    alpha=0.25,
)
ax.axhline(y=0, color="k", linestyle="--")
ax.grid(True, linestyle="--")
ax.legend()
plt.setp(ax.get_xticklabels(), visible=False)

ax = plt.subplot(3, 1, 3, sharex=ax)
ax.plot(aatest_df["post_cum_effects"], "b--", label="Cumulative Effect")
ax.axvline(aatest_df.index[intervention_idx - 1], c="r", linestyle="--")
for ind_ in pseudo_idx:
    ax.axvline(aatest_df.index[ind_ - 1], c="k", linestyle="--", alpha=0.25)
ax.fill_between(
    aatest_df["post_cum_effects"].index,
    aatest_df["post_cum_effects_lower"],
    aatest_df["post_cum_effects_upper"],
    facecolor="blue",
    interpolate=True,
    alpha=0.25,
)
ax.grid(True, linestyle="--")
ax.axhline(y=0, color="k", linestyle="--")
ax.legend()
plt.show()

[screenshot: resulting A/A test plots]

WillianFuks commented 2 years ago

Hi @MasaAsami ,

If I understood you correctly, there's already a well-known technique in this field called "back-testing". The idea is similar to what you described, but instead of comparing impacts estimated on training data against the test data, the approach is to choose a model and run several evaluations on the training data to assess its performance (using the same windowing technique you described).

You can then compare several distinct models (adding various components such as local level, trend, seasonality and so on) and choose the best one to finally run on the test data, as in the rough sketch below.
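
A purely illustrative sketch of what such a back-test based comparison could look like, reusing the rolling placebo windows from your snippet; the candidate model_args values and the scoring rule (mean absolute placebo effect, lower is better) are assumptions on my part, not a proposed API:

import numpy as np
import causalimpact

# Placebo windows carved out of the pre-intervention period, mirroring the loop above.
list_date = list(df[: PRE_PERIOD[1]].index)
post_n = 21
windows = [
    ([list_date[0], list_date[-post_n * i]],
     [list_date[-post_n * i + 1], list_date[-post_n * i + post_n - 1]])
    for i in range(1, 4)
]

# Candidate model configurations; the exact keys accepted by model_args are an
# assumption here and may need adjusting.
candidates = {
    "local_level_only": {},
    "weekly_seasonality": {"nseasons": 7, "season_duration": 1},
}

def backtest_score(data, windows, model_args):
    """Mean absolute placebo effect across windows; lower means fewer spurious effects."""
    scores = []
    for pre_period, post_period in windows:
        ci = causalimpact.CausalImpact(
            data[: post_period[1]], pre_period, post_period, model_args=model_args
        )
        scores.append(ci.inferences["point_effects"].dropna().abs().mean())
    return float(np.mean(scores))

scores = {
    name: backtest_score(df[["DEXUSUK", "DEXUSEU"]], windows, args)
    for name, args in candidates.items()
}
best_model = min(scores, key=scores.get)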

I'd happily merge a PR adding a feature like that. The only requirements I'd ask for are that it be fully unit tested and, since there will probably be performance issues, that training be run in parallel as well, roughly along the lines of the sketch below (I'm also not quite sure how the various models would be chosen, so I'm open to ideas).
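
Purely as an illustration of the parallel part (fit_window and windows are hypothetical names, and TensorFlow's behaviour under fork-based multiprocessing would need checking in practice):

from concurrent.futures import ProcessPoolExecutor

import causalimpact

def fit_window(window):
    """Fit one placebo window and return its pseudo intervention date and p-value."""
    pre_period, post_period = window
    # Assumes a fork start method so workers inherit `df` from the parent process.
    ci = causalimpact.CausalImpact(
        df[: post_period[1]][["DEXUSUK", "DEXUSEU"]],
        pre_period,
        post_period,
    )
    return post_period[0], ci.p_value

if __name__ == "__main__":
    # `windows` is the same list of (pre_period, post_period) pairs as above;
    # processes are used because each fit is CPU-bound.
    with ProcessPoolExecutor(max_workers=4) as executor:
        p_values = dict(executor.map(fit_window, windows))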

As for pycausalimpact vs tfcausalimpact, notice they are essentially the same. I still don't know what happened to the former, as I no longer have contact with Dafiti's IT team, but you can use the latter interchangeably; basic usage like the snippet below should run unchanged (the seasonal-component arguments are the main thing that may need adapting).
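
For instance, assuming the same example data and periods as above:

from causalimpact import CausalImpact

# Same constructor and attributes under both packages.
ci = CausalImpact(df[["DEXUSUK", "DEXUSEU"]], PRE_PERIOD, POST_PERIOD)
print(ci.summary())
print(ci.p_value)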

Let me know if this answers your question.

Best,

Will

MasaAsami commented 2 years ago

I see, "back-testing" may indeed be a more appropriate expression.

The sensitivity to a causal effect under this scheme obviously depends on the model we set up, so I think it ends up being essentially the same thing as "A/A testing".

I'll organize my thoughts and give it a try over the holidays.

Thanks for the reply!

Masa