Subtle bug with pathological column names

WillianFuks / tfcausalimpact

Python Causal Impact Implementation Based on Google's R Package. Built using TensorFlow Probability.

Apache License 2.0

600 stars 72 forks source link

Subtle bug with pathological column names #17

Closed felixlawrence closed 3 years ago

felixlawrence commented 3 years ago

If data is a DataFrame that contains a column labelled 0, and this is not the first column, then this is implicitly a .loc[0] instead of a .iloc[0], meaning that the wrong mu and sigma get stored, leading to incorrect results after unstandardizing.

Since mu and sig could either be Series or ndarray, one solution is to change this line to

    mu_sig = (np.array(mu[0]), np.array(sig[0]))

https://github.com/WillianFuks/tfcausalimpact/blob/a4bf1e0b010c4111b359c0b158dd3ffec3c438ff/causalimpact/data.py#L192

This might sound like an unlikely scenario, but it's what you encounter when you join your y data to a separate control DataFrame that has default column names...

WillianFuks commented 3 years ago

Hi @felixlawrence ,

You're right x.x! Thanks for debugging that out (this scenario didn't even cross my mind during development lol).

Even the proposed solution mu_sig = (np.array(mu[0]), np.array(sig[0])) won't work because the operation mu[0] already retrieves the wrong value.

I'm considering on changing the whole indexing steps to iloc altogether (not sure if this approach is safe though. I'll add some unit tests to verify here).

Hopefully I can deploy a fixed version soon.

Thanks!

felixlawrence commented 3 years ago

Argh! Typo! I meant np.array(mu)[0] .

mu.iloc[0] will cause an error for ndarray mu, won't it? Hence my conversion.

On May 20, 2021, at 5:27 PM, willfuks @.***> wrote:

Hi @felixlawrence ,

You're right x.x! Thanks for debugging that out (this scenario didn't even cross my mind during development lol).

Even the proposed solution mu_sig = (np.array(mu[0]), np.array(sig[0])) won't work because the operation mu[0] already retrieves the wrong value.

I'm considering on changing the whole indexing steps to iloc altogether (not sure if this approach is safe though. I'll add some unit tests to verify here).

Hopefully I can deploy a fixed version soon.

Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

WillianFuks commented 3 years ago

standardize signature expects pandas dataframes as input which means mu and sig will always be a dataframe as well.

I'm deploying the fix now, it seems to be working as expected. I'll let you know when I merge to master.

WillianFuks commented 3 years ago

Hi @felixlawrence ,

I just published the new code, could you please confirm it's working as expected? Just run pip install tfcausalimpact==0.0.5 and the new code should already be available.

Best,

(I'll close the issue when you confirm everything is fine. It closed automatically as I referenced it in the PR)

felixlawrence commented 3 years ago

Confirmed working in 0.0.5. Thanks for such a speedy fix!

WillianFuks commented 3 years ago

Cool! If you find any other issues please let me know.

Best Regards,

Will