alexanderquispe commented 4 months ago

From @pedrohcgs :

But I think some of the code does not match R.

I am attaching the R code above, which gives me the following results:

Group  Time  ATT(g,t)  Std. Error  [95% Pointwise Conf. Band]
    2     2   -0.0746      0.0293     -0.132    -0.0173  *

When I run the equivalent code in Python, I get this:

   Group  Time  ATT(g, t)  Post  Std. Error  [95% Pointwise Conf. Band]
0      2     2    -0.0746     1      0.0564     -0.1909     0.0416

The difference in std errors is pretty huge!

This is the Python code I used:

# Install the csdid package if it's not already installed
# !pip install csdid

# Import the necessary modules
import pandas as pd
from csdid.att_gt import ATTgt

# Load the CSV file
file_path = '/Downloads/cohort_2023.csv'
data = pd.read_csv(file_path)

# Prepare the data: encode customer_id as integer codes
data['customer_id_numeric'] = data['customer_id'].astype('category').cat.codes

# Create an instance of the ATTgt class and fit the model
out = ATTgt(yname="ln_gms", gname="first.treat", idname="customer_id_numeric",
            tname="time_period", data=data).fit(est_method='reg')

# Display the results
out.summ_attgt().summary2

alexanderquispe commented 4 months ago

Error Description

The error was in the reg_panel function and has been fixed in reg_did.py. The fix rewrites the functions to work on flattened NumPy arrays and to use np.newaxis where a column vector is needed, as NumPy recommends. The same adjustment was made for the callbacks reg_panel and reg_rc. In addition, compute_att_gt.py did not read dp['panel'] from the dp object, so the panel flag was always False and the repeated cross-section functions dr_rc and reg_did_rc were used instead.
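As a rough illustration of the kind of shape problem described here, the sketch below shows how mixing a flattened array with a column vector silently broadcasts to a matrix. This is not the code from reg_did.py; the names `weights` and `resid` are hypothetical and stand in for any pair of per-unit vectors.

```python
import numpy as np

n = 4
weights = np.arange(1.0, n + 1)   # flattened array, shape (4,)
resid = np.ones((n, 1))           # column vector, shape (4, 1)

# Mixing the two shapes broadcasts to an (n, n) matrix without any error.
bad = weights * resid

# Option 1: flatten both operands so the product stays one-dimensional.
good_flat = weights * resid.flatten()

# Option 2: make the column explicit with np.newaxis.
good_col = weights[:, np.newaxis] * resid

print(bad.shape, good_flat.shape, good_col.shape)
```

Any statistic computed on `bad` (a standard deviation, for example) is taken over n * n entries rather than n, which is one way a standard error can come out wrong while the point estimate still looks right.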

To obtain the standard error without bootstrap (bstrap=False), using the formula sd(inf_function) / sqrt(len(inf_function)) in Python, you can use:

n_len = list(map(len, inffunc))
np.std(inffunc, axis=1) / np.sqrt(n_len)

Potential Differences

Comparing R and Python, the standard deviations are not identical by default: R's sd() divides by n - 1 (sample standard deviation), whereas np.std divides by n unless ddof=1 is passed. For a 3-element array:

> sd(c(1, 3, 4))
# [1] 1.527525

np.std([1, 3, 4])
# 1.247219128924647
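The two numbers above can be reconciled with NumPy's `ddof` argument. The sketch below is only an illustration: `se_from_inffunc` is a hypothetical helper, not a csdid function, and it assumes the influence function is a 2-D array with one row per ATT(g,t), following the `axis=1` convention of the snippet earlier in this comment.

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0])

pop_sd = np.std(x)              # divides by n (ddof=0), NumPy's default
sample_sd = np.std(x, ddof=1)   # divides by n - 1, like R's sd()


def se_from_inffunc(inffunc, ddof=1):
    """Hypothetical helper: sd(row) / sqrt(len(row)) for each row."""
    inffunc = np.asarray(inffunc, dtype=float)
    n = inffunc.shape[1]
    return np.std(inffunc, axis=1, ddof=ddof) / np.sqrt(n)


# Toy influence functions (random numbers, not real estimates).
rng = np.random.default_rng(0)
toy = rng.normal(size=(2, 500))

# The two conventions differ by a factor of sqrt(n / (n - 1)).
ratio = se_from_inffunc(toy, ddof=1) / se_from_inffunc(toy, ddof=0)
print(pop_sd, sample_sd, ratio)
```

With hundreds or thousands of units the factor sqrt(n / (n - 1)) is very close to 1, so this convention can explain small discrepancies but not a standard error that is roughly twice as large.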

Result

After correcting the errors, the following output is achieved:

!pip uninstall csdid DRDID -y
!pip install git+https://github.com/d2cml-ai/DRDID
!pip install git+https://github.com/d2cml-ai/CSDID

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

lm = sm.WLS
from csdid.att_gt import ATTgt

df = pd.read_csv('../data/r_cohort.csv')

Callback

from drdid.reg_did import reg_did_panel

out = ATTgt(yname="ln_gms", gname="first.treat", idname="customer_id_num",
            tname="time_period", data=df).fit(est_method=reg_did_panel, bstrap=False)

# Display the results
out.summ_attgt().summary2

[Screenshots: Python output and R output]

PhilippBach commented 2 months ago

Hi @alexanderquispe

I noticed the same thing when cross-checking the Python website and the R package website; in the introductory example the differences are quite substantial.

Is the current Python documentation using the fixed version (from this issue), or will it be updated? I was a bit confused when I replicated the Python example. I think the example is based on a much smaller sample (the dta data frame has 320 rows) than the R example, which has 15916 rows. So maybe the differences come only from this, but it's hard to tell. Thanks!

Thanks for the great work!

alexanderquispe commented 1 month ago

Hi @PhilippBach, thanks a lot for raising this. I realized that the dataset used in the Python tutorial was only a sample, so I uploaded the correct dataset:

https://raw.githubusercontent.com/d2cml-ai/csdid/main/data/sim_data.csv

I was a bit concerned that the package was not working :)

R package

Python package - updated dataset

If you could rerun our package and confirm that everything is OK, we would appreciate it. @pedrohcgs

PhilippBach commented 1 month ago

Hi @alexanderquispe

thanks for your response! The results look virtually identical 👍

d2cml-ai / csdid

results are not the same as in R package #21
