d2cml-ai / csdid

CSDID
https://d2cml-ai.github.io/csdid/index.html

IndexError while using ATTgt package #41

Open shalakawani opened 1 month ago

Hi,

I am trying to understand the updated package and am running it on simulated data as follows:

out = ATTgt(yname='outcome',
           gname='treatment_month',
           idname='seller_id',
           tname='month',
           allow_unbalanced_panel=True,
           xformla=f'outcome~1',
           control_group='never_treated',
           data=panel_data.reset_index()
           ).fit(est_method='dr')

I am getting the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[8], line 9
      1 out = ATTgt(yname='outcome',
      2            gname='treatment_month',
      3            idname='seller_id',
      4            tname='month',
      5            allow_unbalanced_panel=True,
      6            xformla=f'outcome~1',
      7            control_group='never_treated',
      8            data=panel_data.reset_index()
----> 9            ).fit(est_method='dr')

File ~\AppData\Roaming\Python\Python311\site-packages\csdid\att_gt.py:39, in ATTgt.fit(self, est_method, base_period, bstrap)
     36 def fit(self, est_method = 'dr', base_period = 'varying', bstrap = True):
     37   # print(self.dp)
     38   dp = self.dp
---> 39   result, inffunc = compute_att_gt(dp)
     40   att = result['att']
     41   crit_val, se, V = np.zeros(len(att)), np.zeros(len(att)), np.zeros(len(att))

File ~\AppData\Roaming\Python\Python311\site-packages\csdid\attgt_fnc\compute_att_gt.py:83, in compute_att_gt(dp, est_method, base_period)
     81 n1 = data[gname] == 0
     82 n2 = (data[gname] > (tlist[np.max([t_i, pret]) + tfac]) + anticipation)
---> 83 n3 = np.where(data[gname] != glist[g], True, False)
     84 row_eval = n1 | n2 & n3
     85 data = data.assign(C = 1 * row_eval)

IndexError: index 15 is out of bounds for axis 0 with size 4
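
The numbers in the message line up with the simulated data: there are four treatment cohorts (15, 18, 21, 24), and 15 is the first cohort's treatment month. Below is a minimal NumPy sketch of how such a message can arise, assuming (I have not confirmed this in the csdid source) that a cohort value ends up being used where a position is expected. Here `glist` is a stand-in array and `lookup` is a hypothetical helper, not the library's actual objects:

```python
import numpy as np

# Stand-in for the list of treatment cohorts (an assumption, not csdid's actual object).
glist = np.array([15, 18, 21, 24])

def lookup(groups, g):
    """Return groups[g], or the IndexError message if g is out of range."""
    try:
        return groups[g]
    except IndexError as err:
        return str(err)

# Indexing by position works; indexing by the cohort value itself does not.
by_position = lookup(glist, 0)
by_value = lookup(glist, 15)
print(by_position)
print(by_value)
```

This only reproduces the wording of the error; it does not show where in `compute_att_gt` the index is actually computed.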

I tried changing the control group definition and other parameters, but still got the same error.

Here is the code I used to generate the simulated data:

import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(51)

# Define parameters
num_months = 36
num_sellers = 100
mean_outcome = 10
noise_level = 0.5
treatment_effect = 2
treatment_noise_level = 0.2
treatment_months = [15, 18, 21, 24]
cohort_sizes = [15, 15, 15, 15]

# Initialize data structure
data = []

# Generate seller data
for seller_id in range(1, num_sellers + 1):
    # Randomly assign join month between 1 and 12
    join_month = np.random.randint(1, 13)

    # Randomly decide if seller leaves between month 24 and 36
    leave_month = np.random.choice([0] + list(range(24, 37)), p=[0.75] + [0.25 / 13] * 13)

    # Default outcome with noise
    outcome = mean_outcome + np.random.normal(0, noise_level, num_months)

    # Treatment assignment
    if seller_id <= 40:
        treatment_status = 0
        treatment_month = np.nan
    else:
        cumulative_sizes = np.cumsum(cohort_sizes)
        if seller_id <= 40 + cumulative_sizes[0]:
            cohort = 0
        elif seller_id <= 40 + cumulative_sizes[1]:
            cohort = 1
        elif seller_id <= 40 + cumulative_sizes[2]:
            cohort = 2
        else:
            cohort = 3

        treatment_month = treatment_months[cohort]
        treatment_status = 1

        # Apply the treatment effect (plus uniform noise) from the treatment month onward
        n_treated = num_months - treatment_month + 1
        outcome[treatment_month - 1:] += treatment_effect + np.random.uniform(
            -treatment_noise_level, treatment_noise_level, n_treated)

    # Create panel data for each month
    for month in range(1, num_months + 1):
        if month >= join_month and (leave_month == 0 or month < leave_month):
            data.append([seller_id, month, outcome[month - 1], treatment_status, treatment_month])

# Create DataFrame
columns = ['seller_id', 'month', 'outcome', 'treatment_status', 'treatment_month']
panel_data = pd.DataFrame(data, columns=columns)
panel_data['seller_id'] = panel_data['seller_id'].astype(str)
panel_data = panel_data.set_index(keys=['seller_id', 'month'])

panel_data = panel_data.reset_index()
panel_data = panel_data.sort_values(by=['seller_id', 'month'], ascending=False)
panel_data['outcome12'] = panel_data.groupby('seller_id')['outcome'].transform(lambda x: x.rolling(12).sum())
panel_data['outcome12'] = panel_data['outcome12'].fillna(panel_data.groupby('seller_id')['outcome12'].transform('mean'))
panel_data = panel_data.set_index(keys=['seller_id', 'month'])

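# Specifically for csdid: integer ids, and 0 as the cohort of never-treated sellers
panel_data = panel_data.reset_index()
panel_data['seller_id'] = panel_data['seller_id'].astype('int64')
# Assign the result back instead of using inplace=True on a column selection
panel_data['treatment_month'] = panel_data['treatment_month'].fillna(0)
panel_data['treatment_month'] = panel_data['treatment_month'].astype('int64')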

I also tried the code with the example data provided in the repository, and it works fine. My simulated data looks just like the data you provided. I would really appreciate your help debugging this issue.
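
To make the comparison between the two datasets more concrete, here is a rough sketch of the structural checks I have in mind. `summarize_panel` is a hypothetical helper written for this issue, not part of csdid, and the toy frame only imitates the column names of my simulated data:

```python
import pandas as pd

def summarize_panel(df, idname, tname, gname):
    """Summarize the structure of a long-format panel (hypothetical helper)."""
    periods_per_unit = df.groupby(idname)[tname].nunique()
    # One cohort value per unit; more than one would mean gname varies over time.
    cohorts_per_unit = df.groupby(idname)[gname].nunique(dropna=False)
    return {
        "n_units": int(df[idname].nunique()),
        "n_periods": int(df[tname].nunique()),
        "cohorts": sorted(df[gname].unique().tolist()),
        "balanced": bool((periods_per_unit == df[tname].nunique()).all()),
        "cohort_constant_within_unit": bool((cohorts_per_unit == 1).all()),
        "dtypes": df[[idname, tname, gname]].dtypes.astype(str).to_dict(),
    }

# Toy panel: seller 2 is missing month 1, so the panel is unbalanced.
toy = pd.DataFrame({
    "seller_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "month": [1, 2, 3, 2, 3, 1, 2, 3],
    "treatment_month": [0, 0, 0, 2, 2, 3, 3, 3],
})
summary = summarize_panel(toy, "seller_id", "month", "treatment_month")
print(summary)
```

Running the same function on `panel_data` and on the repository's example data would show whether they differ in balance, cohort coding, or column dtypes.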

Thank you.