almost-matching-exactly / DAME-FLAME-Python-Package

A Python Package providing two algorithms, DAME and FLAME, for fast and interpretable treatment-control matches of categorical data
https://almost-matching-exactly.github.io/DAME-FLAME-Python-Package/
MIT License

"No object to concatenate error" #44

Open nehargupta opened 2 years ago

nehargupta commented 2 years ago

When running with the adaptive_weights='decisionTreeCV' parameter, the error 'No objects to concatenate' sometimes appears in late iterations. Users who see this error can apparently avoid it by setting the early_stop_iterations parameter or by using a learning method other than decision trees, and possibly also by binarizing columns before running the algorithm.

supatuffpinkpuff commented 1 year ago

I'm encountering this issue, but I think I've determined a cause.

It clearly has to do with line 82 of flame_dame_helpers.py, in find_pe_for_covar_set:

    binarized_df = pd.get_dummies(x_treated.loc[:, non_bool_cols].astype(str))

The error message comes from pd.get_dummies, which suggests that pd.get_dummies is being passed an empty dataframe. As for why, I think it depends on how many of the columns used for analysis are boolean versus non-boolean. The code iteratively removes columns, potentially including the non-boolean ones, so depending on the dataset there may be an iteration with no non-boolean columns left. In that case pd.get_dummies is passed an empty dataframe, which leads to this error.
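A minimal sketch of that failure mode, using only pandas. The frame and the empty non_bool_cols list are made up for illustration; only the pd.get_dummies(...) expression is taken from the line quoted above. Whether the call raises may depend on the pandas version, so the sketch catches the exception rather than assuming it.

```python
import pandas as pd

# Made-up treated covariates in which every remaining column is boolean (0/1).
x_treated = pd.DataFrame({"flag_a": [1, 0, 1], "flag_b": [0, 0, 1]})

# Assumption: nothing is left to classify as non-boolean at this iteration.
non_bool_cols = []

# Selecting zero columns keeps the rows but yields a frame with no columns.
selection = x_treated.loc[:, non_bool_cols].astype(str)
print(selection.shape)

try:
    binarized_df = pd.get_dummies(selection)
    outcome = "returned a frame of shape {}".format(binarized_df.shape)
except ValueError as err:
    # This is the branch that would match the traceback in this issue.
    outcome = "raised ValueError: {}".format(err)
print(outcome)
```

If the second print reports a ValueError, the installed pandas behaves like the one in the traceback from this thread.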

I believe a fix would involve adding an additional check before running pd.get_dummies to make sure that it's not being passed an empty dataframe. Perhaps adding

if len(non_bool_cols) > 0:

before each call of pd.get_dummies would suffice? Will do some more testing tomorrow.
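A rough sketch of what such a guard could look like. The helper name safe_get_dummies and the sample frame are hypothetical and not part of dame_flame; returning an empty frame with the original index is one possible choice, assumed here so that later column-wise joins would still line up.

```python
import pandas as pd


def safe_get_dummies(frame, non_bool_cols):
    """Hypothetical helper: one-hot encode only when there is something to encode."""
    if len(non_bool_cols) == 0:
        # Nothing to binarize, so skip pd.get_dummies entirely and
        # return a zero-column frame that keeps the row index.
        return pd.DataFrame(index=frame.index)
    return pd.get_dummies(frame.loc[:, non_bool_cols].astype(str))


# Made-up data: one non-boolean column and one boolean column.
x_treated = pd.DataFrame({"nonbool": [100, 10, 50], "flag": [0, 1, 1]})

encoded = safe_get_dummies(x_treated, ["nonbool"])  # normal path
empty = safe_get_dummies(x_treated, [])             # guarded path

print(list(encoded.columns))
print(empty.shape)
```

The guarded path never calls pd.get_dummies, so it does not depend on how a given pandas version treats empty input.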

I made some fake datasets as test cases, and found the following:

The dataset below has only boolean integer columns, and fails with this error on iteration 1.

    data = pd.DataFrame({'bool_integer_categories1':[1, 1, 0, 0, 1],
                         'treatment':[0, 0, 0, 1, 1],
                         'outcome':[5, 1, 2, 3, 4],
                         'bool_integer_categories2':[1, 0, 1, 0, 0]})

The dataset below has only non-boolean integer columns, and succeeds after two iterations.

    data = pd.DataFrame({'nonbool_integer_categories1':[100, 100, 10, 10, 1],
                         'treatment':[0, 0, 0, 1, 1],
                         'outcome':[5, 1, 2, 3, 4],
                         'nonbool_integer_categories2':[1, 2, 3, 1, 2]})

The dataset below has one non-boolean integer column and one boolean integer column, and fails on iteration 1.

    data = pd.DataFrame({'nonbool_integer_categories':[100, 10, 10, 50, 50],
                         'treatment':[0, 0, 0, 1, 1],
                         'outcome':[5, 1, 2, 3, 4],
                         'boolean_integer_column':[0, 0, 1, 1, 0]})

Adding another boolean integer column still fails on iteration 1.

    data = pd.DataFrame({'nonbool_integer_categories':[100, 10, 10, 50, 50],
                         'treatment':[0, 0, 0, 1, 1],
                         'outcome':[5, 1, 2, 3, 4],
                         'boolean_integer_column':[0, 0, 1, 1, 0],
                         'boolean_int_col_2':[1, 0, 1, 0, 1]})

However, adding a second non-boolean column instead still fails, but the run survives one more iteration and fails on iteration 2.

    data = pd.DataFrame({'nonbool_integer_col_1':[100, 10, 10, 50, 50],
                         'nonbool_integer_col_2':[1, 1, 2, 2, 3],
                         'treatment':[0, 0, 0, 1, 1],
                         'outcome':[5, 1, 2, 3, 4],
                         'boolean_integer_column':[0, 0, 1, 1, 0]})
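One way to read these five results together is sketched below. It rests on two assumptions that are not confirmed from the dame_flame source: that each round scores every covariate subset with one covariate left out, and that a column counts as boolean when it holds only 0s and 1s. The helper names are hypothetical, and dropping the last non-boolean covariate each round is a simplification chosen for illustration, not the package's actual selection rule.

```python
def is_boolean(values):
    # Assumption: a column counts as boolean when it holds only 0s and 1s.
    return set(values) <= {0, 1}


def first_failing_round(covariates):
    """Return the first round in which some leave-one-out subset contains
    no non-boolean column, or None if that never happens."""
    remaining = dict(covariates)
    round_number = 1
    while len(remaining) > 1:
        for dropped in remaining:
            kept = [name for name in remaining if name != dropped]
            if all(is_boolean(remaining[name]) for name in kept):
                return round_number
        # No subset failed: drop one non-boolean covariate (simplification).
        non_boolean = [n for n in remaining if not is_boolean(remaining[n])]
        del remaining[non_boolean[-1]]
        round_number += 1
    return None


# Covariates from the five datasets above (treatment and outcome omitted).
cases = {
    "only boolean": {
        "bool_integer_categories1": [1, 1, 0, 0, 1],
        "bool_integer_categories2": [1, 0, 1, 0, 0]},
    "only non-boolean": {
        "nonbool_integer_categories1": [100, 100, 10, 10, 1],
        "nonbool_integer_categories2": [1, 2, 3, 1, 2]},
    "one of each": {
        "nonbool_integer_categories": [100, 10, 10, 50, 50],
        "boolean_integer_column": [0, 0, 1, 1, 0]},
    "one non-boolean, two boolean": {
        "nonbool_integer_categories": [100, 10, 10, 50, 50],
        "boolean_integer_column": [0, 0, 1, 1, 0],
        "boolean_int_col_2": [1, 0, 1, 0, 1]},
    "two non-boolean, one boolean": {
        "nonbool_integer_col_1": [100, 10, 10, 50, 50],
        "nonbool_integer_col_2": [1, 1, 2, 2, 3],
        "boolean_integer_column": [0, 0, 1, 1, 0]},
}

results = {name: first_failing_round(cols) for name, cols in cases.items()}
for name, failing_round in results.items():
    print(name, "->", failing_round)
```

Under these assumptions the sketch flags rounds 1, none, 1, 1, and 2 for the five datasets, which lines up with the pattern reported above. The round counter only loosely corresponds to the iteration numbers in the package's log output.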

Code, logs, and traceback from the last example:

    import dame_flame
    import matplotlib.pyplot as plt
    import pandas as pd


    def test_mixed_bool_more_cols_1():
        # Two non-boolean covariates and one boolean (0/1) covariate
        data = pd.DataFrame({'nonbool_integer_col_1': [100, 10, 10, 50, 50],
                             'nonbool_integer_col_2': [1, 1, 2, 2, 3],
                             'treatment': [0, 0, 0, 1, 1],
                             'outcome': [5, 1, 2, 3, 4],
                             'boolean_integer_column': [0, 0, 1, 1, 0]})

        model_flame = dame_flame.matching.FLAME(repeats=False, verbose=3,
                                                adaptive_weights='decisiontree')
        model_flame.fit(holdout_data=data, treatment_column_name='treatment',
                        outcome_column_name='outcome')
        # The traceback below is raised from this predict call
        result_flame = model_flame.predict(data)

        print('ATE:')
        print(dame_flame.utils.post_processing.ATE(model_flame))

        # Visualizing CATE of matched groups from FLAME.
        group_size_treated = []
        group_size_overall = []
        cate_of_group = []
        for group in model_flame.units_per_group:
            # Find the number of treated units in this matched group
            df_mmg = data.loc[group]
            treated = df_mmg.loc[df_mmg["treatment"] == 1]

            group_size_treated.append(len(treated))
            group_size_overall.append(len(group))

            cate_of_group.append(
                dame_flame.utils.post_processing.CATE(model_flame, group[0]))

        plt.scatter(group_size_treated, cate_of_group, alpha=0.25, edgecolors='b')
        plt.axhline(y=0.0, color='r', linestyle='-')
        plt.xlabel('Number of Treatment units in group', fontsize=12)
        plt.ylabel('Estimated Treatment Effect of Group', fontsize=12)
        plt.title("Visualizing CATE of matched groups from FLAME", fontsize=14)
        plt.savefig('interpretability.png')

Logs

    Iteration number: 1
    Number of matched groups formed in total: 0
    Unmatched treated units: 2 out of a total of 2 treated units
    Unmatched control units: 3 out of a total of 3 control units
    Predictive error of covariates chosen this iteration: 0
    Number of matches made in this iteration: 0
    Number of matches made so far: 0
    In this iteration, the covariates dropped are: set()
    Iteration number: 2
    Number of matched groups formed in total: 0
    Unmatched treated units: 2 out of a total of 2 treated units
    Unmatched control units: 3 out of a total of 3 control units
    Predictive error of covariates chosen this iteration: 0.0
    Number of matches made in this iteration: 0
    Number of matches made so far: 0
    In this iteration, the covariates dropped are: nonbool_integer_col_2

Error

    Traceback (most recent call last):
      File "test_mixed_bool_more_cols_1", line 1, in <module>
      File "test_mixed_bool_more_cols_1", line 12, in test_mixed_bool_more_cols_1
      File "/opt/conda/lib/python3.7/site-packages/dame_flame/matching.py", line 219, in predict
        pre_dame, C)
      File "/opt/conda/lib/python3.7/site-packages/dame_flame/matching.py", line 416, in _FLAME
        want_bf, mice_on_hold, early_stops, pre_dame, C)
      File "/opt/conda/lib/python3.7/site-packages/dame_flame/flame_algorithm.py", line 210, in flame_generic
        df_unmatched, return_matches, C, weight_array)
      File "/opt/conda/lib/python3.7/site-packages/dame_flame/flame_algorithm.py", line 78, in decide_drop
        adaptive_weights, alpha_given)
      File "/opt/conda/lib/python3.7/site-packages/dame_flame/flame_dame_helpers.py", line 82, in find_pe_for_covar_set
        binarized_df = pd.get_dummies(x_treated.loc[:, non_bool_cols].astype(str))
      File "/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/reshape.py", line 903, in get_dummies
        result = concat(with_dummies, axis=1)
      File "/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 295, in concat
        sort=sort,
      File "/opt/conda/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 342, in __init__
        raise ValueError("No objects to concatenate")
    ValueError: No objects to concatenate

nehargupta commented 1 year ago

Hi @supatuffpinkpuff I just saw your comment. Thanks for using this package and thoroughly debugging it!

If I'm not mistaken, this issue should be resolved in my local branch, but the fix somehow hasn't made it up to master yet. You can see my branch here: https://github.com/nehargupta/DAME-FLAME-Python-Package, and my bug fix for this, assuming it's correct, is this commit: https://github.com/almost-matching-exactly/DAME-FLAME-Python-Package/commit/e0887078298262b7c51d56e82b9440131c8db0c7

Please feel free to use my branch if that suffices for your needs, and definitely let me know if you're still seeing this issue. I hope to have this version control problem sorted out and the fix merged into master soon, so we should be able to close this shortly. Please let me know if you think the issue persists or if I'm mistaken somehow.