Minyus / causallift

CausalLift: Python package for causality-based Uplift Modeling in real-world business
https://causallift.readthedocs.io/

Handle imbalance #5

Closed jroessler closed 4 years ago

jroessler commented 4 years ago

Hi,

is it possible to consider instance weights during training to handle imbalances?

XGBClassifier has a sample_weight parameter for fitting the model, which accepts a weight for each instance. How can I trigger it? I tried to include the parameter in the uplift_model_params dictionary when creating a CausalLift instance, but it did not work.

Thanks in advance

Minyus commented 4 years ago

If enable_ipw is set to True, the reciprocal of the propensity column is used as sample_weight.

You can see the code at: https://github.com/Minyus/causallift/blob/develop/src/causallift/nodes/model_for_each.py

To use sample_weight, you can compute the reciprocal of your sample weights and specify the parameters like this:

CausalLift(train_df, test_df, enable_ipw=True, col_propensity=<column of reciprocal of sample weight here>)
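A minimal sketch of this workaround with pandas (the column names and values here are illustrative, not from the library):

```python
import pandas as pd

# Toy training data with the per-instance sample weight you want to apply
# (values are illustrative).
train_df = pd.DataFrame({
    "Feature1": [1.1, 1.2, 1.3],
    "Treatment": [1, 0, 1],
    "Outcome": [0, 0, 1],
    "SampleWeight": [1.0, 1.0, 2.0],
})

# CausalLift uses 1/propensity as sample_weight when enable_ipw=True,
# so storing the reciprocal of the desired weight as the propensity
# recovers the desired weight: 1 / (1 / w) == w.
train_df["Propensity"] = 1.0 / train_df["SampleWeight"]

# cl = CausalLift(train_df, test_df, enable_ipw=True, col_propensity="Propensity")
```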

Jami1141 commented 4 years ago

I want to use a variable for scale_pos_weight instead of giving a fixed number, for example weight = number of positives / number of negatives for the train data. I tried scale_pos_weight=weight, but it is not accepted; it only accepts a number, if I am not wrong. I need to pass a variable like weight, since I want to automate the model later. Since I do not calculate propensity, I cannot use your explanation above. I am modeling A/B test data with CausalLift and will later use the model to predict on new data.

Thanks in advance for your help.

Minyus commented 4 years ago

I added explanations in the following sections of README.md:

https://github.com/Minyus/causallift#how-causallift-works
https://github.com/Minyus/causallift#how-to-run-inferrence-prediction-of-cate-for-new-data-with-treatment-and-outcome-unknown

The enable_ipw flag is used only during training, so you can use the workaround I suggested in my previous post even if your train_df is from A/B test data.

Jami1141 commented 4 years ago

Thanks for adding the explanation! All is clear, except that I still did not get whether I can use a variable like weight for scale_pos_weight. I need it only because I do not want to put a fixed number like 10 for scale_pos_weight; instead, I want to pass a variable like weight (weight = number of positives / number of negatives) for the train data. Also, what is the reciprocal of propensity? Can I just use the sample_weight parameter in XGBoost and assign a variable to it? And if I want to predict for new data, should all XGBoost parameters, including scale_pos_weight, be the same as in training?

Minyus commented 4 years ago

The reciprocal of a number x is 1/x; the reciprocal of 5 is 1/5 = 0.2, for example. IPW (Inverse Probability Weighting) is a methodology that uses the reciprocal of the propensity as the sample weight.
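For instance, IPW weights can be computed from a propensity column like this (a minimal sketch with toy propensity values, not code from CausalLift itself):

```python
import pandas as pd

# Toy propensity scores: the estimated probability of receiving treatment.
propensity = pd.Series([0.2, 0.5, 0.8])

# IPW uses the reciprocal of the propensity as the sample weight,
# so instances with a low probability of their assignment get larger weights.
ipw_weight = 1.0 / propensity
print(ipw_weight.tolist())  # [5.0, 2.0, 1.25]
```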

I released CausalLift v1.0.1. With this version, you no longer need the workaround suggested earlier. You should now be able to write code like this:

CausalLift(train_df, test_df, enable_ipw=False, enable_weighting=True, col_weight=<column name of sample weight in train_df here; "Weight" by default>)

Regarding prediction for new data, please see: https://github.com/Minyus/causallift#how-to-run-inferrence-prediction-of-cate-for-new-data-with-treatment-and-outcome-unknown

Since enable_weighting is used only for training, test_df does not need a Weight column.

Jami1141 commented 4 years ago

What should I put inside of col_weight : col_weight=<column name of sample weight in train_df here, "Weight" in default> in my case?

Thanks

Minyus commented 4 years ago

If you want to use sample weighting, add the sample weights in a "Weight" column in train_df as shown in the example below, and write:

CausalLift(train_df, test_df, enable_ipw=False, enable_weighting=True, col_weight="Weight")

Feature1 Feature2 Treatment Outcome Weight
1.1 2.1 1 0 1.0
1.2 2.2 0 0 1.0
1.3 2.3 1 1 2.0
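The example table can be constructed with pandas like this (a sketch matching the values above):

```python
import pandas as pd

# train_df with a "Weight" column holding per-instance sample weights,
# matching the example table above.
train_df = pd.DataFrame({
    "Feature1": [1.1, 1.2, 1.3],
    "Feature2": [2.1, 2.2, 2.3],
    "Treatment": [1, 0, 1],
    "Outcome": [0, 0, 1],
    "Weight": [1.0, 1.0, 2.0],
})

# cl = CausalLift(train_df, test_df, enable_ipw=False,
#                 enable_weighting=True, col_weight="Weight")
```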

Jami1141 commented 4 years ago

Apparently, the sample weighting you use is for weighting the treatment. What I meant is scale_pos_weight, which is used for dealing with imbalanced data. Below you can see that I put a number for scale_pos_weight, but I need to assign a variable to it, like weight, where weight = number of positives / number of negatives.

uplift_model_params = {
    'cv': 3,
    'estimator': 'xgboost.XGBClassifier',
    'n_jobs': -1,
    'param_grid': {
        'base_score': [0.5], 'booster': ['gbtree'], 'colsample_bylevel': [1],
        'colsample_bytree': [1], 'gamma': [0], 'learning_rate': [0.1],
        'max_delta_step': [0], 'max_depth': [3], 'min_child_weight': [1],
        'missing': [None], 'n_estimators': [100], 'n_jobs': [-1],
        'nthread': [None], 'objective': ['binary:logistic'], 'random_state': [0],
        'reg_alpha': [0], 'reg_lambda': [1], 'scale_pos_weight': [47.09],
        'subsample': [1], 'verbose': [0],
    },
    'return_train_score': False,
    'scoring': None,
    'search_cv': 'sklearn.model_selection.GridSearchCV',
}

Minyus commented 4 years ago

It seems you just need to compute scale_pos_weight from train_df and update uplift_model_params dict like this:

n_positives = len(train_df.query("Outcome == 1"))
print("number of positives: ", n_positives)

n_negatives = len(train_df.query("Outcome == 0"))
print("number of negatives: ", n_negatives)

scale_pos_weight = n_negatives / n_positives
print("scale_pos_weight: ", scale_pos_weight)

# Reference: https://xgboost.readthedocs.io/en/latest/parameter.html

uplift_model_params = {
    'cv': 3,
    'estimator': 'xgboost.XGBClassifier',
    'n_jobs': -1,
    'return_train_score': False,
    'scoring': None,
    'search_cv': 'sklearn.model_selection.GridSearchCV',
}
param_grid = {
    'base_score': [0.5], 'booster': ['gbtree'], 'colsample_bylevel': [1],
    'colsample_bytree': [1], 'gamma': [0], 'learning_rate': [0.1],
    'max_delta_step': [0], 'max_depth': [3], 'min_child_weight': [1],
    'missing': [None], 'n_estimators': [100], 'n_jobs': [-1],
    'nthread': [None], 'objective': ['binary:logistic'], 'random_state': [0],
    'reg_alpha': [0], 'reg_lambda': [1], 'subsample': [1], 'verbose': [0],
}
param_grid.update({'scale_pos_weight': [scale_pos_weight]})
uplift_model_params.update({'param_grid': param_grid})
cl = CausalLift(train_df, test_df, enable_ipw=False, verbose=3, uplift_model_params=uplift_model_params)

Jami1141 commented 4 years ago

Thank you so much. Now it is very clear to me!