Minyus / causallift

CausalLift: Python package for causality-based Uplift Modeling in real-world business
https://causallift.readthedocs.io/

CausalLift: Python package for Uplift Modeling in real-world business; applicable for both A/B testing and observational data

Introduction

Scenario 1: Marketing campaign/promotion targeting

Suppose you are responsible for a marketing campaign/promotion (showing an advertisement, offering a discount, making a phone call, etc.) aimed at some customers to increase revenue or prevent churn. Which one will you choose?

Strategy B is known as Uplift Modeling.

Scenario 2: Recommendation systems in E-commerce sites

Suppose you are responsible for the recommendation system at an e-commerce company. Which one will you choose?

Strategy B is known as Uplift Modeling.

Scenario 3: US presidential campaign

Suppose you are trying to help a candidate become the next US president. Which one will you choose?

Strategy B is known as Uplift Modeling, and it was used by Barack Obama's campaign in 2012. Here are some articles.

Scenario 4: Avoid death

Suppose you can receive one of the following messages from the God of Machine Learning. Which one will you choose?

Option B is analogous to Uplift Modeling.

What is Uplift Modeling?

Uplift Modeling is a Machine Learning technique to find which customers (individuals) should be targeted ("treated") and which customers should not be targeted.

Uplift Modeling is also known as persuasion modeling, incremental modeling, treatment effects modeling, true lift modeling, or net modeling.

Uplift Modeling predicts the following 4 labels:

How does Uplift Modeling work?

Uplift Modeling estimates uplift scores (a.k.a. CATE: Conditional Average Treatment Effect, or ITE: Individual Treatment Effect). The uplift score is how much the estimated conversion rate is expected to increase as a result of the campaign.

Suppose you are in charge of a marketing campaign to sell a product. If the estimated conversion rate (probability of buying the product) of a customer is 50 % if targeted and 40 % if not targeted, then the uplift score of the customer is (50 - 40) = +10 % points. Likewise, if the estimated conversion rate is 20 % if targeted and 80 % if not targeted, the uplift score is (20 - 80) = -60 % points (a negative value).

The range of uplift scores is between -100 and +100 % points (-1 and +1). It is recommended to target customers with high uplift scores and avoid customers with negative uplift scores to optimize the marketing campaign.
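The arithmetic above can be written as a minimal Python sketch. The function name `uplift_score` and the customer names are hypothetical and are not part of the CausalLift API; the conversion rates are the ones from the example above.

```python
def uplift_score(rate_if_targeted: float, rate_if_not_targeted: float) -> float:
    """Return the uplift score as a fraction between -1 and +1."""
    for rate in (rate_if_targeted, rate_if_not_targeted):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("conversion rates must be between 0 and 1")
    return rate_if_targeted - rate_if_not_targeted


# Hypothetical customers: (estimated rate if targeted, estimated rate if not targeted)
customers = {
    "customer_a": (0.50, 0.40),
    "customer_b": (0.20, 0.80),
}

scores = {name: uplift_score(t, c) for name, (t, c) in customers.items()}

# Target only customers whose estimated uplift score is positive.
targets = [name for name, score in scores.items() if score > 0]

for name, score in scores.items():
    print(f"{name}: {score * 100:+.0f} % points")
print("targets:", targets)
```

In practice you would probably target only customers above some positive threshold that reflects the cost of the campaign, rather than everyone with a score above zero.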

What are the advantages of "CausalLift" package?

Why was CausalLift developed?

In short, for use in real-world business.

CausalLift flow diagram

CausalLift internal pipeline (visualized by Kedro Viz)

Supported Python versions

Installation

Install dependencies

$ pip install "python-json-logger<=2.0.4" "kedro<=0.17.7" "scikit-learn<=0.21.3" numpy pandas easydict

Note:

Install CausalLift

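# Option 1: install the released version from PyPI
$ pip install causallift
# Option 2: install the development version from GitHub
$ pip install git+https://github.com/Minyus/causallift.git
# Option 3: clone the repository and install in development mode
$ git clone https://github.com/Minyus/causallift.git
$ cd causallift
$ python setup.py develop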

Optional:

Optional for visualization of the pipeline:

How is the data pipeline implemented by CausalLift?

Step 0: Prepare data

Prepare the following columns in 2 pandas DataFrames, train and test (validation).

Example table data

Step 1: Prepare for Uplift modeling and optionally estimate propensity scores using a supervised classification model

If train_df is from observational data (not an A/B Test), you can set enable_ipw=True so that IPW (Inverse Probability Weighting) can correct for the fact that treatment was probably assigned with a different probability (propensity score) for each individual (e.g. customer, patient, etc.).

If train_df is from an A/B Test or RCT (Randomized Controlled Trial), set enable_ipw=False to skip estimating the propensity score.
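To illustrate the idea behind IPW, here is a minimal sketch of the textbook weighting formula: a treated sample is weighted by 1 / propensity and an untreated sample by 1 / (1 - propensity). This is not necessarily how CausalLift applies the weights internally. The function `ipw_weight`, the clipping bounds, and the `Propensity` key are assumptions made for this example.

```python
def ipw_weight(treatment: int, propensity: float,
               min_propensity: float = 0.01,
               max_propensity: float = 0.99) -> float:
    """Textbook inverse probability weight for one sample.

    The propensity is clipped (hypothetical bounds) so that samples with
    extreme propensity scores do not get unbounded weights.
    """
    if treatment not in (0, 1):
        raise ValueError("treatment must be 0 or 1")
    p = min(max(propensity, min_propensity), max_propensity)
    return 1.0 / p if treatment == 1 else 1.0 / (1.0 - p)


# Hypothetical observational samples with already estimated propensity scores.
rows = [
    {"Treatment": 1, "Propensity": 0.8},  # likely to be treated, was treated
    {"Treatment": 1, "Propensity": 0.2},  # unlikely to be treated, was treated
    {"Treatment": 0, "Propensity": 0.8},  # likely to be treated, was not treated
    {"Treatment": 0, "Propensity": 0.2},  # unlikely to be treated, was not treated
]

weights = [ipw_weight(r["Treatment"], r["Propensity"]) for r in rows]
for row, weight in zip(rows, weights):
    print(row, "weight:", round(weight, 2))
```

Samples whose treatment was "surprising" given their propensity get larger weights, which makes the weighted data look more like data from a randomized experiment.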

Step 2: Estimate CATE by 2 supervised classification models

Train 2 supervised classification models (e.g. XGBoost) for treated and untreated samples independently and compute estimated CATE (Conditional Average Treatment Effect), ITE (Individual Treatment Effect), or uplift score.

This step is the Uplift Modeling consisting of 2 sub-steps:

  1. Training using train_df (Note: Treatment and Outcome are used)

  2. Prediction of CATE for train_df and test_df (Note: Neither Treatment nor Outcome is used.)
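The two sub-steps can be sketched without any dependencies as follows. This is a simplified illustration of the 2-model approach, not CausalLift's implementation: `SegmentRateModel` is a hypothetical toy stand-in for a real classifier such as XGBoost, and it simply predicts the conversion rate observed for a customer segment. The data are made up.

```python
from collections import defaultdict


class SegmentRateModel:
    """Toy stand-in for a classifier: predicts the conversion rate of a segment."""

    def fit(self, segments, outcomes):
        totals, counts = defaultdict(int), defaultdict(int)
        for segment, outcome in zip(segments, outcomes):
            totals[segment] += outcome
            counts[segment] += 1
        self.rates_ = {s: totals[s] / counts[s] for s in counts}
        self.default_ = sum(outcomes) / len(outcomes)  # fallback for unseen segments
        return self

    def predict_rate(self, segments):
        return [self.rates_.get(s, self.default_) for s in segments]


# Hypothetical training rows: (segment feature, Treatment, Outcome)
train_rows = (
    [("new", 1, y) for y in (1, 1, 0, 1)]      # treated, new customers: 75 %
    + [("new", 0, y) for y in (0, 1, 0, 0)]    # untreated, new customers: 25 %
    + [("loyal", 1, y) for y in (1, 0, 0, 0)]  # treated, loyal customers: 25 %
    + [("loyal", 0, y) for y in (1, 1, 1, 0)]  # untreated, loyal customers: 75 %
)


def fit_on(rows, treatment_value):
    """Sub-step 1: train one model on samples with the given Treatment value."""
    subset = [(seg, y) for seg, t, y in rows if t == treatment_value]
    return SegmentRateModel().fit([s for s, _ in subset], [y for _, y in subset])


model_treated = fit_on(train_rows, 1)
model_untreated = fit_on(train_rows, 0)

# Sub-step 2: predict CATE from features only (no Treatment or Outcome needed).
test_segments = ["new", "loyal"]
cate = [
    treated - untreated
    for treated, untreated in zip(
        model_treated.predict_rate(test_segments),
        model_untreated.predict_rate(test_segments),
    )
]
print(dict(zip(test_segments, cate)))
```

Because prediction uses only the features, the same two trained models can score customers whose Treatment and Outcome are not yet known.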

Step 3 [Optional] Estimate impact by following recommendation based on CATE

Estimate how much the conversion rate will increase if treatment (campaign) targets are selected as recommended by the uplift model.

You can optionally evaluate the predicted CATE for train_df and test_df (Note: CATE, Treatment and Outcome are used.)

This step is optional; you can skip it if you want only the CATE and do not find this evaluation step useful.
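As a rough illustration of how CATE, Treatment and Outcome can be combined for evaluation, the sketch below compares observed conversion rates of treated and untreated samples within the recommended and not-recommended groups. This is a simplified sanity check that assumes randomly assigned treatment; it is not the estimation procedure CausalLift uses. The helper names, the `CATE` key, and the data are assumptions made for this example.

```python
def conversion_rate(rows):
    """Observed conversion rate of a list of samples (NaN if the list is empty)."""
    return sum(r["Outcome"] for r in rows) / len(rows) if rows else float("nan")


def observed_uplift(rows):
    """Observed treated-minus-untreated conversion rate within a group."""
    treated = [r for r in rows if r["Treatment"] == 1]
    untreated = [r for r in rows if r["Treatment"] == 0]
    return conversion_rate(treated) - conversion_rate(untreated)


def make_rows(cate, treatment, outcomes):
    """Build hypothetical samples that share a predicted CATE and a Treatment."""
    return [{"CATE": cate, "Treatment": treatment, "Outcome": y} for y in outcomes]


# Hypothetical test samples with predicted CATE and observed Treatment/Outcome.
test_rows = (
    make_rows(0.3, 1, [1, 1, 0, 1])
    + make_rows(0.3, 0, [0, 1, 0, 0])
    + make_rows(-0.2, 1, [0, 0, 1, 0])
    + make_rows(-0.2, 0, [1, 1, 0, 1])
)

# Recommend treatment only where the predicted CATE is positive.
recommended = [r for r in test_rows if r["CATE"] > 0]
not_recommended = [r for r in test_rows if r["CATE"] <= 0]

uplift_recommended = observed_uplift(recommended)
uplift_not_recommended = observed_uplift(not_recommended)
print("observed uplift, recommended group:", uplift_recommended)
print("observed uplift, not recommended group:", uplift_not_recommended)
```

If the uplift model is useful, the observed uplift should be clearly higher in the recommended group than in the not-recommended group.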

How to use CausalLift?

There are 2 ways:

[Deprecated option] Use causallift.CausalLift class interface

Please see the demo code in Google Colab (free cloud CPU/GPU environment):

To run the code, navigate to "Runtime" >> "Run all".

To download the notebook file, navigate to "File" >> "Download .ipynb".

Here are the basic steps to use.

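from causallift import CausalLift

# train_df and test_df are the pandas DataFrames prepared in Step 0.

# Step 1: prepare for uplift modeling; enable_ipw=True estimates propensity scores.
cl = CausalLift(train_df, test_df, enable_ipw=True)

# Step 2: train the 2 models and predict CATE for train_df and test_df.
train_df, test_df = cl.estimate_cate_by_2_models()

# Step 3 (optional): estimate the impact of following the recommendation.
estimated_effect_df = cl.estimate_recommendation_impact()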

[Recommended option] Use causallift.nodes subpackage with PipelineX package

Please see the PipelineX package and use the PipelineX CausalLift example project.

How to run inference (prediction of CATE for new data with Treatment and Outcome unknown)?

Use the whole historical data (A/B Test data or observational data) as train_df instead of splitting it into train_df and test_df, and use the new data with Treatment and Outcome unknown as test_df.

This is possible because Treatment and Outcome are not used for prediction of CATE once the uplift model has been trained using Treatment and Outcome.

Please note that valid evaluation for test_df will not be available as valid Treatment and Outcome are not available.

Details about the parameters

Please see the CausalLift API documentation.

Related Python packages

Related R packages

References

Introductory resources about Uplift Modeling

License

BSD 2-clause License.

To-dos

Contributing

Any feedback is welcome!

Please create an issue for questions, suggestions, and feature requests. Please open pull requests against the develop branch to improve documentation, usability, and features.

Separate pull requests for each improvement are appreciated rather than a big pull request. It is encouraged to use:

If you write a review of CausalLift in any natural language (English, Chinese, Japanese, etc.) or implement similar features in any programming language (R, SAS, etc.), please let me know. I will add the link here.

Keywords to search

[English] Causal Inference, Counterfactual, Propensity Score, Econometrics

[中文] 因果推断, 反事实, 倾向评分, 计量经济学

[日本語] 因果推論, 反事実, 傾向スコア, 計量経済学

Article about CausalLift in Japanese

Author:

Yusuke Minami

Contributors:

@farismosman