This Code Pattern will guide you through how to use XGBoost, scikit-learn, and Python in IBM Watson Studio. The goal is to use a Jupyter notebook and data from the UCI repository for Bank Marketing Data to predict whether a client will purchase a Certificate of Deposit (CD) from a banking institution.
Class imbalance is a common problem in data science, where the number of positive samples is significantly smaller than the number of negative samples. As data scientists, we would like to solve this problem and create a classifier with good performance. XGBoost (Extreme Gradient Boosting) is a very common tool for creating machine learning models for classification and regression. However, creating good classification models with XGBoost on imbalanced data sets requires a variety of tricks and techniques that are non-trivial, and that is the reason for developing this Code Pattern.
In this Code Pattern, we will illustrate how machine learning classification is performed using XGBoost, which is usually a better choice than logistic regression and other techniques. We will use a real-life data set that is highly imbalanced (i.e., the number of positive samples is much smaller than the number of negative samples).
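To make the idea concrete before diving into Watson Studio, here is a minimal sketch of training an XGBoost classifier on an artificially imbalanced data set, using `scale_pos_weight` to up-weight the rare positive class. The synthetic data and parameter values below are illustrative assumptions, not the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# Synthetic data set with roughly 5% positive samples to mimic class imbalance.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# scale_pos_weight up-weights the rare positive class; a common heuristic
# is the ratio of negative to positive samples in the training set.
ratio = (y_train == 0).sum() / (y_train == 1).sum()
clf = XGBClassifier(n_estimators=200, max_depth=4,
                    scale_pos_weight=ratio, eval_metric="logloss")
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```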
This Code Pattern will walk the user through the following conceptual steps:
This Code Pattern consists of the following activities:
Log in or sign up for IBM's Watson Studio.
Note: if you would prefer to skip the remaining Watson Studio set-up steps and just follow along by viewing the completed Notebook, simply:
- View the completed notebook and its outputs, as is.
- While viewing the notebook, you can optionally download it to store for future use.
- When complete, continue this code pattern by jumping ahead to the Explore, Analyze and Predict CD Subscription for Bank Client section.
- Select the `New Project` option from the Watson Studio landing page and choose the `Data Science` option.
- Create a new `Cloud Object Storage` service or select an existing one from your IBM Cloud account.
- Take note of the `Assets` and `Settings` tabs; we'll be using them to associate our project with any external assets (datasets and notebooks) and any IBM Cloud services.
- From the `Assets` tab, click the `+ New notebook` button.
- Use the `From File` tab to specify the file location of the notebook in this repository: `/notebooks/predict_band_cd_subs_by_xgboost_clf_for_imbalance_dataset.ipynb`.
- Click the `Create` button.

Note: If queried for a Python version, select version `3.5`.
From the `Assets` tab, add the data file to your project. The data file is included in this repository (`/data/bank.csv`), and you can upload it as a data asset in your project. Do this by loading the dataset into the pop-up section on the right-hand side. Please see the screenshot below for what this should look like.

In the notebook, go to the `Data Exploration` section and remove the line of code that reads the local data file into the variable (`data_row_all`). Then place the cursor at the top of the cell, above the 2 remaining lines of code. Click the `1001` data icon in the top right; the `bank.csv` data file should show up. Click `Insert Pandas Data Frame`. Once you do that, a whole bunch of code will show up in your cell. Note that the inserted code refers to the DataFrame as `df_data_1`, which does not match the `data_row_all` name used by the rest of the notebook and will cause errors if left unchanged.

When a notebook is executed, what is actually happening is that each code cell in the notebook is executed, in order, from top to bottom.
Each code cell is selectable and is preceded by a tag in the left margin. The tag format is `In [x]:`. Depending on the state of the notebook, the `x` can be:

- A blank, which indicates that the cell has never been executed.
- A number, which represents the relative order in which this code step was executed.
- A `*`, which indicates that the cell is currently executing.

There are several ways to execute the code cells in your notebook:
- One cell at a time: select the cell, and then press the `Play` button in the toolbar.
- Batch mode, in sequential order: from the `Cell` menu bar, there are several options available. For example, you can `Run All` cells in your notebook, or you can `Run All Below`, which will start executing from the first cell under the currently selected cell and then continue executing all cells that follow.
- At a scheduled time: press the `Schedule` button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.

Under the `File` menu, there are several ways to save your notebook:
- `Save` will simply save the current state of your notebook, without any version information.
- `Save Version` will save the current state of your notebook with a version tag that contains a date and time stamp. Up to 10 versions of your notebook can be saved, each one retrievable by selecting the `Revert To Version` menu item.

You can share your notebook by selecting the `Share`
button located in the top
right section of your notebook panel. The end result of this action will be a URL
link that will display a “read-only” version of your notebook. You have several
options to specify exactly what you want shared from your notebook:
- `Only text and output`: will remove all code cells from the notebook view.
- `All content excluding sensitive code cells`: will remove any code cells that contain a sensitive tag. For example, `# @hidden_cell` is used to protect your credentials from being shared.
- `All content, including code`: displays the notebook as is.
- A variety of `download as` options are also available in the menu.

The imbalanced dataset is from Portuguese banking institutions, and is based on phone calls to bank clients regarding the purchase of financial products offered by the bank (i.e., Certificates of Deposit).
For this section we will mostly use Python-based libraries such as XGBoost, scikit-learn, Matplotlib, Seaborn, and pandas.
Data scientists typically perform data exploration to gain better insight into the data. Here we will explore the inputs for distribution, correlation, and outliers, and the outputs to note any class imbalance issues.
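As a rough sketch of what that exploration can look like (the file path, separator, and column names below are assumptions; adjust them to match your copy of `bank.csv`):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The path, separator, and column names are assumptions; adjust to your copy.
data_row_all = pd.read_csv("data/bank.csv", sep=";")

# Class balance of the target (assumed to be the "y" column with yes/no values).
print(data_row_all["y"].value_counts(normalize=True))

# Distribution of a numeric input and correlations between numeric inputs.
sns.histplot(data_row_all["age"], bins=30)
plt.show()

sns.heatmap(data_row_all.select_dtypes(include="number").corr(), cmap="coolwarm")
plt.show()
```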
In typical machine learning applications, an ML pipeline is created so that all the steps that are done on a training data set can be easily applied to the test set.
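A minimal sketch of such a pipeline, assuming a mix of numeric and categorical input columns (the column names below are illustrative, not necessarily the notebook's exact feature set):

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

# Illustrative column names; the notebook's actual feature set may differ.
numeric_cols = ["age", "balance", "duration"]
categorical_cols = ["job", "marital", "education"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Every step fitted on the training data is applied identically to the test data.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", XGBClassifier(eval_metric="logloss")),
])

# pipeline.fit(X_train, y_train)
# predictions = pipeline.predict(X_test)
```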
Model training is an iterative process, and we will do several iterations to improve our model's performance.
Using XGBoost as our tool of choice, we will highlight classification performance metrics such as the ROC curve, the precision-recall curve, and the confusion matrix.
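For reference, these metrics can be computed with scikit-learn roughly as follows, assuming a fitted classifier `clf` and held-out arrays `X_test` / `y_test` (the names are illustrative):

```python
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             precision_recall_curve, roc_auc_score, roc_curve)

# Assumes a fitted classifier `clf` and held-out X_test / y_test.
y_prob = clf.predict_proba(X_test)[:, 1]
y_pred = clf.predict(X_test)

fpr, tpr, _ = roc_curve(y_test, y_prob)                        # ROC curve points
precision, recall, _ = precision_recall_curve(y_test, y_prob)  # PR curve points

print("ROC AUC:          ", roc_auc_score(y_test, y_prob))
print("Average precision:", average_precision_score(y_test, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```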
We then offer multiple strategies to improve our classifier performance.
In many ML training applications, there is a risk that the model won't generalize well to unknown data. To mitigate this, it is recommended that data scientists perform generalization error testing. This involves running cross-validated models that predict on held-out data, to see how the model performs on test data. It is important that we don't look at the held-out data or use it in training, because doing so would bias our model training and result in a large generalization error.
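A minimal sketch of this practice, where cross-validation is run only on the training portion and the held-out test set is scored just once at the end (variable names are illustrative):

```python
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from xgboost import XGBClassifier

# Hold out a test set up front and do not touch it during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = XGBClassifier(eval_metric="logloss")

# Cross-validate only on the training portion to estimate generalization error.
cv_recall = cross_val_score(clf, X_train, y_train, cv=5, scoring="recall")
print("Cross-validated recall:", cv_recall.mean())

# Score the held-out data exactly once, after model selection is finished.
clf.fit(X_train, y_train)
print("Held-out recall:", recall_score(y_test, clf.predict(X_test)))
```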
The following screenshots show that we set the weight on the positive samples to 1000 and the feature selection threshold to 0.008. In the third attempt at running this tuned classifier, we find that our recall for the imbalanced positive class has improved to 0.84 on the test data.
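As a hedged sketch of how such settings might be expressed in code, the snippet below applies a positive-class weight of 1000 via `scale_pos_weight` and an importance threshold of 0.008 via scikit-learn's `SelectFromModel`; this is one way to realize the tuning described above and may differ from the exact mechanism used in the notebook.

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Keep only features whose importance is at least 0.008 (threshold from above).
selector = SelectFromModel(XGBClassifier(eval_metric="logloss"), threshold=0.008)

# Weight the rare positive class heavily (weight of 1000 from above).
weighted_clf = XGBClassifier(scale_pos_weight=1000, eval_metric="logloss")

tuned = Pipeline([("select", selector), ("clf", weighted_clf)])
# tuned.fit(X_train, y_train)
# tuned.predict(X_test)
```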
Awesome job following along! Now go try and take this further or apply it to a different use case!