datacamp / Machine-Learning-With-XGboost-live-training

Live Training Session: Machine Learning with XGboost

Notebook Review #2

adelnehme opened 4 years ago

adelnehme commented 4 years ago


Please read the key below to understand how to respond to the feedback provided. Some items will require you to take action while others only need some thought. During this round of feedback, each item with an associated checkbox is an action item that should be implemented before you submit your content for review.


Key

🔧 This must be fixed during this round of review. This change is necessary to create a good DataCamp live training.

🔍 This wasn't exactly clear. It will need rephrasing or elaboration to fully and clearly convey your meaning.

📣 This is something you should look out for during the session and verbally explain once you arrive at this point.

⭐ Excellent work!


General Feedback

  • Introduction
  • Defining XGBoost - to be able to define XGBoost, let's first break it down into its component parts:
  • Here's a decision tree - this is what it does
  • Here's boosting - this is what it is
  • Explanation of base learners, the different types of base learners (regression vs tree), and that we will be using tree base learners.
  • Give a dummy example of it in action (check out the dummy example on xgboost). This example can be used later in the notebook to explain how a hyperparameter could affect the results - see the sketch after this list 🤔
  • Tie it all together and explain how it works in XGBoost.
  • Mention that there are multiple parameters that we will explore in the session.
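Such a dummy example might look like the minimal sketch below (the toy data from scikit-learn's `make_classification` and the `max_depth` values are assumptions for illustration, not the session's actual example):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy binary-classification data as a stand-in for the real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit two boosted-tree ensembles that differ in a single hyperparameter
for max_depth in (2, 8):
    model = xgb.XGBClassifier(n_estimators=50, max_depth=max_depth)
    model.fit(X_train, y_train)
    print(f"max_depth={max_depth}: accuracy={model.score(X_test, y_test):.3f}")
```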

Notebook Review

Data Dictionary

Which features are most correlated to cancellations?
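One way to surface this, as a sketch (the DataFrame name `df` and the target column `is_canceled` are assumptions):

```python
# Rank numeric features by the strength of their correlation with the target
correlations = (
    df.corr(numeric_only=True)["is_canceled"]
    .drop("is_canceled")
    .sort_values(key=abs, ascending=False)
)
print(correlations.head(10))
```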

Your first XGBoost Classifier
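As a reference, a first fit might look like this minimal sketch (the feature matrix `X`, target `y`, and hyperparameter values are assumptions):

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hold out a test split from the (assumed) features X and target y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out split
```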

Visualizing your tree

XGBoost has two handy visualization functions for interpreting results:

  • plot_importance(), which plots feature importance, i.e. how predictive each feature is of the target variable. It takes the fitted XGBoost model and accepts a subplot axis via the ax argument.
  • plot_tree(), which plots an individual tree from the fitted model and gets similar treatment to plot_importance().
```python
import matplotlib

# Enlarge the default figure size so the plots are readable
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)
```
At the moment, calling plot_tree() fails with:

```
Unable to parse node: 0:[deposit_type_Non
```
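For context, a minimal sketch of how the two helpers are typically called (assuming a fitted model named `model`; plot_tree() additionally requires the graphviz package):

```python
import xgboost as xgb
import matplotlib.pyplot as plt

# Feature importances, drawn onto a dedicated subplot axis
fig, ax = plt.subplots(figsize=(10, 8))
xgb.plot_importance(model, ax=ax)
plt.show()

# Draw the first tree in the ensemble (needs graphviz installed)
xgb.plot_tree(model, num_trees=0)
plt.show()
```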

Cross Validation in XGBoost

Cross validation with xgb.cv
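A minimal sketch of the xgb.cv call (the DMatrix construction, params, and evaluation metric are assumptions):

```python
import xgboost as xgb

# Build a DMatrix from the (assumed) features X and target y
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "max_depth": 4}

# 5-fold cross validation over 50 boosting rounds, tracking AUC
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=50,
    nfold=5,
    metrics="auc",
    seed=42,
)
print(cv_results.tail())
```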

Digging into parameters

From the XGBoost docs, for colsample_bytree:

The subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.

Essentially, this lets us limit the number of columns used when constructing each tree. This adds randomness, making the model more robust to noise. The default is 1 (i.e. all the columns); let's try a smaller value.
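For example (a sketch reusing the assumed params and dtrain from the cross-validation sketch above; 0.5 is just an illustrative value):

```python
# Use only half of the columns when building each tree
params = {
    "objective": "binary:logistic",
    "max_depth": 4,
    "colsample_bytree": 0.5,
}

cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5, metrics="auc", seed=42)
print(cv_results.tail())
```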

lisdc commented 4 years ago

For:

I don't seem to find differences in results between adding {'num_boost_round':10} to the params dictionary and to the xgb.cv function as an argument. Recommend keeping it all in the params dictionary since this is the only thing students have seen so far - similar for the early stopping section.

Maybe this doesn't show up on Colab, but otherwise you usually get a warning when you do this. The params dictionary is meant for parameters specific to the individual boosters, not the whole model. I've added text to explain this.
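A sketch of the distinction (reusing the assumed params and dtrain from above):

```python
# num_boost_round is an argument of the training call, not a booster
# parameter, so it is passed to xgb.cv() directly:
cv_results = xgb.cv(params, dtrain, num_boost_round=10, nfold=5)

# Putting {'num_boost_round': 10} inside the params dict instead is ignored
# by the boosters and (outside of Colab) typically triggers a warning.
```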

lisdc commented 4 years ago
  1. ... I also cannot get any plots output from plot_tree(), as I get the error quoted above.

I will debug this more tomorrow!

lisdc commented 4 years ago

Hey Adel, I've implemented all your feedback here and on the slides as well.

I'll put together the solution notebook on Monday, once you or I have taken another look and shared any additional feedback. Thanks for your guidance!