datacamp / Machine-Learning-With-XGboost-live-training

Live Training Session: Machine Learning with XGboost

Notebook Review #2

adelnehme opened 4 years ago

adelnehme commented 4 years ago


Please read the key below to understand how to respond to the feedback provided. Some items will require you to take action while others only need some thought. During this round of feedback, each item with an associated checkbox is an action item that should be implemented before you submit your content for review.


Key

🔧 This must be fixed during this round of review. This change is necessary to create a good DataCamp live training.

🔍 This wasn't exactly clear. It will need rephrasing or elaboration to fully and clearly convey your meaning.

📣 This is something you should look out for during the session and verbally explain once you arrive at this point.

⭐ Excellent work!


General Feedback

  • Introduction
  • Defining XGBoost - to be able to define XGBoost, let's first break it down into its component parts:
  • Here's a decision tree - this is what it does
  • Here's boosting - this is what it is
  • Explanation of base learners, the different types of base learners (regression vs tree), and that we will be using tree base learners.
  • Give a dummy example of it in action (check out the dummy example on xgboost). This example can be used later in the notebook to explain how a hyperparameter could affect the results - see the sketch after this list 🤔
  • Tie it all together and explain how it works in XGBoost.
  • Mention that there are multiple parameters that we will explore in the session.
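Such a dummy example might look like the minimal sketch below (the toy data from scikit-learn's `make_classification` and the `max_depth` values are assumptions for illustration, not the session's actual example):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy binary-classification data as a stand-in for the real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit two boosted-tree ensembles that differ in a single hyperparameter
for max_depth in (2, 8):
    model = xgb.XGBClassifier(n_estimators=50, max_depth=max_depth)
    model.fit(X_train, y_train)
    print(f"max_depth={max_depth}: accuracy={model.score(X_test, y_test):.3f}")
```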

Notebook Review

Data Dictionary

Which features are most correlated to cancellations?
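One way to surface this, as a sketch (the DataFrame name `df` and the target column `is_canceled` are assumptions):

```python
# Rank numeric features by the strength of their correlation with the target
correlations = (
    df.corr(numeric_only=True)["is_canceled"]
    .drop("is_canceled")
    .sort_values(key=abs, ascending=False)
)
print(correlations.head(10))
```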

Your first XGBoost Classifier
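As a reference, a first fit might look like this minimal sketch (the feature matrix `X`, target `y`, and hyperparameter values are assumptions):

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hold out a test split from the (assumed) features X and target y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out split
```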

Visualizing your tree

XGBoost has two handy visualization functions for interpreting results:

  • plot_importance(), which plots feature importance, i.e. how predictive each feature is of the target variable. It takes the fitted XGBoost model and accepts a subplot axis via the ax argument.
  • plot_tree(), which plots an individual tree from the fitted model and gets similar treatment to plot_importance().
```python
import matplotlib

# Enlarge the default figure size so the plots are readable
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)
```
At the moment, calling plot_tree() fails with:

```
Unable to parse node: 0:[deposit_type_Non
```
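For context, a minimal sketch of how the two helpers are typically called (assuming a fitted model named `model`; plot_tree() additionally requires the graphviz package):

```python
import xgboost as xgb
import matplotlib.pyplot as plt

# Feature importances, drawn onto a dedicated subplot axis
fig, ax = plt.subplots(figsize=(10, 8))
xgb.plot_importance(model, ax=ax)
plt.show()

# Draw the first tree in the ensemble (needs graphviz installed)
xgb.plot_tree(model, num_trees=0)
plt.show()
```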

Cross Validation in XGBoost

Cross validation with xgb.cv
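A minimal sketch of the xgb.cv call (the DMatrix construction, params, and evaluation metric are assumptions):

```python
import xgboost as xgb

# Build a DMatrix from the (assumed) features X and target y
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "max_depth": 4}

# 5-fold cross validation over 50 boosting rounds, tracking AUC
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=50,
    nfold=5,
    metrics="auc",
    seed=42,
)
print(cv_results.tail())
```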

Digging into parameters

From the XGBoost docs, for colsample_bytree:

The subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.

Essentially, this lets us limit the number of columns used when constructing each tree. This adds randomness, making the model more robust to noise. The default is 1 (i.e. all the columns); let's try a smaller value.
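For example (a sketch reusing the assumed params and dtrain from the cross-validation sketch above; 0.5 is just an illustrative value):

```python
# Use only half of the columns when building each tree
params = {
    "objective": "binary:logistic",
    "max_depth": 4,
    "colsample_bytree": 0.5,
}

cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5, metrics="auc", seed=42)
print(cv_results.tail())
```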

lisdc commented 4 years ago

For:

I don't seem to find differences in results between adding {'num_boost_round':10} to the params dictionary and to the xgb.cv function as an argument. Recommend keeping it all in the params dictionary since this is the only thing students have seen so far - similar for the early stopping section.

Maybe this doesn't show up on Colab, but otherwise you usually get a warning when you do this. The params dictionary is meant for parameters specific to the individual boosters, not the whole model. I've added text to explain this.
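A sketch of the distinction (reusing the assumed params and dtrain from above):

```python
# num_boost_round is an argument of the training call, not a booster
# parameter, so it is passed to xgb.cv() directly:
cv_results = xgb.cv(params, dtrain, num_boost_round=10, nfold=5)

# Putting {'num_boost_round': 10} inside the params dict instead is ignored
# by the boosters and (outside of Colab) typically triggers a warning.
```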

lisdc commented 4 years ago
  1. ... I also cannot get any plots output from plot_tree(), as I get the error quoted above.

I will debug this more tomorrow!

lisdc commented 4 years ago

Hey Adel, I've implemented all your feedback here and on the slides as well.

I'll put together the solution notebook on Monday, once you or I have taken another look and shared any additional feedback. Thanks for your guidance!