Open adelnehme opened 4 years ago
For:
I don't seem to find differences in results between adding {'num_boost_round':10} to the params dictionary and to the xgb.cv function as an argument. Recommend keeping it all in the params dictionary since this is the only thing students have seen so far - similar for the early stopping section.
Maybe this doesn't show up on Colab, but otherwise you usually get a warning when you do this. The params dictionary is meant for parameters specific to the individual boosters, not the whole model. I've added text to explain this.
- ... I also cannot get any plots outputted from plot_tree() as I get the error
I will debug this more tomorrow!
Please read the key below to understand how to respond to the feedback provided. Some items will require you to take action while others only need some thought. During this round of feedback, each item with an associated checkbox is an action item that should be implemented before you submit your content for review.
Key
🔧 This must be fixed during this round of review. This change is necessary to create a good DataCamp live training.
🔍 This wasn't exactly clear. This will need rephrasing or elaborating to fully and clearly convey your meaning.
📣 This is something you should look out for during the session and verbally explain once you arrive at this point.
⭐ Excellent work!
General Feedback
... headers #, sub-headers ## and sub-sub-headers ### are all in bold font.
Notebook Review
Data Dictionary
[x] 🔧 1. Would it be possible to segment the different columns into Target Variable: and Features:?
[x] 🔧 2. Check for typos in the explanation of each column, and be consistent about whether to use lower case or upper case after the :
Which features are most correlated to cancelations?
[x] 🔧 3 - a Worth adding a formula of the Pearson Coefficient for 2 variables X and Y - it works really well with the great visualization you added.
[x] 📣 3 - b Keep an eye out to simplify formulas when you explain them - Alex did a great job of doing that in her Time Series Analysis in Python session. For example, here we can just explain this as the "covariance of X and Y, divided by the product of their two standard deviations".
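As a quick illustration of that verbal definition, here is a minimal pure-Python sketch of the Pearson coefficient. The function name and the toy lists are hypothetical and not taken from the notebook.

```python
import math


def pearson(x, y):
    """Pearson coefficient: covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance of x and y.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Population standard deviations of x and y.
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


# Toy data (hypothetical): y rises with x in the first pair and falls in the second.
r_up = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_down = pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
print(r_up, r_down)
```

A perfectly increasing linear relationship gives a value close to 1 and a perfectly decreasing one a value close to -1, which is the intuition the visualization is meant to convey.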
[x] 🔧 4. Could be worth it to showcase the results of .corr() first in a separate cell - identifying it as a correlation matrix - then say I want to only look at is_canceled therefore ...
Your first XGBoost Classifier
[x] 🔧 5 - A. Since students will be tuning the XGBoost classifier with max_depth at a later stage in the notebook - I'm not sure we should touch the max_depth hyperparameter here. While keeping max_depth to 3 - the results are 76% - which are an okay baseline that we can crush with some simple hyperparameter tuning and really clarify the point of using hyperparameter tuning.
[x] 🔧 5 - B. I recommend using the markdown space here as more of a reminder of how XGBoost works - tying it into your introduction mentioned in A. would be ideal here as students instantiate the default XGBoost model.
[x] 📣 5 - C. For instance, you can bring back an example from the slides here: "So guys as a reminder, the XGBoost algorithm utilizes boosting, i.e. creating and combining many weak learners to create a strong learner - Now we're going to instantiate a simple XGBoost Classifier without changing any default hyperparameters - and we're going to inspect the hyperparameters." Having inspected the hyperparameters, you can zoom in on the objective and n_estimators hyperparameters - which are directly tied to the dummy example you mention - since it will be a classification and will showcase more than 1 weak learner producing predictions.
[x] 🔧 6. Not a fan of multiple ## in code cells - try using markdown instead. For example, an expanded markdown cell where "Note the default objective for classification" would go great here to explain what objective and n_estimators represent, as well as how to calculate accuracy.
[x] 🔧 7. Change to "Baseline accuracy" in the print function.
Visualizing your tree
[x] 🔧 9. Change the header of "#### plot_importance()" to "#### Plotting feature importance"
[x] 🔧 📣 10. Check out this - I like the explanation of gain here as the improvement in accuracy brought by a feature.
[x] 🔧 11. If we're not using cover - not sure we should mention it explicitly. For example, you can just state verbally there are other importance types, we'll just be using these two.
[x] 🔧 12. Not sure what is meant by "Plots won't plot unless axes are redefined". For me plotting is just fine - you can set default figsizes for all your plots without having to resort to subplots by using ... I also cannot get any plots outputted from plot_tree() as I get the error ...
Cross Validation in XGBoost
Cross validation with xgb.cv
I don't seem to find differences in results between adding {'num_boost_round':10} to the params dictionary and to the xgb.cv function as an argument. Recommend keeping it all in the params dictionary since this is the only thing students have seen so far - similar for the early stopping section.
Digging into parameters
colsample_bytree - you can add: