Live training sessions are designed to mimic how a real data scientist would approach a problem or task. As such, a session needs a “narrative” in which learners achieve the stated learning objectives through a real-life data science task or project. For example, a data visualization live session could center on analyzing a dataset and creating a report with a specific business objective in mind (e.g., analyzing and visualizing churn), while a data cleaning live session could be about preparing a dataset for analysis.
As part of the 'Live training Spec' process, you will need to complete the following tasks:
Edit this README by filling in the information for steps 1 - 4.
This part of the 'Live training Spec' process is designed to help guide you through session design by having you think through several key questions. Please make sure to delete the examples provided here for you.
- `XGBoost` models
- `XGBoost`'s `DMatrix` to optimize computing performance
- `XGBoost` using the right metrics
- `XGBoost` to achieve the best results
- `XGBoost` to analyze feature importance

Note that there will be no pre-processing in this live training. The data will be presented with clean ready-to-use features.
- `pandas`
- `numpy`
- `scikit-learn`
- `xgboost`
Whether during your opening and closing talk or your live training, you might have to define some terms and jargon to walk students through a problem you’re solving. Intuitive explanations using analogies are encouraged.
To help minimize the amount of Q&As and make your live training re-usable, list out some mistakes and misconceptions you think students might encounter along the way.
- `DMatrix` is the same as `numpy`'s array or `pandas`'s data frames
- `XGBoost` is a library specialized in gradient boosting. It is not an acronym or slang for gradient boosting, and other libraries also let you implement gradient boosting (e.g., `scikit-learn`).

Live training sessions are designed to walk students through something closer to a real-life data science workflow. Accordingly, the dataset needs to accommodate that user experience. As a rule of thumb, your dataset should always answer yes to the following question:
Is the dataset/problem I’m working on something an industry data scientist/analyst could work on?
Check our datasets to avoid list.
Dataset: Hotel Booking Demand
Problem: Predict whether a booking will be cancelled
Terms like "beginner" and "expert" mean different things to different people, so we use personas to help instructors clarify a live training's audience. When designing a specific live training, instructors should explain how it will or won't help these people, and what extra skills or prerequisite knowledge they are assuming their students have above and beyond what's included in the persona.
Check all that apply.
XGBoost is a powerful machine learning library that became very popular after being used in several winning Kaggle competition entries. It is an asset to anyone working in machine learning or looking to up-skill in it (e.g., data analysts and citizen data scientists).
List one or more industries that the content would be appropriate for.
This is relevant for all industries.
This isn't an industry, but this course is especially interesting for anyone who likes doing Kaggle competitions.
List three or more examples of skills that you expect learners to have before beginning the live training
- Can draw common plot types (scatter, bar, histogram) using matplotlib and interpret them
- Can run a linear regression, use it to make predictions, and interpret the coefficients.
- Can calculate grouped summary statistics using SELECT queries with GROUP BY clauses.
- `scikit-learn` to train machine learning models, including functions like `fit()`, `predict()`, and `train_test_split()`.
- `scikit-learn` with cross validation.

List any prerequisite courses you think your live training could draw from. This could be the live session’s companion course or a course you think students should take before the session. Prerequisites act as a guiding principle for your session and will set the topic framework, but you do not have to limit yourself in the live session to the syntax used in the prerequisite courses.
A live training session usually begins with an introductory presentation, followed by the live training itself, and an ending presentation. Your live session is expected to be around 2h30m-3h long (including Q&A) with a hard-limit at 3h30m. You can check out our live training content guidelines here.
- `XGBoost`'s `DMatrix` as an alternative to dataframes
- `booster` options and note that we will be using `gbtree`, which uses a tree as a weak learner (this is the default and the most common). In contrast, there is a `gblinear` option that uses linear regression as a weak learner.
- `DMatrix`
- `xgb.train`
- `eval_metric`
- `xgb.cv`
- `xgb.plot_tree`
- `params = {"objective": "reg:squarederror", "max_depth": 2}`
- `booster`
- `num_boost_round`
- `objective`
- `max_depth`
- `num_trees`
- `n_estimators`
- `min_child_weight`
- `early_stopping_rounds`
- `fold`
- `lambda`, `gamma` and `alpha` for regularization
- `XGBoost`'s Scikit-Learn API
- `scikit-learn`'s `GridSearchCV`
- `scikit-learn`'s `RandomizedSearchCV`
To get yourself started with setting up your live session, follow the steps below:
- `data` folder.
- `assets` folder.
- `notebooks` folder, and keep the template you want for your session while deleting all remaining ones.
- `File`, `Save a copy in GitHub` and follow the remaining prompts. You can also download the notebook locally and develop the content there, as long as you test that the syntax works on Colab as well.
- `session_name_solution.ipynb`; create an empty version of the notebook to be filled out by you and learners during the session, and end its file name with `session_name_learners.ipynb`.