Closed by ArturoAmorQ 2 years ago
The concept is a video running the following lines of code "live" and dynamically showing the info displayed in the resulting diagram:
import pandas as pd
ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values='?')
target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)
data.head()
numeric_features = ['LotArea', 'FullBath', 'HalfBath']
categorical_features = ['Neighborhood','HouseStyle']
data = data[numeric_features + categorical_features]
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
])
from sklearn.linear_model import LogisticRegression
model = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression()),
])
from sklearn import set_config
set_config(display='diagram')
model
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, data, target, cv=5)
cv_results
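Since the Ames dataset is not bundled here, a minimal self-contained sketch (on synthetic data from `make_classification`, which is my substitution) of what `cross_validate` returns could help the narration, i.e. a dict of per-fold timings and scores:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the Ames data, just to show the output structure
X, y = make_classification(n_samples=100, random_state=0)
cv_results = cross_validate(LogisticRegression(), X, y, cv=5)

print(sorted(cv_results))          # ['fit_time', 'score_time', 'test_score']
print(len(cv_results["test_score"]))  # 5 — one score per fold
```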
The use of the SimpleImputer is optional; it is introduced here only to illustrate a slightly more complex pipeline. If we decide to keep it, we should explain in the video that imputers are beyond our scope.
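If we keep the imputer, a tiny illustration of what `strategy='median'` does might be handy for the narration (the `LotArea` values below are made up, not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# One missing value; the median of the observed values replaces it
X = pd.DataFrame({"LotArea": [8450.0, np.nan, 11250.0]})
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(X))
# The NaN becomes 9850.0, the median of [8450.0, 11250.0]
```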
This example could serve as a baseline for addressing #435 and eventually #414. What I have in mind for the latter is adding the following lines of code to the ones above:
new_categorical_features = [c for c in categorical_features if c != 'Neighborhood']  # list.remove returns None, so don't assign its result
new_numerical_features = ['Latitude', 'Longitude']
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, new_categorical_features)
])
model = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression())])
from sklearn.neighbors import KNeighborsClassifier
new_preprocessor = ColumnTransformer(transformers=[
('lat_lon', numeric_transformer, new_numerical_features)
])
neighbors_model = Pipeline(steps=[('new_preprocessor', new_preprocessor),
                                  ('classifier', KNeighborsClassifier())])
from sklearn.ensemble import StackingClassifier
estimators = [('logistic', model),
('KNN', neighbors_model)
]
clf = StackingClassifier(
estimators=estimators, final_estimator= KNeighborsClassifier()
)
clf
It uses StackingClassifier to make the model a bit more complex, feeding the "longitude" and "latitude" of each neighborhood to a KNN classifier.
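The snippet above depends on the Ames columns, so here is a self-contained sketch of the same stacking idea on synthetic data (the `make_classification` data is my substitution, not part of the proposal):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Two base estimators, stacked through a final KNN, as in the proposal
clf = StackingClassifier(
    estimators=[('logistic', LogisticRegression()),
                ('KNN', KNeighborsClassifier())],
    final_estimator=KNeighborsClassifier(),
)
clf.fit(X, y)
print(clf.score(X, y))
```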
Thanks @ArturoAmorQ.
Some notes on the recording you sent us:
OneHotEncoder(handle_unknown="ignore")
, I think it would be better to explain that "ignore" means encoding unknown categories with 0s everywhere.
Some rephrasing suggestions:
The pipeline is a really nice scikit-learn tool to combine one or more data transformation steps with a final classifier or regressor model. The resulting pipeline object can itself be treated as a machine learning model, and it avoids repetitive coding and data leakage ...
I think you can drop the sentence on "The more complex your model is, the more complex the pipeline will be" as it does not really bring actionable information.
about ignore in OHE, using the expression a "label of 0" is a bit misleading because 0 is not really an (ordinal) label but a numerical value after OHE encoding. Let me suggest either a concise but approximate:
it will encode the variable with zeros everywhere
or more precisely:
it will encode it with zero values in all the columns that encode that specific categorical feature.
Miscellaneous remarks:
I edited the indentation of the pipeline and column transformer definitions in the code snippets of https://github.com/INRIA/scikit-learn-mooc/issues/465#issuecomment-927724624 . I think using a more consistent indentation style in this notebook would help make it slightly easier to follow. I agree that the indentation style in our notebooks is not necessarily that consistent, and we would probably benefit from re-exploring the opportunity to reformat them all to be more "vertically indented", for instance using black with an 80 char limit, but this is to be discussed in another issue with everybody. Small manual formatting changes should be enough for this video.
Not that important, but I noticed you tend to pronounce "colium" instead of "column":
https://www.google.com/search?q=how+to+pronounce+column
I think this is not a problem as most people will understand anyway.
Also: "Hello! Today we are gonna talk about pipelines." => "Hello! In this video I will introduce pipelines."
I am just wondering if saying "In this video I will introduce pipelines" is not something that may depend on where we place the video (i.e. before or after the first notebook mentioning it)?
Probably your first suggestion "present pipelines" is more adequate in this case.
But apart from that, thanks @GaelVaroquaux and @ogrisel for your time and feedback! I will take all of the above into account.
I am just wondering if saying "In this video I will introduce pipelines" is not something that may depend on where we place the video (i.e. before or after the first notebook mentioning it)?
I thought the same but I don't think it matters. I think it's fine even if we put the video after the notebook that introduces the pipeline using written text. As you wish. My main remark was more about avoiding "Today" and also trying to avoid "gonna", which I do not find pretty even though I tend to use it a lot myself.
In the video I say "going to", as I agree that "gonna" does not sound very professional.
Indeed, I don't know why I wrongly memorized you said "gonna". I cannot trust my own ears...
I recently learned that the sklearn.set_config option display='diagram' is actually interactive, and apparently I was not the only one surprised by this. A short video could show how to manipulate these HTML diagrams while motivating the use of a pipeline for assembling complex models.
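For reference, the same HTML that powers the interactive notebook diagram can be generated programmatically with `sklearn.utils.estimator_html_repr` (available since scikit-learn 0.23), which might be useful for preparing the video:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

model = Pipeline(steps=[('scaler', StandardScaler()),
                        ('classifier', LogisticRegression())])

# The HTML string rendered as the collapsible diagram in notebooks
html = estimator_html_repr(model)
print(html[:60])
```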