INRIA / scikit-learn-mooc

Machine learning in Python with scikit-learn MOOC
https://inria.github.io/scikit-learn-mooc
Creative Commons Attribution 4.0 International

Add video of HTML representation of `Pipeline` #465

Closed ArturoAmorQ closed 2 years ago

ArturoAmorQ commented 3 years ago

I recently learned that the diagram enabled by sklearn.set_config(display='diagram') is actually interactive, and apparently I was not the only one surprised by this. A short video could show how to manipulate these HTML diagrams while motivating the use of a pipeline for assembling complex models.
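For readers who want to try the diagram outside a notebook, here is a minimal sketch. It assumes a toy two-step pipeline (the step names are illustrative, not taken from the course material) and uses `sklearn.utils.estimator_html_repr` to obtain the same HTML that a notebook renders when the display option is set to `'diagram'`.

```python
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Ask scikit-learn to render estimators as HTML diagrams in notebooks.
set_config(display="diagram")

# Hypothetical toy pipeline, only used to have something to display.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Build the HTML string behind the diagram; in a notebook, evaluating
# `pipe` in the last line of a cell shows the rendered version directly.
html = estimator_html_repr(pipe)
```

Saving `html` to a file and opening it in a browser should show the same boxes, which can be clicked to unfold the parameters of each step.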

ArturoAmorQ commented 3 years ago

The concept is a video running the following lines of code live and dynamically showing the info displayed in the resulting diagram:

import pandas as pd

ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values='?')

target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)
data.head()

numeric_features = ['LotArea', 'FullBath', 'HalfBath']
categorical_features = ['Neighborhood','HouseStyle']
data = data[numeric_features + categorical_features]

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_transformer = OneHotEncoder(handle_unknown='ignore')

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
])
from sklearn.linear_model import LogisticRegression

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])
from sklearn import set_config

set_config(display='diagram')
model

from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=5)
cv_results

The use of the SimpleImputer is optional and is introduced here only to illustrate a slightly more complex pipeline. If we decide to keep it, then we should explain in the video that using imputers is beyond our scope.

ArturoAmorQ commented 3 years ago

This example could serve as a baseline for addressing #435 and eventually #414. What I have in mind for the latter is adding the following lines of code to the ones above:

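# list.remove() mutates in place and returns None, so build a new list instead
new_categorical_features = [
    feature for feature in categorical_features if feature != 'Neighborhood'
]
new_numerical_features = ['Latitude', 'Longitude']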

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, new_categorical_features)
])

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])

from sklearn.neighbors import KNeighborsClassifier

new_preprocessor = ColumnTransformer(transformers=[
    ('lat_lon', numeric_transformer, new_numerical_features)
    ])

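# Each stacked estimator needs a final predictor step; a KNN classifier is
# assumed here, following the description given for this snippet.
neighbors_model = Pipeline(steps=[
    ('new_preprocessor', new_preprocessor),
    ('classifier', KNeighborsClassifier()),
])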

from sklearn.ensemble import StackingClassifier

estimators = [
    ('logistic', model),
    ('KNN', neighbors_model),
]
clf = StackingClassifier(
    estimators=estimators, final_estimator=KNeighborsClassifier()
)

clf

It uses StackingClassifier to make the model a bit more complex, feeding the "longitude" and "latitude" of each neighborhood to a KNN classifier.
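To check that such a stacked model can actually be fitted, here is a self-contained sketch on synthetic data. Everything in it is an assumption made for illustration: the toy DataFrame, its random values, the "Latitude" and "Longitude" columns, and the choice of a KNN classifier as the last step of the latitude/longitude pipeline. It is not the dataset or the exact code intended for the video.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (hypothetical columns and values).
rng = np.random.RandomState(0)
n_samples = 200
toy = pd.DataFrame({
    "LotArea": rng.uniform(1_000, 20_000, size=n_samples),
    "HouseStyle": rng.choice(["1Story", "2Story"], size=n_samples),
    "Latitude": rng.uniform(41.9, 42.1, size=n_samples),
    "Longitude": rng.uniform(-93.7, -93.6, size=n_samples),
})
# Arbitrary binary target that depends on both the lot area and the location.
toy_target = (
    toy["LotArea"] + 5_000 * (toy["Latitude"] > 42.0) > 12_000
).astype(int)

# First base model: scaled numeric and one-hot encoded categorical features.
logistic_model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["LotArea"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["HouseStyle"]),
    ])),
    ("classifier", LogisticRegression()),
])

# Second base model: KNN on the scaled geographic coordinates only.
knn_model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("lat_lon", StandardScaler(), ["Latitude", "Longitude"]),
    ])),
    ("classifier", KNeighborsClassifier()),
])

# The final estimator is trained on the predictions of the two base models.
stacked = StackingClassifier(
    estimators=[("logistic", logistic_model), ("knn", knn_model)],
    final_estimator=KNeighborsClassifier(),
)
stacked.fit(toy, toy_target)
predictions = stacked.predict(toy)
```

Each base pipeline selects its own columns through its ColumnTransformer, so both can receive the same full DataFrame. Evaluating `stacked` in a notebook cell with the diagram display enabled should show the two pipelines side by side above the final estimator.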

ogrisel commented 3 years ago

Thanks @ArturoAmorQ.

Some notes on the recording you sent us:

GaelVaroquaux commented 3 years ago
ogrisel commented 3 years ago

Some rephrasing suggestions:

The pipeline is a really nice tool of scikit-learn to combine one or more data transformation steps with a final classifier or regressor model. The resulting pipeline object can itself be treated as a machine learning model and it avoids repetitive coding and data leakage ...

it will encode the variable with zeros everywhere

or more precisely:

it will encode it with zero values in all the columns that encode that specific categorical feature.
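The behaviour described above can be checked with a small sketch. The neighborhood names are illustrative values chosen for this example, and "Unseen" stands for any category that was absent when the encoder was fitted.

```python
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on three known categories of a single feature.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["NAmes"], ["CollgCr"], ["OldTown"]])

# A known category gets a single 1 in its own column.
known = encoder.transform([["CollgCr"]]).toarray()

# A category never seen during fit is encoded with zeros in every column
# of that feature instead of raising an error.
unknown = encoder.transform([["Unseen"]]).toarray()
```

With the default `handle_unknown='error'`, the second `transform` call would raise an error instead.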

Miscellaneous remarks:

https://www.google.com/search?q=how+to+pronounce+column

I think this is not a problem as most people will understand anyway.

ogrisel commented 3 years ago

Also: "Hello! Today we are gonna talk about pipelines." => "Hello! In this video I will introduce pipelines."

ArturoAmorQ commented 3 years ago

Also: "Hello! Today we are gonna talk about pipelines." => "Hello! In this video I will introduce pipelines."

I am just wondering if saying "In this video I will introduce pipelines" is not something that may depend on where we place the video (i.e. before or after the first notebook mentioning it)?

Probably your first suggestion "present pipelines" is more adequate in this case.

ArturoAmorQ commented 3 years ago

But apart from that, thanks @GaelVaroquaux and @ogrisel for your time and feedback! I will take all of the above into account.

ogrisel commented 3 years ago

I am just wondering if saying "In this video I will introduce pipelines" is not something that may depend on where we place the video (i.e. before or after the first notebook mentioning it)?

I thought the same, but I don't think it matters. I think it's fine even if we put the video after the notebook that introduces the pipeline using written text. As you wish. My main remark was more about avoiding "Today" and also trying to avoid "gonna", which I do not find pretty even though I tend to use it a lot myself as well.

ArturoAmorQ commented 3 years ago

My main remark was more about avoiding "Today" and also trying to avoid "gonna", which I do not find pretty even though I tend to use it a lot myself as well.

In the video I say "going to", as I agree that "gonna" does not sound very professional.

ogrisel commented 3 years ago

Indeed, I don't know why I misremembered you saying "gonna". I cannot trust my own ears...