Closed by ArturoAmorQ 2 years ago
The concept is a video running the following lines of code "live" and dynamically showing the info displayed in the resulting diagram:
import pandas as pd
ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values='?')
target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)
data.head()
numeric_features = ['LotArea', 'FullBath', 'HalfBath']
categorical_features = ['Neighborhood','HouseStyle']
data = data[numeric_features + categorical_features]
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
])
from sklearn.linear_model import LogisticRegression
model = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression()),
])
from sklearn import set_config
set_config(display='diagram')
model
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, data, target, cv=5)
cv_results
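Since the Ames dataset is not bundled here, a minimal self-contained sketch (on synthetic data from `make_classification`, which is my substitution) of what `cross_validate` returns could help the narration, i.e. a dict of per-fold timings and scores:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the Ames data, just to show the output structure
X, y = make_classification(n_samples=100, random_state=0)
cv_results = cross_validate(LogisticRegression(), X, y, cv=5)

print(sorted(cv_results))          # ['fit_time', 'score_time', 'test_score']
print(len(cv_results["test_score"]))  # 5 — one score per fold
```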
The use of the SimpleImputer is optional; it is introduced here only to illustrate a slightly more complex pipeline. If we decide to keep it, we should explain in the video that imputers are beyond our scope.
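If we keep the imputer, a tiny illustration of what `strategy='median'` does might be handy for the narration (the `LotArea` values below are made up, not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# One missing value; the median of the observed values replaces it
X = pd.DataFrame({"LotArea": [8450.0, np.nan, 11250.0]})
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(X))
# The NaN becomes 9850.0, the median of [8450.0, 11250.0]
```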
This example could serve as a baseline for addressing #435 and eventually #414. What I have in mind for the latter is adding the following lines of code to the ones above:
new_categorical_features = [c for c in categorical_features if c != 'Neighborhood']  # list.remove returns None, so don't assign its result
new_numerical_features = ['Latitude', 'Longitude']
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, new_categorical_features)
])
model = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression())])
from sklearn.neighbors import KNeighborsClassifier
new_preprocessor = ColumnTransformer(transformers=[
('lat_lon', numeric_transformer, new_numerical_features)
])
neighbors_model = Pipeline(steps=[('new_preprocessor', new_preprocessor),
                                  ('classifier', KNeighborsClassifier())])
from sklearn.ensemble import StackingClassifier
estimators = [('logistic', model),
('KNN', neighbors_model)
]
clf = StackingClassifier(
estimators=estimators, final_estimator= KNeighborsClassifier()
)
clf
It uses StackingClassifier to make the model a bit more complex, feeding the "longitude" and "latitude" of each neighborhood to a KNN classifier.
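The snippet above depends on the Ames columns, so here is a self-contained sketch of the same stacking idea on synthetic data (the `make_classification` data is my substitution, not part of the proposal):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Two base estimators, stacked through a final KNN, as in the proposal
clf = StackingClassifier(
    estimators=[('logistic', LogisticRegression()),
                ('KNN', KNeighborsClassifier())],
    final_estimator=KNeighborsClassifier(),
)
clf.fit(X, y)
print(clf.score(X, y))
```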
Thanks @ArturoAmorQ.
Some notes on the recording you sent us:
OneHotEncoder(handle_unknown="ignore")
, I think it would be better to explain that "ignore" means encoding unknown categories with 0s everywhere.
Some rephrasing suggestions:
The pipeline is a really nice scikit-learn tool to combine one or more data transformation steps with a final classifier or regressor model. The resulting pipeline object can itself be treated as a machine learning model, and it avoids repetitive coding and data leakage ...
I think you can drop the sentence on "The more complex your model is, the more complex the pipeline will be" as it does not really bring actionable information.
about ignore in OHE, using the expression a "label of 0" is a bit misleading because 0 is not really an (ordinal) label but a numerical value after OHE encoding. Let me suggest either a concise but approximate:
it will encode the variable with zeros everywhere
or more precisely:
it will encode it with zero values in all the columns that encode that specific categorical feature.
Miscellaneous remarks:
I edited the indentation of the pipeline and column transformer definitions in the code snippets of https://github.com/INRIA/scikit-learn-mooc/issues/465#issuecomment-927724624 . I think using a more consistent indentation style in this notebook would help make it slightly easier to follow. I agree that the indentation style in our notebooks is not necessarily that consistent, and we would probably benefit from re-exploring the opportunity to reformat them all to be more "vertically indented", for instance using black with an 80 char limit, but this is to be discussed in another issue with everybody. Small manual formatting changes should be enough for this video.
Not that important, but I noticed you tend to pronounce "colium" instead of "column":
https://www.google.com/search?q=how+to+pronounce+column
I think this is not a problem as most people will understand anyway.
Also: "Hello! Today we are gonna talk about pipelines." => "Hello! In this video I will introduce pipelines."
I am just wondering if saying "In this video I will introduce pipelines" is not something that may depend on where we place the video (i.e. before or after the first notebook mentioning it)?
Probably your first suggestion "present pipelines" is more adequate in this case.
But apart from that, thanks @GaelVaroquaux and @ogrisel for your time and feedback! I will take all of the above into account.
I am just wondering if saying "In this video I will introduce pipelines" is not something that may depend on where we place the video (i.e. before or after the first notebook mentioning it)?
I thought the same but I don't think it matters. I think it's fine even if we put the video after the notebook that introduces the pipeline using written text. As you wish. My main remark was more about avoiding "Today" and also trying to avoid "gonna", which I do not find pretty even though I tend to use it a lot myself.
In the video I say "going to", as I agree that "gonna" does not sound very professional.
Indeed, I don't know why I wrongly memorized you said "gonna". I cannot trust my own ears...
I recently learned that the sklearn.set_config option display='diagram' is actually interactive, and apparently I was not the only one surprised by this. A short video could show how to manipulate these HTML diagrams while motivating the use of a pipeline for assembling complex models.
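For reference, the same HTML that powers the interactive notebook diagram can be generated programmatically with `sklearn.utils.estimator_html_repr` (available since scikit-learn 0.23), which might be useful for preparing the video:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

model = Pipeline(steps=[('scaler', StandardScaler()),
                        ('classifier', LogisticRegression())])

# The HTML string rendered as the collapsible diagram in notebooks
html = estimator_html_repr(model)
print(html[:60])
```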