Scikit-Learn Cheat Sheet (2022), Python for Data Science
The absolute basics for beginners learning Scikit-Learn in 2022

Photo from Unsplash by Tim Stief
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression, clustering algorithms, and efficient tools for data mining and data analysis. It’s built on NumPy, SciPy, and Matplotlib.
Sections:
1. Basic Example
2. Loading the Data
3. Training and Test Data
4. Preprocessing the Data
5. Create Your Model
6. Model Fitting
7. Prediction
8. Evaluate Your Model’s Performance
9. Tune Your Model
Basic Example:
The code below demonstrates the basic steps of using scikit-learn to create and run a model on a set of data.
The steps in the code include: loading the data, splitting into train and test sets, scaling the sets, creating the model, fitting the model on the data, using the trained model to make predictions on the test set, and finally evaluating the performance of the model.
Your data needs to be numeric and stored as NumPy arrays or SciPy spare matrix. Other types that convert to numeric arrays, such as Pandas DataFrame’s are also acceptable.
import numpy as np
X = np.random.random((10,5))array([[0.21069686, 0.33457064],
[0.23887117, 0.6093155 ],
[0.48848537, 0.62649292]])>>> y = np.array(['A','B','A'])array(['A', 'B', 'A'])
Training and Test Data
Splits the dataset into training and test sets for both the X and y variables.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y, random_state = 0)
Preprocessing The Data
Getting the data ready before the model is fitted.
Standardization
Standardize’s the features by removing the mean and scaling to unit variance.
Each sample (i.e. each row of the data matrix) with at least one non-zero component is rescaled independently of other samples so that its norm equals one.
Various regression and classification metrics that determine how well a model performed on a test set.
Classification Metrics
Accuracy Score
knn.score(X_test,y_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)
Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
Confusion Matrix
from sklearn .metrics import confusion_matrix
print(confusion_matrix(y_test,y_pred))
Regression Metrics
Mean Absolute Error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,y_pred)
Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,y_pred)
R² Score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
Clustering Metrics
Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_test,y_pred)
Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_test,y_pred)
V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_test,y_pred)
Cross-Validation
Evaluate a score by cross-validation
from sklearn.model_selection import cross_val_score
print(cross_val_score(knn, X_train, y_train, cv=4))
Tune Your Model
Finding correct parameter values that will maximize a model’s prediction accuracy.
Grid Search
Exhaustive search over specified parameter values for an estimator. The example below attempts to find the right amount of clusters to specify for knn to maximize the model’s accuracy.
Randomized search on hyperparameters. In contrast to Grid Search, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.
Scikit-learn is an extremely useful library for various machine learning models. The sections above provided a basic step-by-step process to perform an analysis on different models. If you want to learn more however check out the documentation for Scikit-Learn, as there are still plenty of useful functions you could learn.
Scikit-Learn Cheat Sheet (2022), Python for Data Science
The absolute basics for beginners learning Scikit-Learn in 2022

Photo from Unsplash by Tim Stief
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression, clustering algorithms, and efficient tools for data mining and data analysis. It’s built on NumPy, SciPy, and Matplotlib.
Sections: 1. Basic Example 2. Loading the Data 3. Training and Test Data 4. Preprocessing the Data 5. Create Your Model 6. Model Fitting 7. Prediction 8. Evaluate Your Model’s Performance 9. Tune Your Model
Basic Example:
The code below demonstrates the basic steps of using scikit-learn to create and run a model on a set of data.
The steps in the code include: loading the data, splitting into train and test sets, scaling the sets, creating the model, fitting the model on the data, using the trained model to make predictions on the test set, and finally evaluating the performance of the model.
Loading the Data
Your data needs to be numeric and stored as NumPy arrays or SciPy spare matrix. Other types that convert to numeric arrays, such as Pandas DataFrame’s are also acceptable.
Training and Test Data
Splits the dataset into training and test sets for both the X and y variables.
Preprocessing The Data
Getting the data ready before the model is fitted.
Standardization
Standardize’s the features by removing the mean and scaling to unit variance.
Normalization
Each sample (i.e. each row of the data matrix) with at least one non-zero component is rescaled independently of other samples so that its norm equals one.
Binarization
Binarize data (set feature values to 0 or 1) according to a threshold.
Encoding Categorical Features
Encode’s target labels with values between 0 and n_classes-1.
Imputing Missing Values
Imputation transformer for completing missing values.
Generating Polynomial Features
Generate a new feature matrix consisting of all polynomial combinations of the features with degrees less than or equal to the specified degree.
Create Your Model
Creation of various supervised and unsupervised learning models.
Supervised Learning Models
Linear Regression
Support Vector Machines (SVM)
Naive Bayes
KNN
Unsupervised Learning Models
Principal Component Analysis (PCA)
K means
Model Fitting
Fitting supervised and unsupervised learning models onto data.
Supervised learning
Fit the model to the data
Unsupervised learning
Fit the model to the data
Fit to data, then transform it
Prediction
Predicting test sets using trained models.
Predict labels
Supervised Estimators
Estimate probability of a label
Evaluate Your Model’s Performance
Various regression and classification metrics that determine how well a model performed on a test set.
Classification Metrics
Accuracy Score
Classification Report
Confusion Matrix
Regression Metrics
Mean Absolute Error
Mean Squared Error
R² Score
Clustering Metrics
Adjusted Rand Index
Homogeneity
V-measure
Cross-Validation
Evaluate a score by cross-validation
Tune Your Model
Finding correct parameter values that will maximize a model’s prediction accuracy.
Grid Search
Exhaustive search over specified parameter values for an estimator. The example below attempts to find the right amount of clusters to specify for knn to maximize the model’s accuracy.
Randomized Parameter Optimization
Randomized search on hyperparameters. In contrast to Grid Search, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.
Scikit-learn is an extremely useful library for various machine learning models. The sections above provided a basic step-by-step process to perform an analysis on different models. If you want to learn more however check out the documentation for Scikit-Learn, as there are still plenty of useful functions you could learn.