UBC-MDS / sktidy

Broom but for sklearn, to tidy up the messy fit results for Linear Regression and KMeans.
https://sktidy.readthedocs.io/en/latest
MIT License
kmeans-clustering linear-regression sklearn-models tidy-dataframes

sktidy

Python package that produces tidy output for sklearn model evaluation!

Summary

sktidy implements tidy and augment functions for scikit-learn linear regression and KMeans clustering to ease model selection and assessment. The tidy family of functions provides functionality similar to tidy in Pybroom, but for sklearn models: each returns a tidy pandas DataFrame with key model information at the level of features (linear regression) or clusters (KMeans). The augment functions return a pandas DataFrame at the level of the original data points: cluster assignments and silhouette scores for KMeans, and predicted values and residuals for linear regression.

How sktidy fits into the Python ecosystem

The tidy and augment functions are inspired by the functions of the same name in the Pybroom package, which is in turn inspired by the R library broom. The current implementation of Pybroom supports scipy and lmfit objects. Sklearn lacks similar functionality for obtaining model-fitting results as a tidy DataFrame that is easy to process and plot. Tidy DataFrames allow plotting libraries to generate plots comparing many variables without lengthy data cleaning and wrangling. Plotting libraries that support tidy DataFrames include seaborn, recent versions of matplotlib, bokeh and altair.

Installation

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple sktidy

Features

Dependencies

The dependencies for this package are:

For more details, you can check out pyproject.toml

Usage

Linear Regression

    # Import packages
    from sklearn.linear_model import LinearRegression
    from sklearn import datasets
    # Assumes tidy_lr and augment_lr are exposed at the package top level;
    # adjust the import path if your installed version differs
    from sktidy import tidy_lr, augment_lr
    # Load the data once and train the linear regression model
    X, y = datasets.load_iris(return_X_y=True, as_frame=True)
    lr_model = LinearRegression()
    lr_model.fit(X, y)
    # Get tidy output for the trained sklearn LinearRegression model
    tidy_lr(model=lr_model, X=X, y=y)
    # Get predicted y values and residuals
    augment_lr(model=lr_model, X=X, y=y)
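
To give a feel for the kind of information these two functions organise, here is a minimal sketch that builds comparable tables by hand with only pandas and scikit-learn. It is not sktidy's implementation: the names `tidy_by_hand` and `augment_by_hand` and all column names are hypothetical, and the actual sktidy output may contain different or additional columns.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Load the iris features and target as pandas objects
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
lr_model = LinearRegression().fit(X, y)

# Feature-level summary: one row per model term (hypothetical column names)
tidy_by_hand = pd.DataFrame({
    "feature": ["intercept", *X.columns],
    "coefficient": [lr_model.intercept_, *lr_model.coef_],
})

# Observation-level summary: original data plus prediction and residual
augment_by_hand = X.copy()
augment_by_hand["target"] = y
augment_by_hand["predicted"] = lr_model.predict(X)
augment_by_hand["residual"] = y - augment_by_hand["predicted"]
```

The first table has one row per feature and the second one row per data point, which mirrors the split between tidy and augment described in the Summary.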

KMeans

    # Import packages
    from sklearn.cluster import KMeans
    from sklearn import datasets
    # Assumes tidy_kmeans and augment_kmeans are exposed at the package top level;
    # adjust the import path if your installed version differs
    from sktidy import tidy_kmeans, augment_kmeans
    # Extract the data and train the clustering algorithm
    df = datasets.load_iris(return_X_y=True, as_frame=True)[0]
    kmeans_clusterer = KMeans()
    kmeans_clusterer.fit(df)
    # Get the tidy DataFrame of cluster information
    tidy_kmeans(model=kmeans_clusterer, X=df)
    # Get the cluster assignment for each data point
    augment_kmeans(model=kmeans_clusterer, X=df)
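
For comparison, the sketch below assembles cluster-level and point-level tables by hand using scikit-learn's own attributes and `silhouette_samples`. It is an illustration only, not sktidy's implementation: `tidy_by_hand`, `augment_by_hand`, the column names, and the choice of three clusters with a fixed random seed are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Load the iris features as a DataFrame
df = datasets.load_iris(return_X_y=True, as_frame=True)[0]

# n_clusters=3 and random_state=0 are arbitrary choices for this sketch
kmeans_clusterer = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)
labels = kmeans_clusterer.labels_

# Cluster-level summary: one row per cluster with its centre and size
tidy_by_hand = pd.DataFrame(kmeans_clusterer.cluster_centers_, columns=df.columns)
tidy_by_hand.insert(0, "cluster", range(kmeans_clusterer.n_clusters))
tidy_by_hand["n_points"] = np.bincount(labels, minlength=kmeans_clusterer.n_clusters)

# Point-level summary: original data plus cluster label and silhouette score
augment_by_hand = df.copy()
augment_by_hand["cluster"] = labels
augment_by_hand["silhouette"] = silhouette_samples(df, labels)
```

Each silhouette score lies between -1 and 1, with values near 1 indicating a point that sits well inside its assigned cluster.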

Documentation

The official documentation is hosted on Read the Docs: https://sktidy.readthedocs.io/en/latest/

Contributors

We welcome and recognize all contributions. You can see a list of current contributors in the contributors tab.

The original contributors to the project were:

Credits

This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci and audreyr/cookiecutter-pypackage project templates.