LineaLabs / lineapy

Move fast from data science prototype to pipeline. Capture, analyze, and transform messy notebooks into data pipelines with just two lines of code.
https://lineapy.org
Apache License 2.0
LineaPy

Capture, analyze, and transform messy notebooks into data pipelines
with just two lines of code.

Follow LineaPy on Twitter! Join the LineaPy Slack!

Ask questions or learn about our workshops on our Slack!

👇 Try It Out! 👇

Open in Colab

https://user-images.githubusercontent.com/13392380/169427654-487d8d4b-3eda-462a-a96c-51c151f39ab9.mp4


What Problems Can LineaPy Solve?

Use Case 1: Cleaning Messy Notebooks

When working in a Jupyter notebook day after day, it's easy to write messy code: we might execute cells out of order, execute the same cell repeatedly, and edit or delete cells until we get good results, especially when generating tables, models, and charts. This highly dynamic and interactive notebook use, however, can introduce some issues. Our colleagues may not be able to reproduce our results by rerunning our notebook, and worse still, we ourselves may forget the steps required to produce our previous results.

One way to avoid this problem is to keep the notebook in sequential order by constantly re-executing the entire notebook during development. This approach, however, interrupts our natural workflow and train of thought, decreasing our productivity. Therefore, it is much more common to clean up the notebook after development. This is a time-consuming process that is not immune to the reproducibility issues caused by deleted cells and out-of-order cell executions.

To see how LineaPy can help with messy notebooks, check out this demo or Open in Colab.

Use Case 2: Revisiting Previous Work

Data science is often a team effort where one person's work relies on results from another's. For example, a data scientist building a model may use features engineered by other colleagues. When using results generated by other people, we may encounter data quality issues including missing values, suspicious numbers, and unintelligible variable names. When we encounter these issues, we may need to check how these results came into being in the first place. Often, this means tracing back the code that was used to generate the result in question. In practice, this can be a challenging task because we may not know who produced the result. Even if we know who to ask, that person might not remember where the exact version of the code is stored, or worse, may have overwritten the code without version control. Additionally, the person may no longer be at the organization and may not have handed over the relevant knowledge. In any of these cases, it becomes extremely difficult to identify the root of any issues, rendering the result unreliable and even unusable.

To see how LineaPy can help here, check out this demo or Open in Colab.

Use Case 3: Building Pipelines

As our notebooks become more mature, we may use them like pipelines. For example, our notebook might process the latest data to update a dashboard, or pre-process data and dump it into the file system for downstream model development. To keep our results up-to-date, we might be expected to re-execute these processes on a regular basis. Running notebooks manually is a brittle process that's prone to errors, so we may want to set up proper pipelines for production. If relevant engineering support is not available, we need to clean up and refactor our notebook code so that it can be used in orchestration systems or job schedulers, such as cron, Apache Airflow, Argo, Kubeflow, DVC, or Ray. Of course, this assumes that we already know how these tools work and how to use them. If not, we need to spend time learning about them in the first place! All this operational work is time-consuming, and detracts from the time that we can spend on our core duties as data scientists.

To see how LineaPy can help here, check out this demo or Open in Colab.

Getting Started

LineaPy is a Python package for capturing, analyzing, and automating data science workflows. At a high level, LineaPy traces the sequence of code execution to form a comprehensive understanding of the code and its context. This understanding allows LineaPy to provide a set of tools that help data scientists bring their work to production more quickly and easily, with just two lines of code.

Check this section for types of problems that LineaPy can help to solve.

Prerequisites

LineaPy runs on Python>=3.7,<3.11 and IPython>=7.0.0. It does not come with a Jupyter installation, so you will need to install one for interactive computing.
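A quick way to confirm that an interpreter falls inside the Python range stated above is a small check like the following. This is a minimal sketch: the names `MIN_PYTHON`, `MAX_PYTHON_EXCLUSIVE`, and `is_supported_python` are illustrative and not part of LineaPy, and the IPython requirement is not checked here.

```python
import sys

# Supported range stated above: Python >=3.7 and <3.11.
MIN_PYTHON = (3, 7)
MAX_PYTHON_EXCLUSIVE = (3, 11)


def is_supported_python(version=None):
    """Return True if the (major, minor) version is in the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_PYTHON <= major_minor < MAX_PYTHON_EXCLUSIVE


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "inside" if is_supported_python() else "outside"
    print(f"Python {current} is {status} the supported range")
```

Running this in the environment where you plan to install LineaPy tells you whether a different interpreter is needed before running `pip install`.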

Installation

To install LineaPy, run:

pip install lineapy

If you want to run the latest version of LineaPy directly from the source, follow the instructions here.

LineaPy offers several extras to extend its core capabilities, such as support for PostgreSQL or Amazon S3. Learn more about these and other installation options here.

Interfaces

Jupyter and IPython

To use LineaPy in an interactive computing environment such as Jupyter Notebook/Lab or IPython, load its extension by executing the following command at the top of your session:

%load_ext lineapy

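Please note that this loads the extension for the current session only, so you will need to run it again at the top of each new session.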

Alternatively, you can launch the environment with the lineapy command, like so:

lineapy jupyter notebook
lineapy jupyter lab
lineapy ipython

This will automatically load the LineaPy extension in the corresponding interactive shell application, and you will not need to manually load it for every new session.

NOTE: If your Jupyter environment has multiple kernels, choose Python 3 (ipykernel), which lineapy defaults to.

CLI

You can also use LineaPy as a CLI command or runnable Python module. To see available options, run the following commands:

# LineaPy as a CLI command
lineapy python --help

or

# LineaPy as a runnable Python module
python -m lineapy --help

Quick Start

Once LineaPy is installed and loaded, you are ready to start using the package. Let's look at a simple example using the Iris dataset to demonstrate how to use LineaPy to 1) store a variable's history, 2) get its cleaned-up code, and 3) build an executable pipeline for the variable.

The following development code fits a linear regression model to the Iris dataset:

import lineapy
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Load data
url = "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
df = pd.read_csv(url)

# Map each species to a color
color_map = {"Setosa": "green", "Versicolor": "blue", "Virginica": "red"}
df["variety_color"] = df["variety"].map(color_map)

# Plot petal vs. sepal width by species
df.plot.scatter("petal.width", "sepal.width", c="variety_color")
plt.show()

# Create dummy variables encoding species
df["d_versicolor"] = df["variety"].apply(lambda x: 1 if x == "Versicolor" else 0)
df["d_virginica"] = df["variety"].apply(lambda x: 1 if x == "Virginica" else 0)

# Initiate the model
mod = LinearRegression()

# Fit the model
mod.fit(
    X=df[["petal.width", "d_versicolor", "d_virginica"]],
    y=df["sepal.width"],
)

Let's say you're happy with your above code, and you've decided to save the trained model. You can store the model as a LineaPy artifact with the following code:

# Save the model as an artifact
lineapy.save(mod, "iris_model")

A LineaPy artifact encapsulates both the value and code, so you can easily retrieve the model's code, like so:

# Retrieve the model artifact
artifact = lineapy.get("iris_model")

# Check code for the model artifact
print(artifact.get_code())

The print statement will output:

import pandas as pd
from sklearn.linear_model import LinearRegression

url = "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
df = pd.read_csv(url)
color_map = {"Setosa": "green", "Versicolor": "blue", "Virginica": "red"}
df["variety_color"] = df["variety"].map(color_map)
df["d_versicolor"] = df["variety"].apply(lambda x: 1 if x == "Versicolor" else 0)
df["d_virginica"] = df["variety"].apply(lambda x: 1 if x == "Virginica" else 0)
mod = LinearRegression()
mod.fit(
    X=df[["petal.width", "d_versicolor", "d_virginica"]],
    y=df["sepal.width"],
)

Note that these are the minimal essential steps to produce the model. That is, LineaPy has automatically cleaned up the original code by removing extraneous operations that do not affect the model (e.g., plotting).
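To build intuition for this kind of clean-up, here is a toy backward slice over straight-line code. It is not LineaPy's implementation: LineaPy traces code as it executes, whereas this sketch only inspects the source text with Python's `ast` module (Python 3.8+ for `end_lineno`). The helper names and the `MUTATING_METHODS` allowlist are assumptions made for illustration, and the analysis is deliberately conservative (it never drops an earlier definition of a needed variable).

```python
import ast

# Assumption for this sketch: bare calls to these methods modify their receiver.
MUTATING_METHODS = {"fit"}


def _base_name(node):
    """Return the variable at the root of a target like a.b[c], else None."""
    while isinstance(node, (ast.Attribute, ast.Subscript)):
        node = node.value
    return node.id if isinstance(node, ast.Name) else None


def _reads_and_writes(stmt):
    """Collect the variable names a top-level statement reads and writes."""
    names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
    reads = {n.id for n in names if isinstance(n.ctx, ast.Load)}
    writes = {n.id for n in names if isinstance(n.ctx, ast.Store)}
    if isinstance(stmt, (ast.Import, ast.ImportFrom)):
        # "import a.b as c" binds c; "import a.b" binds a.
        writes |= {a.asname or a.name.split(".")[0] for a in stmt.names}
    elif isinstance(stmt, (ast.Assign, ast.AugAssign, ast.AnnAssign)):
        targets = stmt.targets if isinstance(stmt, ast.Assign) else [stmt.target]
        # df["x"] = ... or obj.attr = ... modifies the underlying variable.
        writes |= {name for name in map(_base_name, targets) if name}
    elif isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call):
        func = stmt.value.func
        if isinstance(func, ast.Attribute) and func.attr in MUTATING_METHODS:
            receiver = _base_name(func.value)
            if receiver:
                writes.add(receiver)
    return reads, writes


def backward_slice(source, target):
    """Return the top-level statements that `target` may depend on."""
    needed = {target}
    kept = []
    # Walk backwards: keep a statement if it writes something we still need.
    for stmt in reversed(ast.parse(source).body):
        reads, writes = _reads_and_writes(stmt)
        if writes & needed:
            kept.append(stmt)
            needed |= reads
    lines = source.splitlines()
    return "\n".join(
        "\n".join(lines[stmt.lineno - 1 : stmt.end_lineno])
        for stmt in reversed(kept)
    )


# A condensed version of the development code above (parsed, never executed).
EXAMPLE = '''\
import lineapy
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

url = "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
df = pd.read_csv(url)
df["d_versicolor"] = df["variety"].apply(lambda x: 1 if x == "Versicolor" else 0)
df.plot.scatter("petal.width", "sepal.width")
plt.show()
mod = LinearRegression()
mod.fit(X=df[["petal.width", "d_versicolor"]], y=df["sepal.width"])
'''

if __name__ == "__main__":
    print(backward_slice(EXAMPLE, "mod"))
```

For the condensed example, the slice keeps the pandas and scikit-learn imports, the data loading, the dummy variable, and the two model lines, and it drops `import lineapy`, the matplotlib import, and both plotting calls. A purely static approach like this needs the hand-written allowlist to know that `fit` changes `mod` while `scatter` does not change `df`, which hints at why tracing actual execution gives a more reliable picture.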

Let's say you're asked to retrain the model on a regular basis to account for any updates in the source data. You need to set up a pipeline to train the model — LineaPy makes this as simple as a single function call:

lineapy.to_pipeline(
    artifacts=["iris_model"],
    input_parameters=["url"],  # Specify variable(s) to parametrize
    pipeline_name="iris_model_pipeline",
    output_dir="output/",
    framework="AIRFLOW",
)

This command generates several files that can be used to execute the pipeline from the UI or CLI. (Check this tutorial for more details.)

In short, LineaPy automates time-consuming, manual steps in a data science workflow, helping us get our work to production more quickly and easily.

Usage Reporting

LineaPy collects anonymous usage data that helps our team to improve the product. Only LineaPy's API calls and CLI commands are reported. We strip out as much potentially sensitive information as possible, and we will never collect user code, data, variable names, or stack traces.

You can opt out of usage tracking by setting the following environment variable:

export LINEAPY_DO_NOT_TRACK=true
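In a notebook, where exporting a shell variable is not always convenient, the same flag can be set from Python. This is a minimal sketch, assuming LineaPy reads `LINEAPY_DO_NOT_TRACK` from the process environment when it is loaded, so the assignment should come before `import lineapy` or `%load_ext lineapy`:

```python
import os

# Set the opt-out flag for this process and any child processes it starts.
# Assumption: LineaPy checks this variable when it is imported or loaded,
# so run this line first in the session.
os.environ["LINEAPY_DO_NOT_TRACK"] = "true"
```

If you are unsure whether the variable was picked up, setting it in the shell before launching Jupyter, as shown above, remains the documented route.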

What Next?

To learn more about LineaPy, please check out the project documentation, which contains many examples you can follow along with. Some key resources include:

| Resource | Description |
| --- | --- |
| Docs | This is our knowledge hub; when in doubt, start here! |
| Concepts | Learn about key concepts underlying LineaPy! |
| Tutorials | These notebook tutorials will help you better understand core functionalities of LineaPy |
| Use Cases | These domain examples illustrate how LineaPy can help in real-world applications |
| API Reference | Need more technical details? This reference may help! |
| Contribute | Want to contribute? These instructions will help you get set up! |
| Slack | Have questions or issues unresolved? Join our community and ask away! |