
Kedro MLflow Tutorial

Introduction

Pre-requisite

This tutorial assumes the user is familiar with Kedro. We will refer to the kedro>=0.18.0, <0.19.0 project template files.

If you want to check out this tutorial for an older kedro version, see:

Goal of the tutorial

This tutorial shows how to use the kedro-mlflow plugin as an MLOps framework.

Specifically, it focuses on how one can use pipeline_ml_factory to maintain consistency between training and inference and to prepare deployment. It shows best practices for code organization that ensure an easy transition to deployment and strong reproducibility.

We will not emphasize kedro-mlflow's advanced versioning capabilities, including automatic parameter tracking; have a look at the documentation to see everything it can do!

Disclaimer

This is NOT a Kaggle competition. I will not try to create the best model (nor even a good model) to solve this problem. This should not be considered an example of data science best practices for training a model. Each pipeline node has a specific educational purpose and explains one use case kedro-mlflow can handle.

Installation

  1. Clone the repo:
git clone https://github.com/Galileo-Galilei/kedro-mlflow-tutorial
cd kedro-mlflow-tutorial
  2. Install dependencies:
conda create -n kedro_mlflow_tutorial python=3.9
conda activate kedro_mlflow_tutorial
pip install -e src

Note: You don't need to call the kedro mlflow init command as you would in a freshly created repo, since the mlflow.yml is already pre-configured.

Project components

Introducing the task and dataset

We will use the IMDB movie review dataset as an example. This dataset contains 50k movie reviews, each manually labelled as "positive" or "negative" by a human.

We will train a binary classifier to predict the sentiment associated with a movie review.

You can find many notebooks on Kaggle to learn more about this dataset.

Folder architecture

The project is divided into 3 applications, i.e. subfolders in src/kedro_mlflow_tutorial/pipelines: etl_app, ml_app and user_app. The reasons for such a division are detailed in kedro-mlflow's documentation.

For the sake of simplicity and for educational purposes, we will keep the etl and user_app pipelines very simple and focus on the ml pipelines. In real life, etl and user_app may be very complex.

Create the instances and labels datasets

To create the instances and labels datasets, run the etl pipelines:

kedro run --pipeline=etl_instances
kedro run --pipeline=etl_labels

Since these datasets are persisted through entries in the catalog.yml file, you will be able to reuse them afterwards.
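For instance, the instances entry could look like this (a sketch: the dataset type and path are illustrative, not necessarily what this repo uses):

# catalog.yml

instances:
  type: pandas.ParquetDataSet  # any persistent (non-memory) dataset type works
  filepath: data/03_primary/instances.parquet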

Note: You can [change the huggingface_split parameters in globals.yml](https://github.com/Galileo-Galilei/kedro-mlflow-tutorial/blob/main/conf/base/globals.yml#L1) and rerun the pipelines to create test data.
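For reference, pulling a given split from the Hugging Face hub can be done along these lines (a sketch: the function name extract_reviews is hypothetical, not necessarily how the etl pipeline is implemented):

import pandas as pd
from datasets import load_dataset  # Hugging Face "datasets" library


def extract_reviews(huggingface_split: str = "train") -> pd.DataFrame:
    # download the requested IMDB split and return it as a DataFrame
    # with "text" and "label" columns
    dataset = load_dataset("imdb", split=huggingface_split)
    return dataset.to_pandas()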

Pipeline packaging and autologging

Bind your training and inference pipelines declaratively

The key part is to convert your training pipeline from a Pipeline kedro object to a PipelineML kedro-mlflow object.

This can be done in the pipeline_registry.py file thanks to the pipeline_ml_factory helper function.

The register_pipelines function of pipeline_registry.py looks like this (the snippet below is slightly simplified for readability):

from typing import Dict

from kedro.pipeline import Pipeline
from kedro_mlflow.pipeline import pipeline_ml_factory

from kedro_mlflow_tutorial.pipelines.ml_app.pipeline import create_ml_pipeline

# PROJECT_VERSION (used below) is the version of the packaged project

...

def register_pipelines() -> Dict[str, Pipeline]:

    ...

    ml_pipeline = create_ml_pipeline()
    inference_pipeline = ml_pipeline.only_nodes_with_tags("inference")
    training_pipeline_ml = pipeline_ml_factory(
        training=ml_pipeline.only_nodes_with_tags("training"),
        inference=inference_pipeline,
        input_name="instances",
        log_model_kwargs=dict(
            artifact_path="kedro_mlflow_tutorial",
            conda_env={
                "python": 3.9.12,
                "build_dependencies": ["pip"],
                "dependencies": [f"kedro_mlflow_tutorial=={PROJECT_VERSION}"],
            },
            signature="auto",
        ),
    )
    ...

    return {
        "training": training_pipeline_ml,
    }

Let's break it down:

Create your ml application

The ml application (which contains both the training and inference pipelines) can be created step by step. The goal is to tag each node as either ["training"], ["inference"] or ["training", "inference"]. This makes it possible to share nodes between the two pipelines and to ensure their consistency.
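A minimal sketch of such tagging (the node functions below are illustrative stubs, not the repo's actual nodes):

from kedro.pipeline import Pipeline, node


def preprocess_text(instances):
    ...  # shared preprocessing, identical at training and inference time


def train_model(clean_instances, labels):
    ...  # training-only: fit the classifier


def predict(model, clean_instances):
    ...  # inference-only: predict with the fitted classifier


def create_ml_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            # shared node: tagged for both pipelines
            node(preprocess_text, "instances", "clean_instances", tags=["training", "inference"]),
            # training-only node: produces the fitted model
            node(train_model, ["clean_instances", "labels"], "model", tags=["training"]),
            # inference-only node: consumes the fitted model
            node(predict, ["model", "clean_instances"], "predictions", tags=["inference"]),
        ]
    )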

You can encounter the following use cases:

An input needed by the inference pipeline (e.g. a list of stopwords) must be persisted locally so that mlflow can log it with the model:

# catalog.yml

english_stopwords:
  type: yaml.YAMLDataSet  # <- This must be any Kedro Dataset other than "MemoryDataSet"
  filepath: data/01_raw/stopwords.yml  # <- This must be a local path, no matter what your mlflow storage is (S3 or other)

A reporting object produced during training (e.g. a plot) can be logged as a run artifact with a MlflowArtifactDataSet wrapper:

# catalog.yml

xgb_feature_importance:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: matplotlib.MatplotlibWriter
    filepath: data/08_reporting/xgb_feature_importance.png

An object fitted during training and reused at inference time (e.g. a label encoder) must be persisted so that it can be packaged with the model:

# catalog.yml

label_encoder:
  type: pickle.PickleDataSet
  filepath: data/06_models/label_encoder.pkl

See autologging in action

Once you have declared your training pipeline as a PipelineML object, the associated inference pipeline will be logged to mlflow automatically at the end of the training execution, along with everything it needs to run: the fitted artifacts, the conda environment and the model signature.

  1. Run the pipeline:
kedro run --pipeline=training
  2. Open the UI:
kedro mlflow ui

[screenshot: the mlflow landing page]

  3. Navigate to the last "training" run:

[screenshot: the mlflow run page]

The parameters have been automatically recorded! For metrics, you can declare them in the catalog.yml so that they are logged in mlflow too.
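For instance, a float returned by a node can be logged as a mlflow metric by declaring it with a MlflowMetricDataSet (a sketch: the dataset name and key are illustrative):

# catalog.yml

my_model_accuracy:
  type: kedro_mlflow.io.metrics.MlflowMetricDataSet
  key: accuracy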

  4. Go to the artifacts section:

[screenshot: the pipeline_inference_model in the artifacts section]

You can see the whole inference pipeline logged as a mlflow model, together with the artifacts it needs and its conda environment.

In this picture, we can also see the extra image "xgb_feature_importance.png" logged after model training.

By following these simple steps (basically ~5 lines of code to declare our training and inference pipelines in pipeline_registry.py with pipeline_ml_factory), we get perfect synchronization between our training and inference pipelines. Each code change (adding a node, modifying a function), parameter change or data change (through artifact fitting) is automatically resolved. You are now sure that you will be able to predict from any old run in one line of code!

Serve the inference pipeline to an end user

Scenario 1: Reuse from a Python script

If anyone else wants to reuse your model from Python, mlflow's load_model function is what you need:

PROJECT_PATH = r"<your/project/path>"
RUN_ID = "<your-run-id>"

from kedro.framework.startup import bootstrap_project
from kedro.framework.session import KedroSession
from mlflow.pyfunc import load_model

bootstrap_project(PROJECT_PATH)
session = KedroSession.create(
    package_name="kedro_mlflow_tutorial",
    project_path=PROJECT_PATH,
)
local_context = session.load_context()  # setup mlflow config

instances = local_context.catalog.load("instances")
model = load_model(f"runs:/{RUN_ID}/kedro_mlflow_tutorial")

predictions = model.predict(instances)

The predictions object is a pandas.DataFrame and can be handled as usual.

Scenario 2: Reuse in a kedro pipeline

Say that you want to reuse this trained model in a kedro Pipeline, like the user_app. The easiest way to do it is to add the model to the catalog.yml file:

pipeline_inference_model:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.pyfunc
  pyfunc_workflow: python_model
  artifact_path: kedro_mlflow_tutorial  # the name of your mlflow folder = the model_name in pipeline_ml_factory
  run_id: <your-run-id>  # put it in globals.yml to help people find out what to modify
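If your project uses the TemplatedConfigLoader, the run id can live in globals.yml and be referenced from the catalog (a sketch: the key name mlflow_run_id is illustrative):

# globals.yml
mlflow_run_id: <your-run-id>

# catalog.yml
pipeline_inference_model:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.pyfunc
  pyfunc_workflow: python_model
  artifact_path: kedro_mlflow_tutorial
  run_id: ${mlflow_run_id}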

Then you can reuse it in a node to predict with this model, which is the entire inference pipeline as it was at the time you launched the training.

An example is given in the user_app folder.
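A node consuming this model can be as simple as the following sketch (the function name is illustrative):

import pandas as pd


def predict_with_trained_model(pipeline_inference_model, instances: pd.DataFrame) -> pd.DataFrame:
    # the catalog loads ``pipeline_inference_model`` as a mlflow pyfunc
    # model, so it exposes the standard ``predict`` method
    return pipeline_inference_model.predict(instances)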

To try it out, run the user_app pipeline (assuming it is registered under that name in pipeline_registry.py):
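kedro run --pipeline=user_app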

Scenario 3: Serve the model with mlflow

The two previous scenarios assume that your end user will use Python (or, even more restrictively, kedro) to load the model and predict with it. For many applications, the real "user app" which consumes your pipeline is not written in Python, and is not even aware of your code.

Fortunately, mlflow provides helpers to serve the model as an API with one line of code:

mlflow models serve -m "runs:/<your-model-run-id>/kedro_mlflow_tutorial"

This will serve your model as an API (beware: there are known issues on Windows). You can test it with:

curl -d "{\"columns\":[\"text\"],\"index\":[0,1],\"data\":[[\"This movie is cool\"],[\"awful film\"]]}" -H "Content-Type: application/json"  localhost:5000/invocations
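The same request can be sent from Python (a sketch, assuming the server runs locally on the default port 5000; note that mlflow 2.x changed the expected payload format to {"dataframe_split": ...}):

import requests

# payload in the pandas "split" orientation, matching the curl example above
payload = {
    "columns": ["text"],
    "index": [0, 1],
    "data": [["This movie is cool"], ["awful film"]],
}
response = requests.post(
    "http://localhost:5000/invocations",
    json=payload,
    headers={"Content-Type": "application/json"},
)
print(response.json())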

The most common way to deploy it is to dockerize it, but this is beyond the scope of this tutorial. Mlflow provides extensive documentation on deployment to different target platforms.