Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0
197 stars 31 forks

Error with mlflow.pyfunc flavour #317

Closed nitishbharti closed 2 years ago

nitishbharti commented 2 years ago

I am getting the following error while running kedro with the pandas-iris starter: python_model must be a subclass of PythonModel. Instead, found an object of type: <class 'numpy.ndarray'>

My catalog looks like this:

example_model:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: kedro_mlflow.io.models.MlflowModelSaverDataSet
    flavor: mlflow.pyfunc
    pyfunc_workflow: python_model
    filepath: data/06_models/example_model

Galileo-Galilei commented 2 years ago

Hi @nitishbharti, sorry to hear you are encountering issues with the plugin.

Identifying the error

According to the error message, it seems you are trying to log a numpy array as a model. I suspect you are doing something like this:

# in nodes/train_functions.py
def train_model(hyperparameters):
    ...
    model.fit(data)
    return model, cv_results  # a tuple: the model first, then a numpy array

# in pipeline.py
def create_pipeline(**kwargs):
    return Pipeline(
        [
            ...,
            node(
                train_model,
                inputs=["hp_grid"],
                # wrong order: "example_model" actually receives the cross-validation results, which are a numpy array
                outputs=["cv_results", "example_model"],
            ),
        ]
    )
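
To make the ordering problem concrete: when outputs is a list, the values a node returns are paired with the output names purely by position. The snippet below imitates that pairing in plain Python, without kedro. bind_outputs is a hypothetical helper written for this illustration (it is not part of the kedro API), and the strings are stand-ins for the real objects.

```python
def bind_outputs(output_names, returned_values):
    """Pair each returned value with a dataset name, by position only.

    Hypothetical helper that imitates how a node's list of outputs is matched
    to the values the function returns; it is not part of the kedro API.
    """
    if len(output_names) != len(returned_values):
        raise ValueError("the node must return one value per output name")
    return dict(zip(output_names, returned_values))


def train_model():
    # Stand-ins for the real objects: a fitted model and a numpy array.
    model = "<fitted model>"
    cv_results = "<numpy array of cross-validation scores>"
    return model, cv_results


# Names listed in the wrong order: "example_model" receives the array.
wrong = bind_outputs(["cv_results", "example_model"], train_model())
# Names listed in the same order as the return statement.
right = bind_outputs(["example_model", "cv_results"], train_model())

print(wrong["example_model"])
print(right["example_model"])
```

If your pipeline has this shape, swapping the two names in outputs should be enough to send the model to the right catalog entry.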

Could you double check what example_model is? The easiest way is to create a debug environment:

# in conf/debug/catalog.yml
example_model:
    type: pickle.PickleDataSet
    filepath: data/06_models/example_model.pkl

and to run kedro run --pipeline=<your-pipeline> --env=debug

Then load it in a script/notebook:

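from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = Path("path/to/your/project")
bootstrap_project(PROJECT_PATH)
# the environment is selected when the session is created
with KedroSession.create(project_path=PROJECT_PATH, env="debug") as session:
    context = session.load_context()
    example_model = context.catalog.load("example_model")
    print(example_model)  # what is it?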

Even easier: launch the kedro jupyter notebook command and access the catalog object directly:

example_model=context.catalog.load("example_model")
print(example_model) # what is it?
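
Once the object is loaded, a quick check is whether it looks like a model at all. This is a minimal sketch, assuming that "looks like a model" simply means "has a callable predict method". describe and DummyModel are hypothetical names that exist only for this illustration, and the nested list stands in for the array of weights.

```python
def describe(obj):
    """Return the full type name of obj and whether it has a callable predict method."""
    has_predict = callable(getattr(obj, "predict", None))
    cls = type(obj)
    return f"{cls.__module__}.{cls.__qualname__} (predict method: {has_predict})"


class DummyModel:
    """Stand-in for a real estimator; it only exists for this illustration."""

    def predict(self, data):
        return [0 for _ in data]


print(describe(DummyModel()))
print(describe([[0.32, 1.04, -1.27]]))  # a plain list, standing in for an array of weights
```

Calling describe(example_model) in the notebook would show at a glance whether the catalog entry holds an estimator or a bare array.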

What to do if the error persists?

If you are absolutely sure the model is a sklearn model and it should work, please provide a minimal reproducible example so that I can investigate further. Something like:

conda create -n bug_pyfunc_flavor python==3.8 -y
conda activate bug_pyfunc_flavor 
pip install kedro==<your-version>
pip install kedro-mlflow==<your-version>
kedro new --starter=pandas-iris
cd pandas_iris
pip install -r src/requirements.txt

And then indicate what you have modified.

An example repo I can clone would be even better.

Some more comments

This is a bit off topic, but I have a couple of questions:

nitishbharti commented 2 years ago

Hi @Galileo-Galilei, my example model comes from the pandas-iris starter.

nodes.py

"""Example code for the nodes in the example pipeline. This code is meant
just for illustrating basic Kedro features.

Delete this when you start working on your own Kedro project.
"""
# pylint: disable=invalid-name

import logging
from typing import Any, Dict

import numpy as np
import pandas as pd

def train_model(
    train_x: pd.DataFrame, train_y: pd.DataFrame, parameters: Dict[str, Any]
) -> np.ndarray:
    """Node for training a simple multi-class logistic regression model. The
    number of training iterations as well as the learning rate are taken from
    conf/project/parameters.yml. All of the data as well as the parameters
    will be provided to this function at the time of execution.
    """
    num_iter = parameters["example_num_train_iter"]
    lr = parameters["example_learning_rate"]
    X = train_x.to_numpy()
    Y = train_y.to_numpy()

    # Add bias to the features
    bias = np.ones((X.shape[0], 1))
    X = np.concatenate((bias, X), axis=1)

    weights = []
    # Train one model for each class in Y
    for k in range(Y.shape[1]):
        # Initialise weights
        theta = np.zeros(X.shape[1])
        y = Y[:, k]
        for _ in range(num_iter):
            z = np.dot(X, theta)
            h = _sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / y.size
            theta -= lr * gradient
        # Save the weights for each model
        weights.append(theta)

    # Return a joint multi-class model with weights for all classes
    return np.vstack(weights).transpose()

def predict(model: np.ndarray, test_x: pd.DataFrame) -> np.ndarray:
    """Node for making predictions given a pre-trained model and a test set."""
    X = test_x.to_numpy()

    # Add bias to the features
    bias = np.ones((X.shape[0], 1))
    X = np.concatenate((bias, X), axis=1)

    # Predict "probabilities" for each class
    result = _sigmoid(np.dot(X, model))

    # Return the index of the class with max probability for all samples
    return np.argmax(result, axis=1)

def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    """Node for reporting the accuracy of the predictions performed by the
    previous node. Notice that this function has no outputs, except logging.
    """
    # Get true class index
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)

def _sigmoid(z):
    """A helper sigmoid function used by the training and the scoring nodes."""
    return 1 / (1 + np.exp(-z))

pipeline.py

"""Example code for the nodes in the example pipeline. This code is meant
just for illustrating basic Kedro features.

Delete this when you start working on your own Kedro project.
"""

from kedro.pipeline import node, pipeline

from .nodes import predict, report_accuracy, train_model

def create_pipeline(**kwargs):
    return pipeline(
        [
            node(
                train_model,
                ["example_train_x", "example_train_y", "parameters"],
                "example_model",
                name="train",
            ),
            node(
                predict,
                dict(model="example_model", test_x="example_test_x"),
                "example_predictions",
                name="predict",
            ),
            node(
                report_accuracy,
                ["example_predictions", "example_test_y"],
                None,
                name="report",
            ),
        ]
    )

catalog.yml

example_iris_data:
  type: pandas.CSVDataSet
  filepath: "data/01_raw_data/iris.csv"

example_train_x:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/example_train_x.csv

example_train_y:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/example_train_y.csv

example_test_x:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/example_test_x.csv

example_test_y:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/example_test_y.csv

example_model:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.pyfunc
  pyfunc_workflow: python_model

example_predictions:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: pickle.PickleDataSet
    filepath: data/07_model_output/example_predictions.pkl

example_metrics:
  type: kedro_mlflow.io.metrics.MlflowMetricsDataSet

Error that I am getting

kedro.io.core.DataSetError: Failed while saving data to data set MlflowModelLoggerDataSet(artifact_path=model, flavor=mlflow.pyfunc, load_args={}, pyfunc_workflow=python_model, save_args={'python_model': [[ 0.32379799  1.03632883 -1.26637114]
 [ 0.51940546  0.38198792 -1.91295155]
 [ 1.79030469 -1.56338428 -2.09118251]
 [-2.83214266  0.75899371  2.93411353]
 [-1.27816681 -1.78364037  2.94851662]]}).
`python_model` must be a subclass of `PythonModel`. Instead, found an object of type: <class 'numpy.ndarray'>

Galileo-Galilei commented 2 years ago

Yes, so your train_model function clearly returns a numpy array (np.vstack(weights).transpose()), as the error message already indicated.

What do you expect mlflow to do with it? This is not a model (i.e. it is not an object with a predict method), so the pyfunc flavor cannot log it. You can take a look at the mlflow documentation to see the list of supported model flavors. You should probably use a sklearn model instead.
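
If you would rather keep the hand-written logistic regression, one option is to wrap the weights in a PythonModel subclass, since that is what the pyfunc flavor asks for. The code below is only a sketch under assumptions: IrisWeightsModel is a hypothetical name, the predict logic is copied from the predict node of the starter, the weights are made-up numbers, and the try/except fallback is there only so the snippet can be run without mlflow installed.

```python
import numpy as np

try:
    from mlflow.pyfunc import PythonModel
except ImportError:
    # Fallback so the sketch can be run without mlflow installed;
    # with mlflow available, the real base class is used.
    class PythonModel:
        pass


class IrisWeightsModel(PythonModel):
    """Hypothetical wrapper that exposes the weight matrix through predict."""

    def __init__(self, weights):
        # weights has shape (n_features + 1, n_classes), as returned by train_model
        self.weights = np.asarray(weights)

    def predict(self, context, model_input):
        # Accept a pandas DataFrame or anything numpy can convert to an array
        X = np.asarray(model_input)
        # Add the bias column, as the predict node of the starter does
        bias = np.ones((X.shape[0], 1))
        X = np.concatenate((bias, X), axis=1)
        # Sigmoid score per class, then the index of the best class for each row
        scores = 1 / (1 + np.exp(-np.dot(X, self.weights)))
        return np.argmax(scores, axis=1)


# Two features plus bias, two classes; the numbers are made up for illustration
weights = np.array([[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]])
model = IrisWeightsModel(weights)
print(model.predict(None, np.array([[2.0, 0.0], [0.0, 2.0]])))
```

With such a wrapper, train_model would return IrisWeightsModel(np.vstack(weights).transpose()) instead of the bare array, and the predict node would need to call the model's predict method rather than multiply by the array directly.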

You can also: