Closed nitishbharti closed 2 years ago
Hi @nitishbharti, sorry to hear you are encountering issues with the plugin.
According to the error message, it seems you are trying to log a numpy array as a model. I suspect you are doing something like this:
# in nodes/train_functions.py
def train_model(hyperparameters):
    ...
    model.fit(data)
    return model, cv_results  # you return a tuple here, with the model and a numpy array

# in pipeline.py
def create_pipeline(**kwargs):
    Pipeline(...,
        node(train_model,
            inputs=["hp_grid"],
            outputs=["cv_results", "example_model"]  # wrong order: example model refers in reality to the cross validation results which is a numpy array
        ))
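To make the suspected mix-up concrete, here is a minimal sketch in plain Python, without Kedro. It assumes outputs are matched to names purely by position; `bind_outputs` and `FakeModel` are hypothetical stand-ins written for this illustration, not part of Kedro or kedro-mlflow.

```python
class FakeModel:
    """Hypothetical stand-in for a fitted estimator."""

    def predict(self, rows):
        return [0 for _ in rows]


def train_model(hyperparameters):
    model = FakeModel()
    cv_results = [0.91, 0.88, 0.93]  # made-up scores, for illustration only
    return model, cv_results  # the model comes first, the results second


def bind_outputs(func, output_names, *args):
    """Pair each returned value with the output name at the same position."""
    return dict(zip(output_names, func(*args)))


# Names listed in the wrong order: "example_model" receives the results.
wrong = bind_outputs(train_model, ["cv_results", "example_model"], {})
# Names listed in the same order as the returned tuple.
right = bind_outputs(train_model, ["example_model", "cv_results"], {})

print(type(wrong["example_model"]).__name__)
print(type(right["example_model"]).__name__)
```

If the names and the returned tuple disagree, nothing fails at binding time; the error only appears later, when the dataset tries to save something that is not a model.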
Could you double check what example_model is? The easiest way is to create a debug environment:
# in conf/debug/catalog.yml
example_model:
  type: pickle.PickleDataSet
  filepath: data/06_models/example_model.pkl
and to run kedro run --pipeline=<your-pipeline> --env=debug
Then load it in a script/notebook:
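from pathlib import Path
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = Path("path/to/your/project")
bootstrap_project(PROJECT_PATH)
# the environment is selected when the session is created
with KedroSession.create(project_path=PROJECT_PATH, env="debug") as session:
    context = session.load_context()
    example_model = context.catalog.load("example_model")
    print(example_model)  # what is it?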
Even easier, launch the kedro jupyter notebook command and directly access the catalog object:
example_model=context.catalog.load("example_model")
print(example_model) # what is it?
If you are absolutely sure the model is a sklearn model and it should work, please provide a minimal reproducible example so that I can investigate further. Something like:
conda create -n bug_pyfunc_flavor python==3.8 -y
conda activate bug_pyfunc_flavor
pip install kedro==<your-version>
pip install kedro-mlflow==<your-version>
kedro new --starter=pandas-iris
cd pandas_iris
pip install -r src/requirements.txt
And then indicate what you have modified.
An example repo I can clone would be even better.
This is a bit off topic, but I have a couple of questions:
Why do you use the mlflow.pyfunc flavor? I think that if you use the demo tutorial, a simple mlflow.sklearn flavor should work directly.
kedro-mlflow provides a KedroPipelineModel custom model to log entire kedro pipelines as a model, including preprocessing, post-processing and all your artifacts (encoder...).

Hi @Galileo-Galilei, my example model is pandas-iris.
nodes.py
"""Example code for the nodes in the example pipeline. This code is meant
just for illustrating basic Kedro features.
Delete this when you start working on your own Kedro project.
"""
# pylint: disable=invalid-name
import logging
from typing import Any, Dict
import numpy as np
import pandas as pd
def train_model(
    train_x: pd.DataFrame, train_y: pd.DataFrame, parameters: Dict[str, Any]
) -> np.ndarray:
    """Node for training a simple multi-class logistic regression model. The
    number of training iterations as well as the learning rate are taken from
    conf/project/parameters.yml. All of the data as well as the parameters
    will be provided to this function at the time of execution.
    """
    num_iter = parameters["example_num_train_iter"]
    lr = parameters["example_learning_rate"]
    X = train_x.to_numpy()
    Y = train_y.to_numpy()
    # Add bias to the features
    bias = np.ones((X.shape[0], 1))
    X = np.concatenate((bias, X), axis=1)
    weights = []
    # Train one model for each class in Y
    for k in range(Y.shape[1]):
        # Initialise weights
        theta = np.zeros(X.shape[1])
        y = Y[:, k]
        for _ in range(num_iter):
            z = np.dot(X, theta)
            h = _sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / y.size
            theta -= lr * gradient
        # Save the weights for each model
        weights.append(theta)
    # Return a joint multi-class model with weights for all classes
    return np.vstack(weights).transpose()
def predict(model: np.ndarray, test_x: pd.DataFrame) -> np.ndarray:
    """Node for making predictions given a pre-trained model and a test set."""
    X = test_x.to_numpy()
    # Add bias to the features
    bias = np.ones((X.shape[0], 1))
    X = np.concatenate((bias, X), axis=1)
    # Predict "probabilities" for each class
    result = _sigmoid(np.dot(X, model))
    # Return the index of the class with max probability for all samples
    return np.argmax(result, axis=1)
def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    """Node for reporting the accuracy of the predictions performed by the
    previous node. Notice that this function has no outputs, except logging.
    """
    # Get true class index
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)
def _sigmoid(z):
    """A helper sigmoid function used by the training and the scoring nodes."""
    return 1 / (1 + np.exp(-z))
pipeline.py
"""Example code for the nodes in the example pipeline. This code is meant
just for illustrating basic Kedro features.
Delete this when you start working on your own Kedro project.
"""
from kedro.pipeline import node, pipeline
from .nodes import predict, report_accuracy, train_model
def create_pipeline(**kwargs):
    return pipeline(
        [
            node(
                train_model,
                ["example_train_x", "example_train_y", "parameters"],
                "example_model",
                name="train",
            ),
            node(
                predict,
                dict(model="example_model", test_x="example_test_x"),
                "example_predictions",
                name="predict",
            ),
            node(
                report_accuracy,
                ["example_predictions", "example_test_y"],
                None,
                name="report",
            ),
        ]
    )
catalog.yml
example_iris_data:
  type: pandas.CSVDataSet
  filepath: "data/01_raw_data/iris.csv"
example_train_x:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/example_train_x.csv
example_train_y:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/example_train_y.csv
example_test_x:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/example_test_x.csv
example_test_y:
  type: pandas.CSVDataSet
  filepath: data/05_model_input/example_test_y.csv
example_model:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.pyfunc
  pyfunc_workflow: python_model
example_predictions:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: pickle.PickleDataSet
    filepath: data/07_model_output/example_predictions.pkl
example_metrics:
  type: kedro_mlflow.io.metrics.MlflowMetricsDataSet
Error that I am getting
kedro.io.core.DataSetError: Failed while saving data to data set MlflowModelLoggerDataSet(artifact_path=model, flavor=mlflow.pyfunc, load_args={}, pyfunc_workflow=python_model, save_args={'python_model': [[ 0.32379799 1.03632883 -1.26637114]
[ 0.51940546 0.38198792 -1.91295155]
[ 1.79030469 -1.56338428 -2.09118251]
[-2.83214266 0.75899371 2.93411353]
[-1.27816681 -1.78364037 2.94851662]]}).
`python_model` must be a subclass of `PythonModel`. Instead, found an object of type: <class 'numpy.ndarray'>
Yes, so your train_model function clearly returns a numpy array (np.vstack(weights).transpose()), which we already knew according to the error.
What do you expect mlflow to do with it? This is not a model (i.e. it is not a class with a predict method). You can take a look at mlflow documentation to see the list of model flavors which are supported. You should likely use a sklearn model instead.
You can also use the kedro mlflow modelify command to convert the entire pipeline into a model, but I really think that using a default sklearn model is enough for demo purposes.
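If keeping the hand-written weights matters more than switching to sklearn, one possible route is to wrap them in a class with a predict method, which is what the error message asks for. The sketch below is an assumption-laden illustration, not kedro-mlflow's recommended approach: `WeightsModel` is a hypothetical name, the placeholder base class exists only so the snippet runs where mlflow is not installed, and the arithmetic is written in plain Python rather than numpy. It skips the sigmoid because the sigmoid is monotonic, so the argmax over raw scores is the same.

```python
try:
    from mlflow.pyfunc import PythonModel
except ImportError:
    class PythonModel:  # placeholder base class, only so the sketch runs without mlflow
        pass


class WeightsModel(PythonModel):
    """Hypothetical wrapper around the (n_features + 1, n_classes) weight matrix."""

    def __init__(self, weights):
        # Accepts a 2D numpy array or a nested list; stores plain floats.
        self.weights = [[float(w) for w in row] for row in weights]

    def predict(self, context, model_input, params=None):
        # Accept either a pandas DataFrame or a list of rows.
        if hasattr(model_input, "to_numpy"):
            rows = model_input.to_numpy().tolist()
        else:
            rows = [list(r) for r in model_input]
        n_classes = len(self.weights[0])
        predictions = []
        for row in rows:
            x = [1.0] + [float(v) for v in row]  # add the bias term first
            scores = [
                sum(xj * wj[k] for xj, wj in zip(x, self.weights))
                for k in range(n_classes)
            ]
            # Index of the highest score (first one wins on ties).
            predictions.append(max(range(n_classes), key=scores.__getitem__))
        return predictions


# Tiny made-up example: one feature, two classes.
demo = WeightsModel([[0.0, 0.0], [1.0, -1.0]])
print(demo.predict(None, [[2.0], [-2.0]]))
```

With something like this, train_model would return WeightsModel(np.vstack(weights).transpose()) instead of the bare array, and the predict node would call model.predict(None, test_x) rather than redoing the matrix product itself.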
I am getting the following error while running kedro with the pandas-iris starter:
`python_model` must be a subclass of `PythonModel`. Instead, found an object of type: <class 'numpy.ndarray'>
My catalog looks like:
example_model:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: kedro_mlflow.io.models.MlflowModelSaverDataSet
    flavor: mlflow.pyfunc
    pyfunc_workflow: python_model
    filepath: data/06_models/example_model