Hi Francisco -- Were you able to serve this model locally using MLServer before using KServe?
@ramonpzg As described in the Slack channel:
I did follow the instructions in the MLServer docs:
MLflow: https://mlserver.readthedocs.io/en/latest/examples/mlflow/README.html
KServe: https://mlserver.readthedocs.io/en/latest/user-guide/deployment/kserve.html
I also have all the dependency files in the model directory, as you can see in the screenshot. From the docs available to me, it seemed that further environment configuration wasn't necessary, especially because MLServer simply works locally and seems to set up everything it needs from the files in the model directory.
As far as I know, there were no instructions on how to generate or use the environment.tar.gz file. Does it need to be in a specific folder within the model directory? It would be great if that could be added to the docs.
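(For reference, the model directory of an MLflow pyfunc model typically contains something like the following listing; exact contents vary by flavor and by how the model was logged:)
MLmodel
conda.yaml
python_env.yaml
requirements.txt
python_model.pkl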
Even when I tarball the requirements, conda, and python_env files, it doesn't find them:
--> Unpacking environment at /mnt/models/environment.tar.gz...
requirements.txt
python_env.yaml
conda.yaml
Environment not found at './envs/environment'
Do the dependency files need to be inside ./envs/environment? It's unclear where exactly they need to be and what the working directory is.
So I tried tarballing that folder structure, but that wasn't it either:
--> Unpacking environment at /mnt/models/environment.tar.gz...
envs/
envs/environment/
envs/environment/conda.yaml
envs/environment/python_env.yaml
envs/environment/requirements.txt
envs/environments/
Environment not found at './envs/environment'
Finally, I followed this doc: I tarballed the conda env with conda-pack and redeployed, and now I am getting this 😅
lib/python3.10/site-packages/mlflow/metrics/genai/prompts/__init__.py
lib/python3.10/site-packages/pandas/io/excel/_odfreader.py
bin/activate
bin/deactivate
bin/conda-unpack
--> Sourcing new environment at ./envs/environment...
--> Calling conda-unpack...
usage: conda-unpack [-h] [--version]
conda-unpack: error: unrecognized arguments: --quiet
--> Disabling user-installed packages...
Can I influence the commands that are run after unpacking the environment?
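For reference, a minimal sketch of how such a tarball can be produced with conda-pack's Python API (the environment name "success6g-env" is a placeholder, not the actual env name):

import conda_pack

# Pack the existing conda environment into a relocatable tarball; the
# serving image unpacks it under ./envs/environment (see the logs above)
# and then calls conda-unpack to fix up hard-coded path prefixes.
conda_pack.pack(
    name="success6g-env",         # placeholder: name of the conda env to pack
    output="environment.tar.gz",  # upload this next to the model artifacts
)

The CLI equivalent is conda pack -n <env-name> -o environment.tar.gz.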
Thanks for sharing the additional info, Francisco. You mentioned MLServer is working as intended locally but not at the KServe level. I'm wondering if you should ask them what the cause might be, as KServe is its own project.
The docs do have an example of how to use a custom conda environment. Here is the link.
I used the custom environments, but that didn't change anything. The KServe maintainers were able to help out: https://github.com/kserve/kserve/issues/3733
Still, I think the environment should be created from the dependency files that are in the model directory, instead of me having to tarball a 250MB conda environment.
@ramonpzg I was trying to figure out the issue connected to what @fschlz mentioned here and also in https://github.com/kserve/kserve/issues/3733 ... short summary:
- I logged the model as an mlflow.pyfunc.PythonModel, here, using this notebook
- I used mlserver==1.3.5 and mlserver-mlflow==1.3.5
- I packed the environment following this link, and the environment started to be active (i.e. the correct Python version shows in the logs)
- but the model fails to load with mlserver.parallel.errors.WorkerError: builtins.ModuleNotFoundError: No module named 'inference_model'
see logs below:
pmulinka@saiacheron:~/kubernetes/kserve$ kubectl logs custom-success6g-model-predictor-84b9f79b96-bxfnr
Defaulted container "kserve-container" out of: kserve-container, storage-initializer (init)
--> Unpacking environment at /mnt/models/environment.tar.gz...
--> Sourcing new environment at ./envs/environment...
--> Calling conda-unpack...
--> Disabling user-installed packages...
/opt/mlserver/envs/environment/lib/python3.10/site-packages/starlette_exporter/middleware.py:97: FutureWarning: group_paths and filter_unhandled_paths will change defaults from False to True in the next release. See https://github.com/stephenhillier/starlette_exporter/issues/79 for more info
warnings.warn(
2024-06-17 20:45:34,914 [mlserver.parallel] DEBUG - Starting response processing loop...
2024-06-17 20:45:34,917 [mlserver.rest] INFO - HTTP server running on http://0.0.0.0:8080
INFO: Started server process [1]
INFO: Waiting for application startup.
2024-06-17 20:45:35,020 [mlserver.metrics] INFO - Metrics server running on http://0.0.0.0:8082
2024-06-17 20:45:35,020 [mlserver.metrics] INFO - Prometheus scraping endpoint can be accessed on http://0.0.0.0:8082/metrics
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
2024-06-17 20:45:37,323 [mlserver.grpc] INFO - gRPC server running on http://0.0.0.0:9000
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO: Uvicorn running on http://0.0.0.0:8082 (Press CTRL+C to quit)
2024-06-17 20:45:38,909 [mlserver] INFO - Couldn't load model 'custom-success6g-model'. Model will be removed from registry.
2024-06-17 20:45:38,909 [mlserver.parallel] ERROR - An error occurred processing a model update of type 'Load'.
Traceback (most recent call last):
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/parallel/worker.py", line 158, in _process_model_update
await self._model_registry.load(model_settings)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/registry.py", line 293, in load
return await self._models[model_settings.name].load(model_settings)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/registry.py", line 148, in load
await self._load_model(new_model)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/registry.py", line 165, in _load_model
model.ready = await model.load()
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver_mlflow/runtime.py", line 155, in load
self._model = mlflow.pyfunc.load_model(model_uri)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlflow/tracing/provider.py", line 251, in wrapper
return f(*args, **kwargs)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py", line 1028, in load_model
raise e
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py", line 1013, in load_model
model_impl = importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlflow/pyfunc/model.py", line 550, in _load_pyfunc
context, python_model, signature = _load_context_model_and_signature(model_path, model_config)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlflow/pyfunc/model.py", line 533, in _load_context_model_and_signature
python_model = cloudpickle.load(f)
ModuleNotFoundError: No module named 'inference_model'
2024-06-17 20:45:38,912 [mlserver] INFO - Couldn't load model 'custom-success6g-model'. Model will be removed from registry.
2024-06-17 20:45:38,917 [mlserver.parallel] ERROR - An error occurred processing a model update of type 'Unload'.
Traceback (most recent call last):
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/parallel/worker.py", line 160, in _process_model_update
await self._model_registry.unload_version(
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/registry.py", line 302, in unload_version
await model_registry.unload_version(version)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/registry.py", line 201, in unload_version
model = await self.get_model(version)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/registry.py", line 237, in get_model
raise ModelNotFound(self._name, version)
mlserver.errors.ModelNotFound: Model custom-success6g-model not found
2024-06-17 20:45:38,918 [mlserver] ERROR - Some of the models failed to load during startup!
Traceback (most recent call last):
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/server.py", line 125, in start
await asyncio.gather(
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/registry.py", line 293, in load
return await self._models[model_settings.name].load(model_settings)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/registry.py", line 148, in load
await self._load_model(new_model)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/registry.py", line 161, in _load_model
model = await callback(model)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/parallel/registry.py", line 152, in load_model
loaded = await pool.load_model(model)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/parallel/pool.py", line 74, in load_model
await self._dispatcher.dispatch_update(load_message)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 113, in dispatch_update
return await asyncio.gather(
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 128, in _dispatch_update
return await self._dispatch(worker_update)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 138, in _dispatch
return await self._wait_response(internal_id)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 144, in _wait_response
inference_response = await async_response
mlserver.parallel.errors.WorkerError: builtins.ModuleNotFoundError: No module named 'inference_model'
2024-06-17 20:45:38,919 [mlserver.parallel] INFO - Waiting for shutdown of default inference pool...
2024-06-17 20:45:39,155 [mlserver.parallel] INFO - Shutdown of default inference pool complete
2024-06-17 20:45:39,155 [mlserver.grpc] INFO - Waiting for gRPC server shutdown
2024-06-17 20:45:39,157 [mlserver.grpc] INFO - gRPC server shutdown complete
INFO: Shutting down
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1]
INFO: Application shutdown complete.
INFO: Finished server process [1]
Traceback (most recent call last):
File "/opt/mlserver/envs/environment/bin/mlserver", line 8, in <module>
sys.exit(main())
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/cli/main.py", line 263, in main
root()
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/cli/main.py", line 23, in wrapper
return asyncio.run(f(*args, **kwargs))
File "/opt/mlserver/envs/environment/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/cli/main.py", line 47, in start
await server.start(models_settings)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/server.py", line 137, in start
await servers_task
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/mlserver/rest/server.py", line 71, in start
await self._server.serve()
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/uvicorn/server.py", line 68, in serve
with self.capture_signals():
File "/opt/mlserver/envs/environment/lib/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/opt/mlserver/envs/environment/lib/python3.10/site-packages/uvicorn/server.py", line 328, in capture_signals
signal.raise_signal(captured_signal)
TypeError: 'NoneType' object cannot be interpreted as an integer
I was trying to find out what No module named 'inference_model' could mean, and found a similar No module named 'inference_model' issue, but I am lost...
Additional info: Seldon deployment:
pmulinka@saiacheron:~/kubernetes/seldon-core$ cat mlflow-seldon-core-success6g_model_uri.yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: "success6g-model"
  namespace: "mlflow-seldon-core-success6g"
spec:
  protocol: v2 # Activate the v2 protocol
  name: wines
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: "s3://mlflow/3/e3976058b0604e6fa40c070b196672bb/artifacts/custom-success6g-model"
        envSecretRefName: seldon-rclone-secret
        name: classifier
      name: default
      replicas: 1
KServe deployment:
pmulinka@saiacheron:~/kubernetes$ cat kserve/mlflow-kserve-success6g_model_uri.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "custom-success6g-model"
namespace: "mlflow-kserve-success6g"
spec:
predictor:
serviceAccountName: success6g
model:
modelFormat:
name: mlflow
protocolVersion: v2
storageUri: "s3://mlflow/3/e3976058b0604e6fa40c070b196672bb/artifacts/custom-success6g-model"
FYI, this clearly has something to do with the combination of mlflow + mlserver + pyfunc ... just for a test, I tried to save the model to MLflow using mlflow.lightgbm.log_model, as I am using a LightGBM model as part of a Trainer object that does preprocessing of the data and some other functionality. Deploying a model that was saved like this (using the KServe deployment from the previous post) works:
mlflow.lightgbm.log_model(
    lgb_model=trainer.model,
    artifact_path=artifact_path,
    registered_model_name=registered_model_name)
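This works because the LightGBM flavor serializes only the booster itself, which the serving environment can deserialize without importing any project code. A quick smoke test of such a logged model might look like this (the models:/ URI is illustrative):

import mlflow.pyfunc

# Illustrative URI; any runs:/ or s3:// URI pointing at the logged model works.
model = mlflow.pyfunc.load_model("models:/custom-success6g-model/1")
predictions = model.predict(input_df)  # input_df: a pandas DataFrame of features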
Output after saving just the LightGBM model:
pmulinka@saiacheron:~/kubernetes/kserve$ kubectl logs custom-success6g-model-predictor-78f74f554c-829j2
Defaulted container "kserve-container" out of: kserve-container, storage-initializer (init)
--> Unpacking environment at /mnt/models/environment.tar.gz...
--> Sourcing new environment at ./envs/environment...
--> Calling conda-unpack...
--> Disabling user-installed packages...
/opt/mlserver/envs/environment/lib/python3.10/site-packages/starlette_exporter/middleware.py:97: FutureWarning: group_paths and filter_unhandled_paths will change defaults from False to True in the next release. See https://github.com/stephenhillier/starlette_exporter/issues/79 for more info
warnings.warn(
2024-06-17 21:08:34,330 [mlserver.parallel] DEBUG - Starting response processing loop...
2024-06-17 21:08:34,392 [mlserver.rest] INFO - HTTP server running on http://0.0.0.0:8080
INFO: Started server process [1]
INFO: Waiting for application startup.
2024-06-17 21:08:34,503 [mlserver.metrics] INFO - Metrics server running on http://0.0.0.0:8082
2024-06-17 21:08:34,504 [mlserver.metrics] INFO - Prometheus scraping endpoint can be accessed on http://0.0.0.0:8082/metrics
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
2024-06-17 21:08:36,835 [mlserver.grpc] INFO - gRPC server running on http://0.0.0.0:9000
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO: Uvicorn running on http://0.0.0.0:8082 (Press CTRL+C to quit)
2024/06/17 21:08:38 WARNING mlflow.utils.requirements_utils: Detected one or more mismatches between the model's dependencies and the current Python environment:
- cffi (current: 1.16.0, required: cffi==1.15.1)
To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.
2024-06-17 21:08:40,606 [mlserver] INFO - Loaded model 'custom-success6g-model' succesfully.
2024-06-17 21:08:40,616 [mlserver] INFO - Loaded model 'custom-success6g-model' succesfully.
Deploying a model that was saved like this does not work (i.e. the No module named 'inference_model' issue):
mlflow.pyfunc.log_model(python_model=trainer,
                        artifact_path=artifact_path,
                        registered_model_name=registered_model_name)
ok, ehm, I am deeply sorry, it seems like I was looking for inference_model all over the internet, and it is the code connected to my own Trainer object ... when I included the code in mlflow.pyfunc.log_model it started to work ... sorry ...
mlflow.pyfunc.log_model(python_model=trainer,
                        artifact_path=artifact_path,
                        registered_model_name=registered_model_name,
                        code_paths=["inference_model"])
No problem. I am glad you figured it out @5uperpalo and @fschlz. I will go ahead and close out this issue. If anything changes, please open it again with the details of what has changed. Thanks :)
What steps did you take and what happened:
Hey guys, I am trying to use KServe on AKS. I installed all the dependencies on AKS and am trying to deploy a test inference service. However, the model isn't getting loaded correctly. Locally, everything works fine. Unfortunately, the service doesn't seem to recognize the model files I have registered. Plus, the environment that is created doesn't seem to respect the version numbers that are set in requirements.txt.
Does anyone know what could be wrong?
What did you expect to happen:
Deploy the inference service.
What's the InferenceService yaml:
Anything else you would like to add:
These are the model files in my Storage Account (see screenshot):
![Image from kserve](https://github.com/kserve/kserve/assets/22344801/612c580c-8c3a-4580-9fa2-f35575338964)
Environment:
OS (e.g. from /etc/os-release): Ubuntu 22.04