Closed. DnPlas closed this issue 3 months ago.
Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6165.
This message was autogenerated
Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6170.
This message was autogenerated
I created a notebook with the proxy PodDefault and configured the kserve and knative charms for the proxy as instructed in the README.md.
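For context, pointing the knative charm at the proxy can be sketched as below. This is an assumption-heavy sketch: the option names (`http-proxy`, `https-proxy`, `no-proxy`) are inferred from the charm configs discussed later in this issue, and `squid.internal:3128` is a placeholder proxy endpoint, so check your charm's actual config options and proxy address:

```shell
# Sketch: route knative-serving egress through a proxy.
# Option names and the proxy URL are assumptions; verify against
# `juju config knative-serving` in your own deployment.
juju config knative-serving \
  http-proxy="http://squid.internal:3128" \
  https-proxy="http://squid.internal:3128" \
  no-proxy="localhost,127.0.0.1"
```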
On the first attempt at running the mlflow-kserve UATs, this cell produced an error:
run, model_uri = experiment(0.5, 0.5)
The error is:
2024/08/29 09:15:11 WARNING mlflow.utils.autologging_utils: Encountered unexpected error during sklearn autologging: Failed to upload /tmp/tmpp1qrvqxw/estimator.html to mlflow/1/bdea74c254e847229c960315da901ec9/artifacts/estimator.html: An error occurred (503) when calling the PutObject operation (reached max retries: 4): Service Unavailable
/opt/conda/lib/python3.11/site-packages/_distutils_hack/__init__.py:26: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
2024/08/29 09:15:31 INFO mlflow.tracking._tracking_service.client: 🏃 View run wine_models at: http://mlflow-server.kubeflow.svc.cluster.local:5000/#/experiments/1/runs/bdea74c254e847229c960315da901ec9.
2024/08/29 09:15:31 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://mlflow-server.kubeflow.svc.cluster.local:5000/#/experiments/1.
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/boto3/s3/transfer.py:372, in S3Transfer.upload_file(self, filename, bucket, key, callback, extra_args)
371 try:
--> 372 future.result()
373 # If a client error was raised, add the backwards compatibility layer
374 # that raises a S3UploadFailedError. These specific errors were only
375 # ever thrown for upload_parts but now can be thrown for any related
376 # client error.
File /opt/conda/lib/python3.11/site-packages/s3transfer/futures.py:103, in TransferFuture.result(self)
99 try:
100 # Usually the result() method blocks until the transfer is done,
101 # however if a KeyboardInterrupt is raised we want want to exit
102 # out of this and propagate the exception.
--> 103 return self._coordinator.result()
104 except KeyboardInterrupt as e:
File /opt/conda/lib/python3.11/site-packages/s3transfer/futures.py:266, in TransferCoordinator.result(self)
265 if self._exception:
--> 266 raise self._exception
267 return self._result
File /opt/conda/lib/python3.11/site-packages/s3transfer/tasks.py:139, in Task.__call__(self)
138 if not self._transfer_coordinator.done():
--> 139 return self._execute_main(kwargs)
140 except Exception as e:
File /opt/conda/lib/python3.11/site-packages/s3transfer/tasks.py:162, in Task._execute_main(self, kwargs)
160 logger.debug(f"Executing task {self} with kwargs {kwargs_to_display}")
--> 162 return_value = self._main(**kwargs)
163 # If the task is the final task, then set the TransferFuture's
164 # value to the return value from main().
File /opt/conda/lib/python3.11/site-packages/s3transfer/upload.py:764, in PutObjectTask._main(self, client, fileobj, bucket, key, extra_args)
763 with fileobj as body:
--> 764 client.put_object(Bucket=bucket, Key=key, Body=body, **extra_args)
File /opt/conda/lib/python3.11/site-packages/botocore/client.py:565, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
564 # The "self" in this scope is referring to the BaseClient.
--> 565 return self._make_api_call(operation_name, kwargs)
File /opt/conda/lib/python3.11/site-packages/botocore/client.py:1017, in BaseClient._make_api_call(self, operation_name, api_params)
1016 error_class = self.exceptions.from_code(error_code)
-> 1017 raise error_class(parsed_response, operation_name)
1018 else:
ClientError: An error occurred (503) when calling the PutObject operation (reached max retries: 4): Service Unavailable
During handling of the above exception, another exception occurred:
S3UploadFailedError Traceback (most recent call last)
Cell In[10], line 1
----> 1 run, model_uri = experiment(0.5, 0.5)
Cell In[9], line 14, in experiment(alpha, l1_ratio)
11 mlflow.log_metric("mae", mean_absolute_error(test_y, pred_y))
13 signature = infer_signature(test_x, pred_y)
---> 14 result = mlflow.sklearn.log_model(lr, "model", registered_model_name="wine-elasticnet", signature=signature)
15 model_uri = f"{mlflow.get_artifact_uri()}/{result.artifact_path}"
17 return run, model_uri
File /opt/conda/lib/python3.11/site-packages/mlflow/sklearn/__init__.py:412, in log_model(sk_model, artifact_path, conda_env, code_paths, serialization_format, registered_model_name, signature, input_example, await_registration_for, pip_requirements, extra_pip_requirements, pyfunc_predict_fn, metadata)
333 @format_docstring(LOG_MODEL_PARAM_DOCS.format(package_name="scikit-learn"))
334 def log_model(
335 sk_model,
(...)
347 metadata=None,
348 ):
349 """
350 Log a scikit-learn model as an MLflow artifact for the current run. Produces an MLflow Model
351 containing the following flavors:
(...)
410
411 """
--> 412 return Model.log(
413 artifact_path=artifact_path,
414 flavor=mlflow.sklearn,
415 sk_model=sk_model,
416 conda_env=conda_env,
417 code_paths=code_paths,
418 serialization_format=serialization_format,
419 registered_model_name=registered_model_name,
420 signature=signature,
421 input_example=input_example,
422 await_registration_for=await_registration_for,
423 pip_requirements=pip_requirements,
424 extra_pip_requirements=extra_pip_requirements,
425 pyfunc_predict_fn=pyfunc_predict_fn,
426 metadata=metadata,
427 )
File /opt/conda/lib/python3.11/site-packages/mlflow/models/model.py:714, in Model.log(cls, artifact_path, flavor, registered_model_name, await_registration_for, metadata, run_id, resources, **kwargs)
710 if mlflow_model.signature is None and (
711 tracking_uri == "databricks" or get_uri_scheme(tracking_uri) == "databricks"
712 ):
713 _logger.warning(_LOG_MODEL_MISSING_SIGNATURE_WARNING)
--> 714 mlflow.tracking.fluent.log_artifacts(local_path, mlflow_model.artifact_path, run_id)
716 # if the model_config kwarg is passed in, then log the model config as an params
717 if model_config := kwargs.get("model_config"):
File /opt/conda/lib/python3.11/site-packages/mlflow/tracking/fluent.py:1147, in log_artifacts(local_dir, artifact_path, run_id)
1115 """
1116 Log all the contents of a local directory as artifacts of the run. If no run is active,
1117 this method will create a new active run.
(...)
1144 mlflow.log_artifacts(tmp_dir, artifact_path="states")
1145 """
1146 run_id = run_id or _get_or_start_run().info.run_id
-> 1147 MlflowClient().log_artifacts(run_id, local_dir, artifact_path)
File /opt/conda/lib/python3.11/site-packages/mlflow/tracking/client.py:1962, in MlflowClient.log_artifacts(self, run_id, local_dir, artifact_path)
1916 def log_artifacts(
1917 self, run_id: str, local_dir: str, artifact_path: Optional[str] = None
1918 ) -> None:
1919 """Write a directory of files to the remote ``artifact_uri``.
1920
1921 Args:
(...)
1960
1961 """
-> 1962 self._tracking_client.log_artifacts(run_id, local_dir, artifact_path)
File /opt/conda/lib/python3.11/site-packages/mlflow/tracking/_tracking_service/client.py:843, in TrackingServiceClient.log_artifacts(self, run_id, local_dir, artifact_path)
835 def log_artifacts(self, run_id, local_dir, artifact_path=None):
836 """Write a directory of files to the remote ``artifact_uri``.
837
838 Args:
(...)
841
842 """
--> 843 self._get_artifact_repo(run_id).log_artifacts(local_dir, artifact_path)
File /opt/conda/lib/python3.11/site-packages/mlflow/store/artifact/s3_artifact_repo.py:194, in S3ArtifactRepository.log_artifacts(self, local_dir, artifact_path)
191 upload_path = posixpath.join(dest_path, rel_path)
193 for f in filenames:
--> 194 self._upload_file(
195 s3_client=s3_client,
196 local_file=os.path.join(root, f),
197 bucket=bucket,
198 key=posixpath.join(upload_path, f),
199 )
File /opt/conda/lib/python3.11/site-packages/mlflow/store/artifact/s3_artifact_repo.py:169, in S3ArtifactRepository._upload_file(self, s3_client, local_file, bucket, key)
167 if environ_extra_args is not None:
168 extra_args.update(environ_extra_args)
--> 169 s3_client.upload_file(Filename=local_file, Bucket=bucket, Key=key, ExtraArgs=extra_args)
File /opt/conda/lib/python3.11/site-packages/boto3/s3/inject.py:145, in upload_file(self, Filename, Bucket, Key, ExtraArgs, Callback, Config)
110 """Upload a file to an S3 object.
111
112 Usage::
(...)
142 transfer.
143 """
144 with S3Transfer(self, Config) as transfer:
--> 145 return transfer.upload_file(
146 filename=Filename,
147 bucket=Bucket,
148 key=Key,
149 extra_args=ExtraArgs,
150 callback=Callback,
151 )
File /opt/conda/lib/python3.11/site-packages/boto3/s3/transfer.py:378, in S3Transfer.upload_file(self, filename, bucket, key, callback, extra_args)
373 # If a client error was raised, add the backwards compatibility layer
374 # that raises a S3UploadFailedError. These specific errors were only
375 # ever thrown for upload_parts but now can be thrown for any related
376 # client error.
377 except ClientError as e:
--> 378 raise S3UploadFailedError(
379 "Failed to upload {} to {}: {}".format(
380 filename, '/'.join([bucket, key]), e
381 )
382 )
S3UploadFailedError: Failed to upload /tmp/tmpleaxoi4f/model/model.pkl to mlflow/1/bdea74c254e847229c960315da901ec9/artifacts/model/model.pkl: An error occurred (503) when calling the PutObject operation (reached max retries: 4): Service Unavailable
After discussing with @orfeas-k, we saw that requests to the mlflow-minio.kubeflow service were going through the proxy, so they could never reach the minio service inside the cluster. To prevent these requests from going through the proxy, we simply added .kubeflow to the no_proxy values. See also Orfeas's comment in https://github.com/canonical/charmed-kubeflow-uats/issues/109#issuecomment-2317578876.
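The no_proxy change can be sketched as the environment the notebook pod should end up with. The proxy address below is a placeholder, not taken from this issue; only the `.kubeflow` entry in `no_proxy` is the actual fix:

```shell
# Hypothetical proxy endpoint; substitute your environment's proxy URL.
export http_proxy="http://squid.internal:3128"
export https_proxy="http://squid.internal:3128"

# Appending .kubeflow makes hosts in the kubeflow namespace
# (e.g. mlflow-minio.kubeflow) bypass the proxy and be reached in-cluster.
export no_proxy="localhost,127.0.0.1,.kubeflow"
```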
After the S3 storage part was unblocked, I saw an error in the InferenceService step: it never became Ready.
The description of the isvc pod was:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m11s default-scheduler Successfully assigned admin/wine-regressor3-predictor-00001-deployment-646f595dbd-wgws6 to ip-10-0-141-126
Normal Pulled 2m31s (x5 over 5m10s) kubelet Container image "charmedkubeflow/storage-initializer:0.13.0-70e4564" already present on machine
Normal Created 2m31s (x5 over 5m10s) kubelet Created container storage-initializer
Normal Started 2m31s (x5 over 5m10s) kubelet Started container storage-initializer
Warning BackOff 6s (x16 over 4m35s) kubelet Back-off restarting failed container storage-initializer in pod wine-regressor3-predictor-00001-deployment-646f595dbd-wgws6_admin(362807dc-3b09-4e5b-af32-e7349faab995)
The init container storage-initializer was in BackOff and constantly restarting.
The logs from the init container:
kubectl logs -n admin wine-regressor3-predictor-00001-deployment-646f595dbd-wgws6 -c storage-initializer
2024-08-29T12:29:45.407Z [pebble] Started daemon.
2024-08-29T12:29:45.416Z [pebble] POST /v1/services 4.40858ms 202
2024-08-29T12:29:45.420Z [pebble] Service "storage-initializer" starting: /storage-initializer/scripts/initializer-entrypoint [ s3://mlflow/1/0e802ccc22e64c6bb2615d1f2e246b6a/artifacts/model /mnt/models ]
2024-08-29T12:29:46.426Z [pebble] GET /v1/changes/1/wait 1.009915388s 200
2024-08-29T12:29:46.427Z [pebble] Started default services with change 1.
2024-08-29T12:29:53.403Z [storage-initializer] 2024-08-29 12:29:53.403 15 kserve INFO [initializer-entrypoint:<module>():16] Initializing, args: src_uri [s3://mlflow/1/0e802ccc22e64c6bb2615d1f2e246b6a/artifacts/model] dest_path[ [/mnt/models]
2024-08-29T12:29:53.404Z [storage-initializer] 2024-08-29 12:29:53.403 15 kserve INFO [storage.py:download():66] Copying contents of s3://mlflow/1/0e802ccc22e64c6bb2615d1f2e246b6a/artifacts/model to local
The init container is stuck copying the contents from the S3 storage.
To resolve this, we need to tell the init container to exclude requests to .kubeflow from going through the proxy. This can now be done by configuring the new no-proxy charm config in kserve-controller. This charm config was recently added as part of https://github.com/canonical/knative-operators/issues/204.
Context
When running the MLflow UATs from inside a Notebook, we need to be able to run the UATs behind a proxy.
What needs to get done
Based on the exploration done in canonical/bundle-kubeflow#76:
Definition of Done
UATs can be run behind a proxy from inside a Notebook.