canonical / charmed-kubeflow-uats

Automated UATs for Charmed Kubeflow
Apache License 2.0

Configure mlflow-kserve UATs to run behind a proxy #110

Closed DnPlas closed 3 weeks ago

DnPlas commented 3 weeks ago

Context

When running the MLflow UATs from inside a Notebook, we need to be able to run them behind a proxy.

What needs to get done

Based on the exploration done in canonical/bundle-kubeflow#76 :

Definition of Done

UATs can be run behind a proxy from inside a Notebook

syncronize-issues-to-jira[bot] commented 3 weeks ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6165.

This message was autogenerated

syncronize-issues-to-jira[bot] commented 3 weeks ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6170.

This message was autogenerated

NohaIhab commented 3 weeks ago

I created a notebook with the proxy PodDefault and configured the kserve and knative charms for proxy as instructed in the README.md.

On the first attempt at running the mlflow-kserve UATs, this cell produced an error:

run, model_uri = experiment(0.5, 0.5)

The error is:

2024/08/29 09:15:11 WARNING mlflow.utils.autologging_utils: Encountered unexpected error during sklearn autologging: Failed to upload /tmp/tmpp1qrvqxw/estimator.html to mlflow/1/bdea74c254e847229c960315da901ec9/artifacts/estimator.html: An error occurred (503) when calling the PutObject operation (reached max retries: 4): Service Unavailable
/opt/conda/lib/python3.11/site-packages/_distutils_hack/__init__.py:26: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
2024/08/29 09:15:31 INFO mlflow.tracking._tracking_service.client: 🏃 View run wine_models at: http://mlflow-server.kubeflow.svc.cluster.local:5000/#/experiments/1/runs/bdea74c254e847229c960315da901ec9.
2024/08/29 09:15:31 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://mlflow-server.kubeflow.svc.cluster.local:5000/#/experiments/1.

---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
File /opt/conda/lib/python3.11/site-packages/boto3/s3/transfer.py:372, in S3Transfer.upload_file(self, filename, bucket, key, callback, extra_args)
    371 try:
--> 372     future.result()
    373 # If a client error was raised, add the backwards compatibility layer
    374 # that raises a S3UploadFailedError. These specific errors were only
    375 # ever thrown for upload_parts but now can be thrown for any related
    376 # client error.

File /opt/conda/lib/python3.11/site-packages/s3transfer/futures.py:103, in TransferFuture.result(self)
     99 try:
    100     # Usually the result() method blocks until the transfer is done,
    101     # however if a KeyboardInterrupt is raised we want want to exit
    102     # out of this and propagate the exception.
--> 103     return self._coordinator.result()
    104 except KeyboardInterrupt as e:

File /opt/conda/lib/python3.11/site-packages/s3transfer/futures.py:266, in TransferCoordinator.result(self)
    265 if self._exception:
--> 266     raise self._exception
    267 return self._result

File /opt/conda/lib/python3.11/site-packages/s3transfer/tasks.py:139, in Task.__call__(self)
    138     if not self._transfer_coordinator.done():
--> 139         return self._execute_main(kwargs)
    140 except Exception as e:

File /opt/conda/lib/python3.11/site-packages/s3transfer/tasks.py:162, in Task._execute_main(self, kwargs)
    160 logger.debug(f"Executing task {self} with kwargs {kwargs_to_display}")
--> 162 return_value = self._main(**kwargs)
    163 # If the task is the final task, then set the TransferFuture's
    164 # value to the return value from main().

File /opt/conda/lib/python3.11/site-packages/s3transfer/upload.py:764, in PutObjectTask._main(self, client, fileobj, bucket, key, extra_args)
    763 with fileobj as body:
--> 764     client.put_object(Bucket=bucket, Key=key, Body=body, **extra_args)

File /opt/conda/lib/python3.11/site-packages/botocore/client.py:565, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    564 # The "self" in this scope is referring to the BaseClient.
--> 565 return self._make_api_call(operation_name, kwargs)

File /opt/conda/lib/python3.11/site-packages/botocore/client.py:1017, in BaseClient._make_api_call(self, operation_name, api_params)
   1016     error_class = self.exceptions.from_code(error_code)
-> 1017     raise error_class(parsed_response, operation_name)
   1018 else:

ClientError: An error occurred (503) when calling the PutObject operation (reached max retries: 4): Service Unavailable

During handling of the above exception, another exception occurred:

S3UploadFailedError                       Traceback (most recent call last)
Cell In[10], line 1
----> 1 run, model_uri = experiment(0.5, 0.5)

Cell In[9], line 14, in experiment(alpha, l1_ratio)
     11         mlflow.log_metric("mae", mean_absolute_error(test_y, pred_y))
     13         signature = infer_signature(test_x, pred_y)
---> 14         result = mlflow.sklearn.log_model(lr, "model", registered_model_name="wine-elasticnet", signature=signature)
     15         model_uri = f"{mlflow.get_artifact_uri()}/{result.artifact_path}"
     17 return run, model_uri

File /opt/conda/lib/python3.11/site-packages/mlflow/sklearn/__init__.py:412, in log_model(sk_model, artifact_path, conda_env, code_paths, serialization_format, registered_model_name, signature, input_example, await_registration_for, pip_requirements, extra_pip_requirements, pyfunc_predict_fn, metadata)
    333 @format_docstring(LOG_MODEL_PARAM_DOCS.format(package_name="scikit-learn"))
    334 def log_model(
    335     sk_model,
   (...)
    347     metadata=None,
    348 ):
    349     """
    350     Log a scikit-learn model as an MLflow artifact for the current run. Produces an MLflow Model
    351     containing the following flavors:
   (...)
    410 
    411     """
--> 412     return Model.log(
    413         artifact_path=artifact_path,
    414         flavor=mlflow.sklearn,
    415         sk_model=sk_model,
    416         conda_env=conda_env,
    417         code_paths=code_paths,
    418         serialization_format=serialization_format,
    419         registered_model_name=registered_model_name,
    420         signature=signature,
    421         input_example=input_example,
    422         await_registration_for=await_registration_for,
    423         pip_requirements=pip_requirements,
    424         extra_pip_requirements=extra_pip_requirements,
    425         pyfunc_predict_fn=pyfunc_predict_fn,
    426         metadata=metadata,
    427     )

File /opt/conda/lib/python3.11/site-packages/mlflow/models/model.py:714, in Model.log(cls, artifact_path, flavor, registered_model_name, await_registration_for, metadata, run_id, resources, **kwargs)
    710 if mlflow_model.signature is None and (
    711     tracking_uri == "databricks" or get_uri_scheme(tracking_uri) == "databricks"
    712 ):
    713     _logger.warning(_LOG_MODEL_MISSING_SIGNATURE_WARNING)
--> 714 mlflow.tracking.fluent.log_artifacts(local_path, mlflow_model.artifact_path, run_id)
    716 # if the model_config kwarg is passed in, then log the model config as an params
    717 if model_config := kwargs.get("model_config"):

File /opt/conda/lib/python3.11/site-packages/mlflow/tracking/fluent.py:1147, in log_artifacts(local_dir, artifact_path, run_id)
   1115 """
   1116 Log all the contents of a local directory as artifacts of the run. If no run is active,
   1117 this method will create a new active run.
   (...)
   1144             mlflow.log_artifacts(tmp_dir, artifact_path="states")
   1145 """
   1146 run_id = run_id or _get_or_start_run().info.run_id
-> 1147 MlflowClient().log_artifacts(run_id, local_dir, artifact_path)

File /opt/conda/lib/python3.11/site-packages/mlflow/tracking/client.py:1962, in MlflowClient.log_artifacts(self, run_id, local_dir, artifact_path)
   1916 def log_artifacts(
   1917     self, run_id: str, local_dir: str, artifact_path: Optional[str] = None
   1918 ) -> None:
   1919     """Write a directory of files to the remote ``artifact_uri``.
   1920 
   1921     Args:
   (...)
   1960 
   1961     """
-> 1962     self._tracking_client.log_artifacts(run_id, local_dir, artifact_path)

File /opt/conda/lib/python3.11/site-packages/mlflow/tracking/_tracking_service/client.py:843, in TrackingServiceClient.log_artifacts(self, run_id, local_dir, artifact_path)
    835 def log_artifacts(self, run_id, local_dir, artifact_path=None):
    836     """Write a directory of files to the remote ``artifact_uri``.
    837 
    838     Args:
   (...)
    841 
    842     """
--> 843     self._get_artifact_repo(run_id).log_artifacts(local_dir, artifact_path)

File /opt/conda/lib/python3.11/site-packages/mlflow/store/artifact/s3_artifact_repo.py:194, in S3ArtifactRepository.log_artifacts(self, local_dir, artifact_path)
    191     upload_path = posixpath.join(dest_path, rel_path)
    193 for f in filenames:
--> 194     self._upload_file(
    195         s3_client=s3_client,
    196         local_file=os.path.join(root, f),
    197         bucket=bucket,
    198         key=posixpath.join(upload_path, f),
    199     )

File /opt/conda/lib/python3.11/site-packages/mlflow/store/artifact/s3_artifact_repo.py:169, in S3ArtifactRepository._upload_file(self, s3_client, local_file, bucket, key)
    167 if environ_extra_args is not None:
    168     extra_args.update(environ_extra_args)
--> 169 s3_client.upload_file(Filename=local_file, Bucket=bucket, Key=key, ExtraArgs=extra_args)

File /opt/conda/lib/python3.11/site-packages/boto3/s3/inject.py:145, in upload_file(self, Filename, Bucket, Key, ExtraArgs, Callback, Config)
    110 """Upload a file to an S3 object.
    111 
    112 Usage::
   (...)
    142     transfer.
    143 """
    144 with S3Transfer(self, Config) as transfer:
--> 145     return transfer.upload_file(
    146         filename=Filename,
    147         bucket=Bucket,
    148         key=Key,
    149         extra_args=ExtraArgs,
    150         callback=Callback,
    151     )

File /opt/conda/lib/python3.11/site-packages/boto3/s3/transfer.py:378, in S3Transfer.upload_file(self, filename, bucket, key, callback, extra_args)
    373 # If a client error was raised, add the backwards compatibility layer
    374 # that raises a S3UploadFailedError. These specific errors were only
    375 # ever thrown for upload_parts but now can be thrown for any related
    376 # client error.
    377 except ClientError as e:
--> 378     raise S3UploadFailedError(
    379         "Failed to upload {} to {}: {}".format(
    380             filename, '/'.join([bucket, key]), e
    381         )
    382     )

S3UploadFailedError: Failed to upload /tmp/tmpleaxoi4f/model/model.pkl to mlflow/1/bdea74c254e847229c960315da901ec9/artifacts/model/model.pkl: An error occurred (503) when calling the PutObject operation (reached max retries: 4): Service Unavailable

NohaIhab commented 3 weeks ago

S3 Storage

After discussing with @orfeas-k, we saw that requests to the mlflow-minio.kubeflow service were going through the proxy, so they could not reach the MinIO service inside the cluster. To keep these requests from going through the proxy, we added .kubeflow to the no_proxy values. See also Orfeas's comment in https://github.com/canonical/charmed-kubeflow-uats/issues/109#issuecomment-2317578876.
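
To illustrate why adding .kubeflow helps, here is a minimal Python sketch of suffix-style no_proxy matching, using the standard library's urllib.request.proxy_bypass_environment. The helper name bypasses_proxy and the example.com host are made up for illustration, and other HTTP clients may apply their own matching rules, so treat this as an approximation rather than the exact logic used by the S3 client in the notebook.

```python
from urllib.request import proxy_bypass_environment


def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if `host` would skip the proxy for the given no_proxy value.

    `bypasses_proxy` is a hypothetical helper; it passes the no_proxy string
    explicitly instead of reading it from the process environment.
    """
    return bool(proxy_bypass_environment(host, proxies={"no": no_proxy}))


# Service name taken from the discussion above.
print(bypasses_proxy("mlflow-minio.kubeflow", "localhost,127.0.0.1"))
print(bypasses_proxy("mlflow-minio.kubeflow", "localhost,127.0.0.1,.kubeflow"))
# Hypothetical external host that should still go through the proxy.
print(bypasses_proxy("example.com", "localhost,127.0.0.1,.kubeflow"))
```

Under this matching rule, a .kubeflow entry covers mlflow-minio.kubeflow but would not cover a fully qualified name such as mlflow-minio.kubeflow.svc.cluster.local, which would need its own entry.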

KServe

After the S3 Storage part was unblocked, I saw an error in the Inference Service step: the InferenceService never became Ready. The description of the isvc pod was:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  5m11s                  default-scheduler  Successfully assigned admin/wine-regressor3-predictor-00001-deployment-646f595dbd-wgws6 to ip-10-0-141-126
  Normal   Pulled     2m31s (x5 over 5m10s)  kubelet            Container image "charmedkubeflow/storage-initializer:0.13.0-70e4564" already present on machine
  Normal   Created    2m31s (x5 over 5m10s)  kubelet            Created container storage-initializer
  Normal   Started    2m31s (x5 over 5m10s)  kubelet            Started container storage-initializer
  Warning  BackOff    6s (x16 over 4m35s)    kubelet            Back-off restarting failed container storage-initializer in pod wine-regressor3-predictor-00001-deployment-646f595dbd-wgws6_admin(362807dc-3b09-4e5b-af32-e7349faab995)

The init container storage-initializer was in BackOff and constantly restarting. The logs from the init container:

kubectl logs -n admin wine-regressor3-predictor-00001-deployment-646f595dbd-wgws6 -c storage-initializer
2024-08-29T12:29:45.407Z [pebble] Started daemon.
2024-08-29T12:29:45.416Z [pebble] POST /v1/services 4.40858ms 202
2024-08-29T12:29:45.420Z [pebble] Service "storage-initializer" starting: /storage-initializer/scripts/initializer-entrypoint [ s3://mlflow/1/0e802ccc22e64c6bb2615d1f2e246b6a/artifacts/model /mnt/models ]
2024-08-29T12:29:46.426Z [pebble] GET /v1/changes/1/wait 1.009915388s 200
2024-08-29T12:29:46.427Z [pebble] Started default services with change 1.
2024-08-29T12:29:53.403Z [storage-initializer] 2024-08-29 12:29:53.403 15 kserve INFO [initializer-entrypoint:<module>():16] Initializing, args: src_uri [s3://mlflow/1/0e802ccc22e64c6bb2615d1f2e246b6a/artifacts/model] dest_path[ [/mnt/models]
2024-08-29T12:29:53.404Z [storage-initializer] 2024-08-29 12:29:53.403 15 kserve INFO [storage.py:download():66] Copying contents of s3://mlflow/1/0e802ccc22e64c6bb2615d1f2e246b6a/artifacts/model to local

The init container is stuck copying the contents from the S3 storage. To resolve this, we need to tell the init container to keep requests to .kubeflow from going through the proxy. This can now be done by configuring the new charm config no-proxy in kserve-controller. This charm config was recently added as part of https://github.com/canonical/knative-operators/issues/204.
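
As a rough sketch of that last step, the Python snippet below adds .kubeflow to an existing no_proxy list and prints the matching juju config command for kserve-controller. The helper names and the existing no_proxy value are hypothetical, and it assumes the no-proxy charm config takes a comma-separated string, so check the charm's documentation for the exact format before using it.

```python
import shlex


def add_no_proxy_entry(current: str, entry: str) -> str:
    """Append `entry` to a comma-separated no_proxy value, once only."""
    entries = [e.strip() for e in current.split(",") if e.strip()]
    if entry not in entries:
        entries.append(entry)
    return ",".join(entries)


def juju_config_command(app: str, key: str, value: str) -> str:
    """Build (but do not run) a `juju config <app> <key>=<value>` command."""
    return shlex.join(["juju", "config", app, f"{key}={value}"])


# Hypothetical existing value; use the no_proxy list from your own environment.
existing = "localhost,127.0.0.1"
value = add_no_proxy_entry(existing, ".kubeflow")
print(juju_config_command("kserve-controller", "no-proxy", value))
```

The command is only printed, not executed, so it can be reviewed before being run against the model.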