Closed NohaIhab closed 3 months ago
Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6041.
This message was autogenerated
The knative service is failing to start because it is unable to resolve the image digest; we expect this is because the serving controller cannot reach the internet to get the digest. What we can do here:
1. configure the proxy env vars in the serving controller, so it can resolve the digest through the proxy
2. skip tag-to-digest resolution for the image
I suggest we go with (1) because skipping tag resolution is rather a workaround, not a solution. We should have our charm expose the configuration of proxy env vars in the serving controller.
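For completeness, a rough sketch of what the workaround in (2) might look like in the KnativeServing CR; this assumes Knative's config-deployment key registries-skipping-tag-resolving and an illustrative registry value, so both should be verified against the Knative Serving docs before use:
# Illustrative only: skip tag-to-digest resolution for a registry via the
# KnativeServing CR, instead of giving the controller proxy access.
spec:
  config:
    deployment:
      # assumed key name; comma-separated registries to skip resolution for
      registries-skipping-tag-resolving: "docker.io, index.docker.io"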
To configure the proxy envs in the serving controller, I modified the KnativeServing CR manifest in the knative-serving charm as follows:
--- a/charms/knative-serving/src/manifests/KnativeServing.yaml.j2
+++ b/charms/knative-serving/src/manifests/KnativeServing.yaml.j2
@@ -6,6 +6,17 @@ metadata:
namespace: {{ serving_namespace }}
spec:
version: {{ serving_version }}
+ workloads:
+ - name: controller
+ env:
+ - container: controller
+ envVars:
+ - name: HTTP_PROXY
+ value: http://10.0.13.50:3128
+ - name: HTTPS_PROXY
+ value: http://10.0.13.50:3128
+ - name: NO_PROXY
+ value: 10.152.183.0/24
config:
deployment:
progress-deadline: {{ progress_deadline}}
where:
- HTTP_PROXY and HTTPS_PROXY have the value of the proxy server
- NO_PROXY has the value of the service cluster IP range:
cat /var/snap/microk8s/current/args/kube-apiserver | grep service-cluster-ip-range
--service-cluster-ip-range=10.152.183.0/24
Then I:
- built the knative-serving charm
- refreshed the knative-serving charm from latest/beta to the local one
- re-applied the ISVC definition
The ksvc is now in an Unknown state with Reason RevisionMissing:
Name: sklearn-iris-predictor
Namespace: admin
Labels: component=predictor
notebook-proxy=true
serving.kserve.io/inferenceservice=sklearn-iris
Annotations: serving.knative.dev/creator: system:serviceaccount:kubeflow:kserve-controller
serving.knative.dev/lastModifier: system:serviceaccount:kubeflow:kserve-controller
API Version: serving.knative.dev/v1
Kind: Service
Metadata:
Creation Timestamp: 2024-07-24T10:42:30Z
Generation: 1
Owner References:
API Version: serving.kserve.io/v1beta1
Block Owner Deletion: true
Controller: true
Kind: InferenceService
Name: sklearn-iris
UID: 3a249bb2-f2f5-492d-a48f-7ac5943822dc
Resource Version: 87248
UID: aa001a12-98e7-4c22-ba3d-922c23b91215
Spec:
Template:
Metadata:
Annotations:
autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/min-scale: 1
internal.serving.kserve.io/storage-initializer-sourceuri: gs://kfserving-examples/models/sklearn/1.0/model
prometheus.kserve.io/path: /metrics
prometheus.kserve.io/port: 8080
sidecar.istio.io/inject: false
Creation Timestamp: <nil>
Labels:
Component: predictor
Notebook - Proxy: true
serving.kserve.io/inferenceservice: sklearn-iris
Spec:
Container Concurrency: 0
Containers:
Args:
--model_name=sklearn-iris
--model_dir=/mnt/models
--http_port=8080
Image: charmedkubeflow/sklearnserver:0.13.0-119414c
Name: kserve-container
Readiness Probe:
Success Threshold: 1
Tcp Socket:
Port: 0
Resources:
Limits:
Cpu: 1
Memory: 2Gi
Requests:
Cpu: 1
Memory: 2Gi
Enable Service Links: false
Timeout Seconds: 300
Traffic:
Latest Revision: true
Percent: 100
Status:
Conditions:
Last Transition Time: 2024-07-24T10:42:31Z
Status: Unknown
Type: ConfigurationsReady
Last Transition Time: 2024-07-24T10:42:31Z
Message: Configuration "sklearn-iris-predictor" is waiting for a Revision to become ready.
Reason: RevisionMissing
Status: Unknown
Type: Ready
Last Transition Time: 2024-07-24T10:42:31Z
Message: Configuration "sklearn-iris-predictor" is waiting for a Revision to become ready.
Reason: RevisionMissing
Status: Unknown
Type: RoutesReady
Latest Created Revision Name: sklearn-iris-predictor-00001
Observed Generation: 1
URL: http://sklearn-iris-predictor.admin.10.64.140.43.nip.io
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 5m3s service-controller Created Configuration "sklearn-iris-predictor"
Normal Created 5m3s service-controller Created Route "sklearn-iris-predictor"
so the tag resolution for the image was successful in this case and the error from the knative serving controller is gone, but the ksvc is still not Ready and the kserve UAT fails.
Looking at the pod of the inference service, it is stuck with Init:0/1 status:
sklearn-iris-predictor-00001-deployment-7c5d5b6478-2kmkj 0/2 Init:0/1 0 8m12s
The storage-initializer init container of the inference pod never completes. Looking at the logs of the storage-initializer container:
kubectl logs -n admin sklearn-iris-predictor-00001-deployment-7c5d5b6478-2kmkj -c storage-initializer
2024-07-25T09:25:22.829Z [pebble] Started daemon.
2024-07-25T09:25:22.844Z [pebble] POST /v1/services 8.790655ms 202
2024-07-25T09:25:22.849Z [pebble] Service "storage-initializer" starting: /storage-initializer/scripts/initializer-entrypoint [ gs://kfserving-examples/models/sklearn/1.0/model /mnt/models ]
2024-07-25T09:25:23.858Z [pebble] GET /v1/changes/1/wait 1.013923691s 200
2024-07-25T09:25:23.859Z [pebble] Started default services with change 1.
2024-07-25T09:25:33.450Z [storage-initializer] 2024-07-25 09:25:33.450 14 kserve INFO [initializer-entrypoint:<module>():16] Initializing, args: src_uri [gs://kfserving-examples/models/sklearn/1.0/model] dest_path[ [/mnt/models]
2024-07-25T09:25:33.450Z [storage-initializer] 2024-07-25 09:25:33.450 14 kserve INFO [storage.py:download():66] Copying contents of gs://kfserving-examples/models/sklearn/1.0/model to local
It is stuck at "Copying contents of gs://kfserving-examples/models/sklearn/1.0/model to local", i.e. downloading the model from the model registry. Eventually, the pod dies and the inference deployment is stuck at 0/0:
NAME READY UP-TO-DATE AVAILABLE AGE
sklearn-iris-predictor-00001-deployment 0/0 0 0 29m
Describing the isvc after the pod is gone:
kubectl describe isvc -n admin
Name: sklearn-iris
Namespace: admin
Labels: notebook-proxy=true
Annotations: sidecar.istio.io/inject: false
API Version: serving.kserve.io/v1beta1
Kind: InferenceService
Metadata:
Creation Timestamp: 2024-07-25T09:25:20Z
Finalizers:
inferenceservice.finalizers
Generation: 1
Resource Version: 822860
UID: cd8d81c7-8005-4ddd-bc01-890c092d949e
Spec:
Predictor:
Model:
Model Format:
Name: sklearn
Name:
Resources:
Storage Uri: gs://kfserving-examples/models/sklearn/1.0/model
Status:
Components:
Predictor:
Latest Created Revision: sklearn-iris-predictor-00001
Conditions:
Last Transition Time: 2024-07-25T09:35:22Z
Reason: PredictorConfigurationReady not ready
Severity: Info
Status: False
Type: LatestDeploymentReady
Last Transition Time: 2024-07-25T09:41:36Z
Message: Revision "sklearn-iris-predictor-00001" failed with message: Initial scale was never achieved.
Reason: RevisionFailed
Severity: Info
Status: False
Type: PredictorConfigurationReady
Last Transition Time: 2024-07-25T09:35:22Z
Message: Configuration "sklearn-iris-predictor" does not have any ready Revision.
Reason: RevisionMissing
Status: False
Type: PredictorReady
Last Transition Time: 2024-07-25T09:35:22Z
Message: Configuration "sklearn-iris-predictor" does not have any ready Revision.
Reason: RevisionMissing
Severity: Info
Status: False
Type: PredictorRouteReady
Last Transition Time: 2024-07-25T09:35:22Z
Message: Configuration "sklearn-iris-predictor" does not have any ready Revision.
Reason: RevisionMissing
Status: False
Type: Ready
Last Transition Time: 2024-07-25T09:35:22Z
Reason: PredictorRouteReady not ready
Severity: Info
Status: False
Type: RoutesReady
Model Status:
Last Failure Info:
Exit Code: 10
Message:
2024-07-25T09:41:34.445Z [storage-initializer] response = self._get_next_page_response()
2024-07-25T09:41:34.445Z [storage-initializer] File "/usr/local/lib/python3.10/dist-packages/google/api_core/page_iterator.py", line 432, in _get_next_page_response
2024-07-25T09:41:34.445Z [storage-initializer] return self.api_request(
2024-07-25T09:41:34.445Z [storage-initializer] File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/_http.py", line 78, in api_request
2024-07-25T09:41:34.445Z [storage-initializer] return call()
2024-07-25T09:41:34.445Z [storage-initializer] File "/usr/local/lib/python3.10/dist-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
2024-07-25T09:41:34.445Z [storage-initializer] return retry_target(
2024-07-25T09:41:34.445Z [storage-initializer] File "/usr/local/lib/python3.10/dist-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
2024-07-25T09:41:34.445Z [storage-initializer] _retry_error_helper(
2024-07-25T09:41:34.445Z [storage-initializer] File "/usr/local/lib/python3.10/dist-packages/google/api_core/retry/retry_base.py", line 221, in _retry_error_helper
2024-07-25T09:41:34.445Z [storage-initializer] raise final_exc from source_exc
2024-07-25T09:41:34.445Z [storage-initializer] google.api_core.exceptions.RetryError: Timeout of 120.0s exceeded, last exception: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /storage/v1/b/kfserving-examples/o?projection=noAcl&prefix=models%2Fsklearn%2F1.0%2Fmodel%2F&prettyPrint=false (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f631010b160>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
2024-07-25T09:41:35.305Z [pebble] Service "storage-initializer" stopped unexpectedly with code 1
2024-07-25T09:41:35.305Z [pebble] Service "storage-initializer" on-failure action is "shutdown", triggering failure shutdown
2024-07-25T09:41:35.305Z [pebble] Server exiting!
We can see errors from the storage-initializer
container, specifically:
2024-07-25T09:41:34.445Z [storage-initializer] google.api_core.exceptions.RetryError: Timeout of 120.0s exceeded, last exception: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /storage/v1/b/kfserving-examples/o?projection=noAcl&prefix=models%2Fsklearn%2F1.0%2Fmodel%2F&prettyPrint=false (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f631010b160>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
This looks like another internet access issue, where the container cannot download the model artifact.
To resolve this, the storage-initializer container needs to be able to get through the proxy, so it can reach outside the machine to the model registry. In this case, we need to pass the proxy envs to the storage-initializer container. I looked it up and saw there's a similar issue in the kserve repo: https://github.com/kserve/kserve/issues/1348, where it says that KServe 0.11.1 introduced ClusterStorageContainer, which allows modifying the storage-initializer spec.
In our kserve-controller charm, we are creating the ClusterStorageContainer CR using the cluster_storage_containers.yaml.j2 manifest template. We can extend this template to optionally set the proxy envs under spec.container.
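As an illustration, a minimal sketch of how cluster_storage_containers.yaml.j2 could render the proxy envs only when they are provided; the http_proxy, https_proxy, and no_proxy template variables are hypothetical names for values that would come from the charm config:
# Hypothetical extension of cluster_storage_containers.yaml.j2: render the env
# block only when proxy values are set in the charm config.
spec:
  container:
    image: {{ configmap__storageInitializer }}
    name: storage-initializer
{% if http_proxy or https_proxy or no_proxy %}
    env:
{% if http_proxy %}
      - name: HTTP_PROXY
        value: {{ http_proxy }}
{% endif %}
{% if https_proxy %}
      - name: HTTPS_PROXY
        value: {{ https_proxy }}
{% endif %}
{% if no_proxy %}
      - name: NO_PROXY
        value: {{ no_proxy }}
{% endif %}
{% endif %}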
To test this possible fix, I did the following:
- in the kserve-operators repo, modified cluster_storage_containers.yaml.j2 with the diff:
@@ -6,6 +6,13 @@ spec:
container:
image: {{ configmap__storageInitializer }}
name: storage-initializer
+ env:
+ - name: HTTP_PROXY
+ value: http://10.0.13.50:3128
+ - name: HTTPS_PROXY
+ value: http://10.0.13.50:3128
+ - name: NO_PROXY
+ value: 10.152.183.0/24
resources:
limits:
cpu: "1"
where:
- HTTP_PROXY and HTTPS_PROXY have the value of the proxy server
- NO_PROXY has the value of the service cluster IP range
Then I:
- built the kserve-controller charm
- refreshed the kserve-controller charm to the local one: juju refresh kserve-controller --path=./kserve-controller_ubuntu-20.04-amd64.charm
- deployed the knative-serving charm with the fix from above
and the isvc is now Ready!
In the successful isvc's pod description, we can see the proxy envs set correctly for the init container:
initContainers:
- args:
- gs://kfserving-examples/models/sklearn/1.0/model
- /mnt/models
env:
- name: HTTP_PROXY
value: http://10.0.13.50:3128
- name: HTTPS_PROXY
value: http://10.0.13.50:3128
- name: NO_PROXY
value: 10.152.183.0/24
image: charmedkubeflow/storage-initializer:0.13.0-70e4564
imagePullPolicy: IfNotPresent
name: storage-initializer
This issue expanded to be 2 issues in fact that are currently blocking serving in CKF from working correctly behind proxy:
1. the Knative Serving controller cannot resolve the image tag to digest
2. the storage-initializer container cannot download the model artifact
Both issues are due to the pods responsible for pulling these artifacts/data not being able to establish a connection to their targets, and thus they need to have the proxy env vars set in order to unblock.
To fix this, we need to:
1. add proxy and no-proxy config to the kserve-controller and knative-serving charms (a sketch of possible options is shown below)
2. modify KnativeServing.yaml.j2 in the knative-serving charm and cluster_storage_containers.yaml.j2 in the kserve-controller charm to have the proxy env vars set optionally, if they are set in the charm config.
@NohaIhab excellent job getting to the bottom of it and documenting everything!
I also agree with adding config options for the proxy to the kserve-controller and knative-serving charms, which will end up altering the applied manifests. It's actually quite nice to see the charms abstract this and only expose a lean proxy configuration to the user!
My only concern is whether we should have just one proxy config option, or make it 1-1 with the env vars and introduce the following config options: http-proxy, https-proxy, no-proxy.
@kimwnasptd good point, I think we can make it even clearer by specifying which container this proxy env is being set for. For example, in knative-serving, make the config names controller-http-proxy, controller-https-proxy, and controller-no-proxy.
In kserve-controller, it can be storage-initializer-http-proxy, etc.
IMO sticking to just expressing the functionality, and not also including the component, in the config name will provide a better UX.
The users don't necessarily care (AFAIK) about configuring different proxy settings for the different components of the same Knative charm. They'd care about telling the charm what the proxy values are, and then it's up to the charm to do all the configuration wherever necessary.
The above is for a hypothetical scenario where, in the future, we might need storage-initializer-http-proxy and queue-http-proxy (random thought) for the kserve-controller charm. Users wouldn't care to put different values for the different sub-components; they would only want to pass them once to the kserve-controller charm and have the charm configure everything.
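To illustrate that UX, a hypothetical deploy-time overlay where the user passes the proxy values once per charm and each charm wires them into whichever containers need them (option names as discussed above, values from this environment):
# Hypothetical bundle overlay: one set of proxy options per charm
applications:
  knative-serving:
    options:
      http-proxy: http://10.0.13.50:3128
      https-proxy: http://10.0.13.50:3128
      no-proxy: 10.152.183.0/24
  kserve-controller:
    options:
      http-proxy: http://10.0.13.50:3128
      https-proxy: http://10.0.13.50:3128
      no-proxy: 10.152.183.0/24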
Closed by above linked PRs
Bug Description
While working on https://github.com/canonical/charmed-kubeflow-uats/issues/75, the kserve UAT fails behind proxy due to the Knative Service failing to start. The Knative Service is not Ready, with the RevisionFailed reason. The logs show that the Knative Serving controller is failing to resolve the image tag to digest for the charmedkubeflow/sklearnserver:0.11.2-e54c69e image.
To Reproduce
Deploy the kubeflow bundle 1.9/beta behind proxy.
Environment
microk8s 1.29-strict/stable
juju 3.4.4
CKF 1.9/beta
Relevant Log Output
Additional Context
No response