canonical / knative-operators

Charmed Knative Operators

KSVC failed to start behind proxy with `RevisionFailed` reason: `failed to resolve image to digest` #204

Closed NohaIhab closed 1 month ago

NohaIhab commented 1 month ago

Bug Description

While working on https://github.com/canonical/charmed-kubeflow-uats/issues/75, the kserve UAT fails behind a proxy because the Knative Service fails to start. The Knative Service is not Ready, with the RevisionFailed reason. The logs show that the Knative Serving controller is failing to resolve the image tag to a digest for the charmedkubeflow/sklearnserver:0.11.2-e54c69e image.

To Reproduce

  1. Deploy kubeflow bundle 1.9/beta behind proxy
  2. Run the KServe UAT
  3. Describe the ksvc

Environment

microk8s 1.29-strict/stable
juju 3.4.4
CKF 1.9/beta

Relevant Log Output

Revision "sklearn-iris-predictor-00001" failed with message: Unable to fetch image "charmedkubeflow/sklearnserver:0.11.2-e54c69e": failed to resolve image to digest: Get "https://index.docker.io/v2/": context deadline exceeded.

Additional Context

No response

syncronize-issues-to-jira[bot] commented 1 month ago

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6041.

This message was autogenerated

NohaIhab commented 1 month ago

The Knative Service is failing to start because it cannot resolve the image tag to a digest; we expect this is because the controller cannot reach the internet to fetch the digest. What we can do here:

  1. Configure the proxy envs in the serving controller - there are some pointers on doing this in the knative docs here and here, OR
  2. Skip tag resolution for the affected registries, to avoid hitting the issue at all

I suggest we go with (1), because skipping tag resolution is a workaround rather than a solution. Our charm should expose the configuration of proxy env vars in the serving controller.
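
For reference, a minimal sketch of what option (2) would look like in the same KnativeServing CR, assuming the config-deployment key registries-skipping-tag-resolving described in the Knative tag-resolution docs (the registry value here is illustrative):

# Sketch of option (2): skip tag-to-digest resolution for the listed
# registries (comma-separated value); the registry name is illustrative.
spec:
  config:
    deployment:
      registries-skipping-tag-resolving: "index.docker.io"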

NohaIhab commented 1 month ago

Experimenting with option (1)

To configure the proxy envs in the serving controller, I:

  1. edited the knative-serving CR manifest as follows:
    --- a/charms/knative-serving/src/manifests/KnativeServing.yaml.j2
    +++ b/charms/knative-serving/src/manifests/KnativeServing.yaml.j2
    @@ -6,6 +6,17 @@ metadata:
       namespace: {{ serving_namespace }}
     spec:
       version: {{ serving_version }}
    +  workloads:
    +  - name: controller
    +    env:
    +    - container: controller
    +      envVars:
    +      - name: HTTP_PROXY
    +        value: http://10.0.13.50:3128
    +      - name: HTTPS_PROXY
    +        value: http://10.0.13.50:3128
    +      - name: NO_PROXY
    +        value: 10.152.183.0/24
       config:
         deployment:
           progress-deadline: {{ progress_deadline}}

    where:

    • HTTP_PROXY and HTTPS_PROXY point at the proxy server
    • NO_PROXY is set to the service cluster IP range, obtained with:
      cat /var/snap/microk8s/current/args/kube-apiserver | grep service-cluster-ip-range
      --service-cluster-ip-range=10.152.183.0/24
  2. re-packed the knative-serving charm
  3. refreshed the knative-serving charm from latest/beta to the local one
  4. Ran the kserve UATs, modified to add the proxy envs to the V1ObjectMeta of the ISVC definition (see the verification sketch after this list)
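
To confirm the override propagated, the rendered controller Deployment can be inspected; below is a sketch of the expected excerpt, assuming the Knative Operator applies spec.workloads env overrides to the named deployment (values as configured above):

# Expected excerpt of the controller Deployment in the serving namespace,
# e.g. from: kubectl get deploy controller -n knative-serving -o yaml
containers:
- name: controller
  env:
  - name: HTTP_PROXY
    value: http://10.0.13.50:3128
  - name: HTTPS_PROXY
    value: http://10.0.13.50:3128
  - name: NO_PROXY
    value: 10.152.183.0/24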

Results

Debugging

Looking at the pod of the inference service, it is stuck with Init:0/1 status:

sklearn-iris-predictor-00001-deployment-7c5d5b6478-2kmkj   0/2     Init:0/1   0             8m12s

The init container storage-initializer of the inference pod never completes. Looking at the logs of the storage-initializer container:

kubectl logs -n admin sklearn-iris-predictor-00001-deployment-7c5d5b6478-2kmkj -c storage-initializer
2024-07-25T09:25:22.829Z [pebble] Started daemon.
2024-07-25T09:25:22.844Z [pebble] POST /v1/services 8.790655ms 202
2024-07-25T09:25:22.849Z [pebble] Service "storage-initializer" starting: /storage-initializer/scripts/initializer-entrypoint [ gs://kfserving-examples/models/sklearn/1.0/model /mnt/models ]
2024-07-25T09:25:23.858Z [pebble] GET /v1/changes/1/wait 1.013923691s 200
2024-07-25T09:25:23.859Z [pebble] Started default services with change 1.
2024-07-25T09:25:33.450Z [storage-initializer] 2024-07-25 09:25:33.450 14 kserve INFO [initializer-entrypoint:<module>():16] Initializing, args: src_uri [gs://kfserving-examples/models/sklearn/1.0/model] dest_path[ [/mnt/models]
2024-07-25T09:25:33.450Z [storage-initializer] 2024-07-25 09:25:33.450 14 kserve INFO [storage.py:download():66] Copying contents of gs://kfserving-examples/models/sklearn/1.0/model to local

It is stuck at Copying contents of gs://kfserving-examples/models/sklearn/1.0/model to local i.e. downloading the model from the model registry. Eventually, the pod dies and the inference deployment is stuck at 0/0:

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
sklearn-iris-predictor-00001-deployment   0/0     0            0           29m

Describing the isvc after the pod is gone:

kubectl describe isvc -n admin
Name:         sklearn-iris
Namespace:    admin
Labels:       notebook-proxy=true
Annotations:  sidecar.istio.io/inject: false
API Version:  serving.kserve.io/v1beta1
Kind:         InferenceService
Metadata:
  Creation Timestamp:  2024-07-25T09:25:20Z
  Finalizers:
    inferenceservice.finalizers
  Generation:        1
  Resource Version:  822860
  UID:               cd8d81c7-8005-4ddd-bc01-890c092d949e
Spec:
  Predictor:
    Model:
      Model Format:
        Name:  sklearn
      Name:    
      Resources:
      Storage Uri:  gs://kfserving-examples/models/sklearn/1.0/model
Status:
  Components:
    Predictor:
      Latest Created Revision:  sklearn-iris-predictor-00001
  Conditions:
    Last Transition Time:  2024-07-25T09:35:22Z
    Reason:                PredictorConfigurationReady not ready
    Severity:              Info
    Status:                False
    Type:                  LatestDeploymentReady
    Last Transition Time:  2024-07-25T09:41:36Z
    Message:               Revision "sklearn-iris-predictor-00001" failed with message: Initial scale was never achieved.
    Reason:                RevisionFailed
    Severity:              Info
    Status:                False
    Type:                  PredictorConfigurationReady
    Last Transition Time:  2024-07-25T09:35:22Z
    Message:               Configuration "sklearn-iris-predictor" does not have any ready Revision.
    Reason:                RevisionMissing
    Status:                False
    Type:                  PredictorReady
    Last Transition Time:  2024-07-25T09:35:22Z
    Message:               Configuration "sklearn-iris-predictor" does not have any ready Revision.
    Reason:                RevisionMissing
    Severity:              Info
    Status:                False
    Type:                  PredictorRouteReady
    Last Transition Time:  2024-07-25T09:35:22Z
    Message:               Configuration "sklearn-iris-predictor" does not have any ready Revision.
    Reason:                RevisionMissing
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-07-25T09:35:22Z
    Reason:                PredictorRouteReady not ready
    Severity:              Info
    Status:                False
    Type:                  RoutesReady
  Model Status:
    Last Failure Info:
      Exit Code:  10
      Message:    
2024-07-25T09:41:34.445Z [storage-initializer]     response = self._get_next_page_response()
2024-07-25T09:41:34.445Z [storage-initializer]   File "/usr/local/lib/python3.10/dist-packages/google/api_core/page_iterator.py", line 432, in _get_next_page_response
2024-07-25T09:41:34.445Z [storage-initializer]     return self.api_request(
2024-07-25T09:41:34.445Z [storage-initializer]   File "/usr/local/lib/python3.10/dist-packages/google/cloud/storage/_http.py", line 78, in api_request
2024-07-25T09:41:34.445Z [storage-initializer]     return call()
2024-07-25T09:41:34.445Z [storage-initializer]   File "/usr/local/lib/python3.10/dist-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
2024-07-25T09:41:34.445Z [storage-initializer]     return retry_target(
2024-07-25T09:41:34.445Z [storage-initializer]   File "/usr/local/lib/python3.10/dist-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
2024-07-25T09:41:34.445Z [storage-initializer]     _retry_error_helper(
2024-07-25T09:41:34.445Z [storage-initializer]   File "/usr/local/lib/python3.10/dist-packages/google/api_core/retry/retry_base.py", line 221, in _retry_error_helper
2024-07-25T09:41:34.445Z [storage-initializer]     raise final_exc from source_exc
2024-07-25T09:41:34.445Z [storage-initializer] google.api_core.exceptions.RetryError: Timeout of 120.0s exceeded, last exception: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /storage/v1/b/kfserving-examples/o?projection=noAcl&prefix=models%2Fsklearn%2F1.0%2Fmodel%2F&prettyPrint=false (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f631010b160>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
2024-07-25T09:41:35.305Z [pebble] Service "storage-initializer" stopped unexpectedly with code 1
2024-07-25T09:41:35.305Z [pebble] Service "storage-initializer" on-failure action is "shutdown", triggering failure shutdown
2024-07-25T09:41:35.305Z [pebble] Server exiting!

We can see errors from the storage-initializer container, specifically:

2024-07-25T09:41:34.445Z [storage-initializer] google.api_core.exceptions.RetryError: Timeout of 120.0s exceeded, last exception: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /storage/v1/b/kfserving-examples/o?projection=noAcl&prefix=models%2Fsklearn%2F1.0%2Fmodel%2F&prettyPrint=false (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f631010b160>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

Looks like another internet access issue, where the container cannot download the model artifact. To resolve this, the storage-initializer container needs to go through the proxy so it can reach outside the machine and get to the model registry.

In this case, we need to pass the proxy envs to the storage-initializer container. I looked it up and found a similar issue in the kserve repo: https://github.com/kserve/kserve/issues/1348; it mentions that KServe 0.11.1 introduced ClusterStorageContainer, which allows modifying the storage-initializer spec.

In our kserve-controller charm, we are creating the ClusterStorageContainer CR using the cluster_storage_containers.yaml.j2 manifest template. We can extend this template to optionally set the proxy envs under spec.container.
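
Such an optional block could look roughly like the sketch below, assuming hypothetical template variables http_proxy, https_proxy, and no_proxy wired up from the charm config:

spec:
  container:
    image: {{ configmap__storageInitializer }}
    name: storage-initializer
    # Sketch only: render the env block when a (hypothetical) http_proxy
    # config value is provided by the charm.
{%- if http_proxy %}
    env:
    - name: HTTP_PROXY
      value: {{ http_proxy }}
    - name: HTTPS_PROXY
      value: {{ https_proxy }}
    - name: NO_PROXY
      value: {{ no_proxy }}
{%- endif %}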

To test this possible fix, I did the following:

  1. cloned the kserve-operators repo
  2. modified the cluster_storage_containers.yaml.j2 with the diff:
    @@ -6,6 +6,13 @@ spec:
       container:
         image: {{ configmap__storageInitializer }}
         name: storage-initializer
    +    env:
    +    - name: HTTP_PROXY
    +      value: http://10.0.13.50:3128
    +    - name: HTTPS_PROXY
    +      value: http://10.0.13.50:3128
    +    - name: NO_PROXY
    +      value: 10.152.183.0/24
         resources:
           limits:
             cpu: "1"

    where:

    • HTTP_PROXY and HTTPS_PROXY point at the proxy server
    • NO_PROXY is set to the service cluster IP range, as above
  3. packed the kserve-controller charm
  4. refreshed the charm in the bundle to the local one with:
    juju refresh kserve-controller --path=./kserve-controller_ubuntu-20.04-amd64.charm
  5. Kept using the local knative-serving charm with the fix from above
  6. Ran the kserve UATs, modified to use the PodDefault for the isvc pod as defined in https://github.com/canonical/charmed-kubeflow-uats/issues/76 (a sketch of such a PodDefault follows this list)
  7. Now the kserve UATs are passing and I can see the KSVC and ISVC being Ready! In the successful isvc's pod description, we can see the proxy envs set correctly for the init container:
    initContainers:
    - args:
      - gs://kfserving-examples/models/sklearn/1.0/model
      - /mnt/models
      env:
      - name: HTTP_PROXY
        value: http://10.0.13.50:3128
      - name: HTTPS_PROXY
        value: http://10.0.13.50:3128
      - name: NO_PROXY
        value: 10.152.183.0/24
      image: charmedkubeflow/storage-initializer:0.13.0-70e4564
      imagePullPolicy: IfNotPresent
      name: storage-initializer
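
For context, the PodDefault from step 6 would look roughly like the sketch below; it injects the proxy envs into pods matching the selector label (the notebook-proxy label matches the isvc Labels seen earlier, but the exact manifest in canonical/charmed-kubeflow-uats#76 may differ):

# Illustrative PodDefault; not necessarily the exact manifest from
# charmed-kubeflow-uats#76.
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: proxy-settings
  namespace: admin
spec:
  desc: Inject proxy environment variables
  selector:
    matchLabels:
      notebook-proxy: "true"
  env:
  - name: HTTP_PROXY
    value: http://10.0.13.50:3128
  - name: HTTPS_PROXY
    value: http://10.0.13.50:3128
  - name: NO_PROXY
    value: 10.152.183.0/24
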
NohaIhab commented 1 month ago

Conclusion

This issue in fact expanded into two issues that are currently blocking serving in CKF from working correctly behind a proxy:

  1. knative tag-to-digest resolution failing
  2. kserve model download failing

Both issues are due to the pods responsible for pulling these artifacts/data being unable to establish a connection to their targets; they need the proxy env vars set in order to be unblocked.

Proposed fix

  1. Introduce a proxy and no-proxy config to kserve-controller and knative-serving charms
  2. Modify the manifests template for each of:
    • KnativeServing.yaml.j2 in knative-serving charm
    • cluster_storage_containers.yaml.j2 in kserve-controller charm

so that the proxy env vars are set optionally, only when provided in the charm config.
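
A minimal sketch of what these options could look like in each charm's config.yaml (option names follow the naming discussion below; defaults and descriptions are illustrative):

# Sketch of possible charm config options; names and descriptions illustrative.
options:
  http-proxy:
    type: string
    default: ""
    description: Value to set as HTTP_PROXY in the managed workloads.
  https-proxy:
    type: string
    default: ""
    description: Value to set as HTTPS_PROXY in the managed workloads.
  no-proxy:
    type: string
    default: ""
    description: Comma-separated hosts/CIDRs to set as NO_PROXY, e.g. the service cluster IP range.

An admin would then set these once per charm, e.g. juju config knative-serving https-proxy=http://10.0.13.50:3128.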

kimwnasptd commented 1 month ago

@NohaIhab excellent job getting to the bottom of it and documenting everything!

I also agree with adding config options for the proxy to the kserve-controller and knative-serving charms, which will end up altering the applied manifests. It's actually quite nice to see the charms abstract this away and expose only a lean proxy configuration to the user!

My only concern is whether we should have just one proxy config option, or make it 1:1 with the env vars and introduce the following config options: http-proxy, https-proxy, no-proxy.

NohaIhab commented 1 month ago

@kimwnasptd good point. I think we can make it even clearer by specifying which container each proxy env is being set for. For example, in knative-serving, name the configs controller-http-proxy, controller-https-proxy, and controller-no-proxy. In kserve-controller, it can be storage-initializer-http-proxy, etc.

kimwnasptd commented 1 month ago

IMO, sticking to just expressing the functionality in the config name, and not also including the component, will provide a better UX.

Users don't necessarily care (AFAIK) about configuring different proxy settings for the different components of the same Knative charm. They care about telling the charm what the proxy values are, and then it's up to the charm to do all the configuration wherever necessary.

The above covers a hypothetical scenario where, in the future, we might need storage-initializer-http-proxy and queue-http-proxy (random thought) for the kserve-controller charm. Users wouldn't want to set different values for the different sub-components; they'd only want to pass the values once to the kserve-controller charm and have the charm configure everything.

orfeas-k commented 1 month ago

Closed by the linked PRs above.