canonical / knative-operators

Charmed Knative Operators

`queue-proxy` container image is not set in the `KnativeService`'s pod when configured in the charm #183

Closed: NohaIhab closed this issue 3 months ago

NohaIhab commented 4 months ago

Bug Description

Hit this issue while testing CKF 1.8 in an airgapped environment (related to https://github.com/canonical/bundle-kubeflow/issues/889 and https://github.com/canonical/bundle-kubeflow/issues/898): after configuring the queue-proxy image in the charm's custom images and then creating a KnativeService, the queue-proxy container in the KnativeService's pod uses the default image, not the one configured in the charm. This is a blocker for using Knative in an airgapped environment. See https://github.com/canonical/knative-operators/issues/140 for context on configuring the custom images. Looking at the knative-serving charm's config, we can see that the custom image is set there:

juju config knative-serving custom_images
activator: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/activator:c2994c2b6c2c7f38ad1b85c71789bf1753cc8979926423c83231e62258837cb9
autoscaler: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/autoscaler:8319aa662b4912e8175018bd7cc90c63838562a27515197b803bdcd5634c7007
controller: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/controller:98a2cc7fd62ee95e137116504e7166c32c65efef42c3d1454630780410abf943
webhook: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/webhook:4305209ce498caf783f39c8f3e85dfa635ece6947033bf50b0b627983fd65953
autoscaler-hpa: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/autoscaler-hpa:eb612b929eaa57ef1573a04035a8d06c9ed88fd56e741c56bd909f5f49e4d732
net-istio-controller/controller: 172.17.0.2:5000/knative-releases/knative.dev/net-istio/cmd/controller:27e7beb7c62036216fc464fb2181e56b030158ad4ceb57a7de172f54b4fe43db
net-istio-webhook/webhook: 172.17.0.2:5000/knative-releases/knative.dev/net-istio/cmd/webhook:0cdef272e39c57971ce9977765f164dd8e3abb9395a4f60e7a4160d57dcc09f2
queue-proxy: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/queue:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a
domain-mapping: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/domain-mapping:f66c41ad7a73f5d4f4bdfec4294d5459c477f09f3ce52934d1a215e32316b59b
domainmapping-webhook: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook:7368aaddf2be8d8784dc7195f5bc272ecfe49d429697f48de0ddc44f278167aa
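
For reference, a minimal sketch of how such a mapping can be applied and read back with the juju CLI (the custom_images.yaml file name is illustrative, not taken from this report):

# Set the overrides from a local YAML file containing the mapping shown above.
juju config knative-serving custom_images="$(cat custom_images.yaml)"

# Read back what the charm currently holds.
juju config knative-serving custom_images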

Looking at the KnativeServing CR, we can see that the queue-proxy field is set correctly in the registry section:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  creationTimestamp: "2024-05-28T12:09:54Z"
  finalizers:
  - knativeservings.operator.knative.dev
  generation: 2
  name: knative-serving
  namespace: knative-serving
  resourceVersion: "548475"
  uid: 2600de7a-1c4a-4dc6-a14e-aa9fd15fe19d
spec:
  config:
    domain:
      10.64.140.43.nip.io: ""
    istio:
      gateway.kubeflow.kubeflow-gateway: some-workload.knative-operator.svc.cluster.local
      local-gateway.knative-serving.knative-local-gateway: knative-local-gateway.kubeflow.svc.cluster.local
  registry:
    override:
      activator: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/activator:c2994c2b6c2c7f38ad1b85c71789bf1753cc8979926423c83231e62258837cb9
      autoscaler: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/autoscaler:8319aa662b4912e8175018bd7cc90c63838562a27515197b803bdcd5634c7007
      autoscaler-hpa: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/autoscaler-hpa:eb612b929eaa57ef1573a04035a8d06c9ed88fd56e741c56bd909f5f49e4d732
      controller: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/controller:98a2cc7fd62ee95e137116504e7166c32c65efef42c3d1454630780410abf943
      domain-mapping: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/domain-mapping:f66c41ad7a73f5d4f4bdfec4294d5459c477f09f3ce52934d1a215e32316b59b
      domainmapping-webhook: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook:7368aaddf2be8d8784dc7195f5bc272ecfe49d429697f48de0ddc44f278167aa
      net-istio-controller/controller: 172.17.0.2:5000/knative-releases/knative.dev/net-istio/cmd/controller:27e7beb7c62036216fc464fb2181e56b030158ad4ceb57a7de172f54b4fe43db
      net-istio-webhook/webhook: 172.17.0.2:5000/knative-releases/knative.dev/net-istio/cmd/webhook:0cdef272e39c57971ce9977765f164dd8e3abb9395a4f60e7a4160d57dcc09f2
      queue-proxy: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/queue:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a
      webhook: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/webhook:4305209ce498caf783f39c8f3e85dfa635ece6947033bf50b0b627983fd65953
  version: 1.10.2
status:
  conditions:
  - lastTransitionTime: "2024-05-28T12:12:20Z"
    status: "True"
    type: DependenciesInstalled
  - lastTransitionTime: "2024-05-29T10:48:32Z"
    status: "True"
    type: DeploymentsAvailable
  - lastTransitionTime: "2024-05-28T12:12:20Z"
    status: "True"
    type: InstallSucceeded
  - lastTransitionTime: "2024-05-29T10:48:32Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-05-28T12:12:02Z"
    status: "True"
    type: VersionMigrationEligible
  manifests:
  - /var/run/ko/knative-serving/1.10.2
  - /var/run/ko/ingress/1.10/istio
  observedGeneration: 2
  version: 1.10.2

However, it is not picked up by the KnativeService's pod:

Image:          gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a

The queue-proxy container still uses the default image, as shown above.
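
A quick way to compare the two sides (a sketch, not part of the original report; the namespace and pod label come from the pod description below):

# Image configured in the KnativeServing registry override:
kubectl get knativeserving knative-serving -n knative-serving \
  -o jsonpath='{.spec.registry.override.queue-proxy}'

# Image actually used by the queue-proxy container in the workload pod:
kubectl get pod -n admin -l serving.knative.dev/service=helloworld \
  -o jsonpath='{.items[0].spec.containers[?(@.name=="queue-proxy")].image}'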

To Reproduce

  1. Deploy CKF 1.8/stable in an airgapped environment
  2. Configure the queue-proxy image to point to the image in the local registry; in my case I set it to 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/queue:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a
  3. Create this KnativeService example:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: helloworld
      namespace: admin
    spec:
      template:
        spec:
          containers:
          - image: python:3.11.9-alpine
            command: [sleep, "3600"]
            ports:
            - containerPort: 8080
            env:
            - name: TARGET
              value: "World"
  4. Check the queue-proxy image in the KnativeService pod:
    pod description
kubectl describe po helloworld-00001-deployment-854f6f5f6d-sfbbf -nadmin
Name:             helloworld-00001-deployment-854f6f5f6d-sfbbf
Namespace:        admin
Priority:         0
Service Account:  default
Node:             airgapped-microk8s/10.85.129.236
Start Time:       Wed, 29 May 2024 11:10:03 +0000
Labels:           app=helloworld-00001
                  pod-template-hash=854f6f5f6d
                  security.istio.io/tlsMode=istio
                  service.istio.io/canonical-name=helloworld
                  service.istio.io/canonical-revision=helloworld-00001
                  serving.knative.dev/configuration=helloworld
                  serving.knative.dev/configurationGeneration=1
                  serving.knative.dev/configurationUID=135fa2d4-7270-4607-91b0-c859ec89af71
                  serving.knative.dev/revision=helloworld-00001
                  serving.knative.dev/revisionUID=82e82ada-8e20-4993-9158-9099212b0bee
                  serving.knative.dev/service=helloworld
                  serving.knative.dev/serviceUID=553ba036-a8d3-414b-b941-fab4a2263010
Annotations:      cni.projectcalico.org/containerID: d7341aeb83813809a63340ba03907673cd4a37d4060f7bf9135182b254f67c0e
                  cni.projectcalico.org/podIP: 10.1.205.179/32
                  cni.projectcalico.org/podIPs: 10.1.205.179/32
                  kubectl.kubernetes.io/default-container: user-container
                  kubectl.kubernetes.io/default-logs-container: user-container
                  prometheus.io/path: /stats/prometheus
                  prometheus.io/port: 15020
                  prometheus.io/scrape: true
                  serving.knative.dev/creator: admin
                  sidecar.istio.io/status:
                    {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["workload-socket","credential-socket","workload-certs","istio-env...
Status:           Pending
IP:               10.1.205.179
IPs:
  IP:           10.1.205.179
Controlled By:  ReplicaSet/helloworld-00001-deployment-854f6f5f6d
Init Containers:
  istio-init:
    Container ID:  containerd://60baff0104f694b5610059afbacdf0fe2b2bf9f9e7b6190350bcccdd45bad030
    Image:         172.17.0.2:5000/istio/proxyv2:1.17.3
    Image ID:      172.17.0.2:5000/istio/proxyv2@sha256:ea9373309e35569cf2a011e973aa1f49e0354fd2d730e8a0d5cc25964499a100
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
      --log_output_level=default:info
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 29 May 2024 11:10:05 +0000
      Finished:     Wed, 29 May 2024 11:10:05 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     128Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m7szg (ro)
Containers:
  user-container:
    Container ID:  containerd://ea66e01e58b900ce3a0d3abb839086f10005272e40337a6b53a7c327e42af696
    Image:         172.17.0.2:5000/python@sha256:df44c0c0761ddbd6388f4549cab42d24d64d257c2a960ad5b276bb7dab9639c7
    Image ID:      172.17.0.2:5000/python@sha256:df44c0c0761ddbd6388f4549cab42d24d64d257c2a960ad5b276bb7dab9639c7
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      sleep
      3600
    State:          Running
      Started:      Wed, 29 May 2024 11:10:06 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      TARGET:           World
      PORT:             8080
      K_REVISION:       helloworld-00001
      K_CONFIGURATION:  helloworld
      K_SERVICE:        helloworld
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m7szg (ro)
  queue-proxy:
    Container ID:   
    Image:          gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a
    Image ID:       
    Ports:          8022/TCP, 9090/TCP, 9091/TCP, 8012/TCP, 8112/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Waiting
      Reason:       ErrImagePull
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      25m
    Readiness:  http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      SERVING_NAMESPACE:                        admin
      SERVING_SERVICE:                          helloworld
      SERVING_CONFIGURATION:                    helloworld
      SERVING_REVISION:                         helloworld-00001
      QUEUE_SERVING_PORT:                       8012
      QUEUE_SERVING_TLS_PORT:                   8112
      CONTAINER_CONCURRENCY:                    0
      REVISION_TIMEOUT_SECONDS:                 300
      REVISION_RESPONSE_START_TIMEOUT_SECONDS:  0
      REVISION_IDLE_TIMEOUT_SECONDS:            0
      SERVING_POD:                              helloworld-00001-deployment-854f6f5f6d-sfbbf (v1:metadata.name)
      SERVING_POD_IP:                            (v1:status.podIP)
      SERVING_LOGGING_CONFIG:                   
      SERVING_LOGGING_LEVEL:                    
      SERVING_REQUEST_LOG_TEMPLATE:             {"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"}
      SERVING_ENABLE_REQUEST_LOG:               false
      SERVING_REQUEST_METRICS_BACKEND:          prometheus
      TRACING_CONFIG_BACKEND:                   none
      TRACING_CONFIG_ZIPKIN_ENDPOINT:           
      TRACING_CONFIG_DEBUG:                     false
      TRACING_CONFIG_SAMPLE_RATE:               0.1
      USER_PORT:                                8080
      SYSTEM_NAMESPACE:                         knative-serving
      METRICS_DOMAIN:                           knative.dev/internal/serving
      SERVING_READINESS_PROBE:                  {"tcpSocket":{"port":8080,"host":"127.0.0.1"},"successThreshold":1}
      ENABLE_PROFILING:                         false
      SERVING_ENABLE_PROBE_REQUEST_LOG:         false
      METRICS_COLLECTOR_ADDRESS:                
      HOST_IP:                                   (v1:status.hostIP)
      ENABLE_HTTP2_AUTO_DETECTION:              false
      ROOT_CA:                                  
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m7szg (ro)
  istio-proxy:
    Container ID:  containerd://afcad5017289db34701349ee19e7017b1573950ae909e73d32b574d5f31b7f27
    Image:         172.17.0.2:5000/istio/proxyv2:1.17.3
    Image ID:      172.17.0.2:5000/istio/proxyv2@sha256:ea9373309e35569cf2a011e973aa1f49e0354fd2d730e8a0d5cc25964499a100
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Running
      Started:      Wed, 29 May 2024 11:10:13 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   128Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.kubeflow.svc:15012
      POD_NAME:                      helloworld-00001-deployment-854f6f5f6d-sfbbf (v1:metadata.name)
      POD_NAMESPACE:                 admin (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      PROXY_CONFIG:                  {"discoveryAddress":"istiod.kubeflow.svc:15012","tracing":{"zipkin":{"address":"zipkin.kubeflow:9411"}}}

      ISTIO_META_POD_PORTS:          [
                                         {"name":"user-port","containerPort":8080,"protocol":"TCP"}
                                         ,{"name":"http-queueadm","containerPort":8022,"protocol":"TCP"}
                                         ,{"name":"http-autometric","containerPort":9090,"protocol":"TCP"}
                                         ,{"name":"http-usermetric","containerPort":9091,"protocol":"TCP"}
                                         ,{"name":"queue-port","containerPort":8012,"protocol":"TCP"}
                                         ,{"name":"https-port","containerPort":8112,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     user-container,queue-proxy
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_NODE_NAME:           (v1:spec.nodeName)
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      helloworld-00001-deployment
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/admin/deployments/helloworld-00001-deployment
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
      ISTIO_KUBE_APP_PROBERS:        {"/app-health/queue-proxy/readyz":{"httpGet":{"path":"/","port":8012,"scheme":"HTTP","httpHeaders":[{"name":"K-Network-Probe","value":"queue"}]},"timeoutSeconds":1}}
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/credential-uds from credential-socket (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m7szg (ro)
      /var/run/secrets/tokens from istio-token (rw)
      /var/run/secrets/workload-spiffe-credentials from workload-certs (rw)
      /var/run/secrets/workload-spiffe-uds from workload-socket (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  workload-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  credential-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  workload-certs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  kube-api-access-m7szg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  87s                default-scheduler  Successfully assigned admin/helloworld-00001-deployment-854f6f5f6d-sfbbf to airgapped-microk8s
  Normal   Pulled     86s                kubelet            Container image "172.17.0.2:5000/istio/proxyv2:1.17.3" already present on machine
  Normal   Created    85s                kubelet            Created container istio-init
  Normal   Started    85s                kubelet            Started container istio-init
  Normal   Pulled     85s                kubelet            Container image "172.17.0.2:5000/python@sha256:df44c0c0761ddbd6388f4549cab42d24d64d257c2a960ad5b276bb7dab9639c7" already present on machine
  Normal   Created    84s                kubelet            Created container user-container
  Normal   Started    84s                kubelet            Started container user-container
  Normal   Pulled     78s                kubelet            Container image "172.17.0.2:5000/istio/proxyv2:1.17.3" already present on machine
  Normal   Created    77s                kubelet            Created container istio-proxy
  Normal   Started    77s                kubelet            Started container istio-proxy
  Warning  Failed     54s (x2 over 78s)  kubelet            Error: ErrImagePull
  Normal   BackOff    40s (x4 over 77s)  kubelet            Back-off pulling image "gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a"
  Warning  Failed     40s (x4 over 77s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    29s (x3 over 84s)  kubelet            Pulling image "gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a"
  Warning  Failed     23s (x3 over 78s)  kubelet            Failed to pull image "gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a": rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a": failed to resolve reference "gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a": failed to do request: Head "https://gcr.io/v2/knative-releases/knative.dev/serving/cmd/queue/manifests/sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a": dial tcp 74.125.206.82:443: connect: no route to host

Environment

airgapped environment
microk8s 1.25-strict/stable
juju 3.1/stable

Relevant Log Output

Failed to pull image "gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a": rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a": failed to resolve reference "gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a": failed to do request: Head "https://gcr.io/v2/knative-releases/knative.dev/serving/cmd/queue/manifests/sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a": dial tcp 74.125.206.82:443: connect: no route to host

Additional Context

No response

syncronize-issues-to-jira[bot] commented 4 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5753.

This message was autogenerated

kimwnasptd commented 4 months ago

Looking around a bit in the upstream manifests, it looks like we need to configure the queue-proxy image in this part of the manifests: https://github.com/kubeflow/manifests/blob/v1.8.0/common/knative/knative-serving/base/upstream/serving-core.yaml#L4667

We'll need to ensure we can also configure this via the KnativeServing CR.
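
The KnativeServing operator maps entries under spec.config.<name> into the corresponding config-<name> ConfigMap, so a sketch of the relevant fragment could look like this (the image value is reused from the registry override above; at this point in the thread it is not yet a confirmed fix):

spec:
  config:
    deployment:
      # maps into the queue-sidecar-image key of the config-deployment ConfigMap
      queue-sidecar-image: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/queue:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a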

kimwnasptd commented 4 months ago

Lastly, in those manifests I also see an Image custom resource that also defines the image, which we will most probably need to patch as well

https://github.com/kubeflow/manifests/blob/v1.8.0/common/knative/knative-serving/base/upstream/serving-core.yaml#L4307-L4317

NohaIhab commented 3 months ago

Thanks @kimwnasptd for the pointers. Inspecting my airgapped cluster, I can indeed see the queue-sidecar-image set to the upstream image in the config-deployment ConfigMap in the knative-serving namespace:

kubectl get ConfigMap -n knative-serving config-deployment -oyaml
apiVersion: v1
data:
  _example: |-
    ################################
    #                              #
    #    EXAMPLE CONFIGURATION     #
    #                              #
    ################################

    # This block is not actually functional configuration,
    # but serves to illustrate the available configuration
    # options and document them in a way that is accessible
    # to users that `kubectl edit` this config map.
    #
    # These sample configuration options may be copied out of
    # this example block and unindented to be in the data block
    # to actually change the configuration.

    # List of repositories for which tag to digest resolving should be skipped
    registries-skipping-tag-resolving: "kind.local,ko.local,dev.local"

    # Maximum time allowed for an image's digests to be resolved.
    digest-resolution-timeout: "10s"

    # Duration we wait for the deployment to be ready before considering it failed.
    progress-deadline: "600s"

    # Sets the queue proxy's CPU request.
    # If omitted, a default value (currently "25m"), is used.
    queue-sidecar-cpu-request: "25m"

    # Sets the queue proxy's CPU limit.
    # If omitted, no value is specified and the system default is used.
    queue-sidecar-cpu-limit: "1000m"

    # Sets the queue proxy's memory request.
    # If omitted, no value is specified and the system default is used.
    queue-sidecar-memory-request: "400Mi"

    # Sets the queue proxy's memory limit.
    # If omitted, no value is specified and the system default is used.
    queue-sidecar-memory-limit: "800Mi"

    # Sets the queue proxy's ephemeral storage request.
    # If omitted, no value is specified and the system default is used.
    queue-sidecar-ephemeral-storage-request: "512Mi"

    # Sets the queue proxy's ephemeral storage limit.
    # If omitted, no value is specified and the system default is used.
    queue-sidecar-ephemeral-storage-limit: "1024Mi"

    # Sets tokens associated with specific audiences for queue proxy - used by QPOptions
    #
    # For example, to add the `service-x` audience:
    #    queue-sidecar-token-audiences: "service-x"
    # Also supports a list of audiences, for example:
    #    queue-sidecar-token-audiences: "service-x,service-y"
    # If omitted, or empty, no tokens are created
    queue-sidecar-token-audiences: ""

    # Sets rootCA for the queue proxy - used by QPOptions
    # If omitted, or empty, no rootCA is added to the golang rootCAs
    queue-sidecar-rootca: ""
  queue-sidecar-image: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a
kind: ConfigMap
metadata:
  annotations:
    knative.dev/example-checksum: 410041a0
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"_example":"################################\n#                              #\n#    EXAMPLE CONFIGURATION     #\n#                              #\n################################\n\n# This block is not actually functional configuration,\n# but serves to illustrate the available configuration\n# options and document them in a way that is accessible\n# to users that `kubectl edit` this config map.\n#\n# These sample configuration options may be copied out of\n# this example block and unindented to be in the data block\n# to actually change the configuration.\n\n# List of repositories for which tag to digest resolving should be skipped\nregistries-skipping-tag-resolving: \"kind.local,ko.local,dev.local\"\n\n# Maximum time allowed for an image's digests to be resolved.\ndigest-resolution-timeout: \"10s\"\n\n# Duration we wait for the deployment to be ready before considering it failed.\nprogress-deadline: \"600s\"\n\n# Sets the queue proxy's CPU request.\n# If omitted, a default value (currently \"25m\"), is used.\nqueue-sidecar-cpu-request: \"25m\"\n\n# Sets the queue proxy's CPU limit.\n# If omitted, no value is specified and the system default is used.\nqueue-sidecar-cpu-limit: \"1000m\"\n\n# Sets the queue proxy's memory request.\n# If omitted, no value is specified and the system default is used.\nqueue-sidecar-memory-request: \"400Mi\"\n\n# Sets the queue proxy's memory limit.\n# If omitted, no value is specified and the system default is used.\nqueue-sidecar-memory-limit: \"800Mi\"\n\n# Sets the queue proxy's ephemeral storage request.\n# If omitted, no value is specified and the system default is used.\nqueue-sidecar-ephemeral-storage-request: \"512Mi\"\n\n# Sets the queue proxy's ephemeral storage limit.\n# If omitted, no value is specified and the system default is used.\nqueue-sidecar-ephemeral-storage-limit: \"1024Mi\"\n\n# Sets tokens associated with specific audiences for queue proxy - used by QPOptions\n#\n# For example, to add the `service-x` audience:\n#    queue-sidecar-token-audiences: \"service-x\"\n# Also supports a list of audiences, for example:\n#    queue-sidecar-token-audiences: \"service-x,service-y\"\n# If omitted, or empty, no tokens are created\nqueue-sidecar-token-audiences: \"\"\n\n# Sets rootCA for the queue proxy - used by QPOptions\n# If omitted, or empty, no rootCA is added to the golang rootCAs\nqueue-sidecar-rootca: \"\"","queue-sidecar-image":"gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a"},"kind":"ConfigMap","metadata":{"annotations":{"knative.dev/example-checksum":"410041a0"},"labels":{"app.kubernetes.io/component":"controller","app.kubernetes.io/name":"knative-serving","app.kubernetes.io/version":"1.10.2"},"name":"config-deployment","namespace":"knative-serving","ownerReferences":[{"apiVersion":"operator.knative.dev/v1beta1","blockOwnerDeletion":true,"controller":true,"kind":"KnativeServing","name":"knative-serving","uid":"8994c6ac-bdb5-476e-a791-cc493d0481e0"}]}}
    manifestival: new
  creationTimestamp: "2024-06-05T11:36:05Z"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: knative-serving
    app.kubernetes.io/version: 1.10.2
  name: config-deployment
  namespace: knative-serving
  ownerReferences:
  - apiVersion: operator.knative.dev/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: KnativeServing
    name: knative-serving
    uid: 8994c6ac-bdb5-476e-a791-cc493d0481e0
  resourceVersion: "17546"
  uid: 05551649-ea37-4961-b776-ed49069a7f1e
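
To read just that key without dumping the whole ConfigMap (a sketch, not from the original comment):

kubectl get configmap config-deployment -n knative-serving \
  -o jsonpath='{.data.queue-sidecar-image}'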

NohaIhab commented 3 months ago

I was not able to reproduce this issue in an airgapped environment today. I am seeing this error when trying to apply the KnativeService:

Error from server (InternalError): error when creating "ksvc.yaml": Internal error occurred: failed calling webhook "webhook.serving.knative.dev": failed to call webhook: Post "https://webhook.knative-serving.svc:443/?timeout=10s": dial tcp 10.152.183.91:443: connect: connection refused

I looked into it and filed #185. After setting the images specified in #185, I no longer see the error above, so I can now start experimenting with editing the ConfigMap and the Image resource.

NohaIhab commented 3 months ago

I modified the KnativeServing.yaml.j2 file locally by adding the following to spec.config.deployment:

queue-sidecar-image: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/queue:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a

This was to test whether this is what we need for the KnativeService workload. Now I see the queue-proxy image set correctly in the pod:

KnativeService workload pod description ``` kubectl describe po -nadmin helloworld-00001-deployment-dbc8767d4-dx72q Name: helloworld-00001-deployment-dbc8767d4-dx72q Namespace: admin Priority: 0 Service Account: default Node: airgapped-microk8s/10.254.162.100 Start Time: Thu, 06 Jun 2024 09:34:48 +0000 Labels: app=helloworld-00001 pod-template-hash=dbc8767d4 security.istio.io/tlsMode=istio service.istio.io/canonical-name=helloworld service.istio.io/canonical-revision=helloworld-00001 serving.knative.dev/configuration=helloworld serving.knative.dev/configurationGeneration=1 serving.knative.dev/configurationUID=c0074455-8e69-407b-a2a3-e8a485298fc4 serving.knative.dev/revision=helloworld-00001 serving.knative.dev/revisionUID=fdf829b1-6b34-48cc-acd2-24e5cb44cf28 serving.knative.dev/service=helloworld serving.knative.dev/serviceUID=5efd018d-52d4-41e1-b145-6a8e91e781a7 Annotations: cni.projectcalico.org/containerID: 8971dbb25490761f17413bcd41daa04bf05ea61e953e05c690f7d877da15f198 cni.projectcalico.org/podIP: 10.1.205.185/32 cni.projectcalico.org/podIPs: 10.1.205.185/32 kubectl.kubernetes.io/default-container: user-container kubectl.kubernetes.io/default-logs-container: user-container prometheus.io/path: /stats/prometheus prometheus.io/port: 15020 prometheus.io/scrape: true serving.knative.dev/creator: admin sidecar.istio.io/status: {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["workload-socket","credential-socket","workload-certs","istio-env... Status: Running IP: 10.1.205.185 IPs: IP: 10.1.205.185 Controlled By: ReplicaSet/helloworld-00001-deployment-dbc8767d4 Init Containers: istio-init: Container ID: containerd://49418a7abdd082d4779564f257a7a19999b25fbb1fad4d42fd486a57599d7645 Image: 172.17.0.2:5000/istio/proxyv2:1.17.3 Image ID: 172.17.0.2:5000/istio/proxyv2@sha256:ea9373309e35569cf2a011e973aa1f49e0354fd2d730e8a0d5cc25964499a100 Port: Host Port: Args: istio-iptables -p 15001 -z 15006 -u 1337 -m REDIRECT -i * -x -b * -d 15090,15021,15020 --log_output_level=default:info State: Terminated Reason: Completed Exit Code: 0 Started: Thu, 06 Jun 2024 09:34:49 +0000 Finished: Thu, 06 Jun 2024 09:34:49 +0000 Ready: True Restart Count: 0 Limits: cpu: 2 memory: 1Gi Requests: cpu: 100m memory: 128Mi Environment: Mounts: /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g264f (ro) Containers: user-container: Container ID: containerd://ef220fe7d9d7333064b9564e615ae40d87cda530f57da85b1ad464f24f644963 Image: 172.17.0.2:5000/python@sha256:df44c0c0761ddbd6388f4549cab42d24d64d257c2a960ad5b276bb7dab9639c7 Image ID: 172.17.0.2:5000/python@sha256:df44c0c0761ddbd6388f4549cab42d24d64d257c2a960ad5b276bb7dab9639c7 Port: 8080/TCP Host Port: 0/TCP Command: sleep 3600 State: Running Started: Thu, 06 Jun 2024 09:34:51 +0000 Ready: True Restart Count: 0 Environment: TARGET: World PORT: 8080 K_REVISION: helloworld-00001 K_CONFIGURATION: helloworld K_SERVICE: helloworld Mounts: /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g264f (ro) queue-proxy: Container ID: containerd://6fd63941bed441263c860ea2e3bf83ff52d1fef7ad107a1ab9f9af144bf9286e Image: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/queue:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a Image ID: 172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/queue@sha256:20a2e88595a443f1fe9db8c41cc985e4139f7e587d9a48ad0f30519cacd831f5 Ports: 8022/TCP, 9090/TCP, 9091/TCP, 8012/TCP, 8112/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP State: Running Started: Thu, 06 Jun 2024 09:34:52 
+0000 Ready: False Restart Count: 0 Requests: cpu: 25m Readiness: http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=10s #success=1 #failure=3 Environment: SERVING_NAMESPACE: admin SERVING_SERVICE: helloworld SERVING_CONFIGURATION: helloworld SERVING_REVISION: helloworld-00001 QUEUE_SERVING_PORT: 8012 QUEUE_SERVING_TLS_PORT: 8112 CONTAINER_CONCURRENCY: 0 REVISION_TIMEOUT_SECONDS: 300 REVISION_RESPONSE_START_TIMEOUT_SECONDS: 0 REVISION_IDLE_TIMEOUT_SECONDS: 0 SERVING_POD: helloworld-00001-deployment-dbc8767d4-dx72q (v1:metadata.name) SERVING_POD_IP: (v1:status.podIP) SERVING_LOGGING_CONFIG: SERVING_LOGGING_LEVEL: SERVING_REQUEST_LOG_TEMPLATE: {"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"} SERVING_ENABLE_REQUEST_LOG: false SERVING_REQUEST_METRICS_BACKEND: prometheus TRACING_CONFIG_BACKEND: none TRACING_CONFIG_ZIPKIN_ENDPOINT: TRACING_CONFIG_DEBUG: false TRACING_CONFIG_SAMPLE_RATE: 0.1 USER_PORT: 8080 SYSTEM_NAMESPACE: knative-serving METRICS_DOMAIN: knative.dev/internal/serving SERVING_READINESS_PROBE: {"tcpSocket":{"port":8080,"host":"127.0.0.1"},"successThreshold":1} ENABLE_PROFILING: false SERVING_ENABLE_PROBE_REQUEST_LOG: false METRICS_COLLECTOR_ADDRESS: HOST_IP: (v1:status.hostIP) ENABLE_HTTP2_AUTO_DETECTION: false ROOT_CA: Mounts: /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g264f (ro) istio-proxy: Container ID: containerd://a88c348293484519e7ae0f76fd0552b1adcacb8eff86d0a8154d6a2075aeeaa0 Image: 172.17.0.2:5000/istio/proxyv2:1.17.3 Image ID: 172.17.0.2:5000/istio/proxyv2@sha256:ea9373309e35569cf2a011e973aa1f49e0354fd2d730e8a0d5cc25964499a100 Port: 15090/TCP Host Port: 0/TCP Args: proxy sidecar --domain $(POD_NAMESPACE).svc.cluster.local --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2 State: Running Started: Thu, 06 Jun 2024 09:34:52 +0000 Ready: True Restart Count: 0 Limits: cpu: 2 memory: 1Gi Requests: cpu: 100m memory: 128Mi Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30 Environment: JWT_POLICY: third-party-jwt PILOT_CERT_PROVIDER: istiod CA_ADDR: istiod.kubeflow.svc:15012 POD_NAME: helloworld-00001-deployment-dbc8767d4-dx72q (v1:metadata.name) POD_NAMESPACE: admin (v1:metadata.namespace) INSTANCE_IP: (v1:status.podIP) SERVICE_ACCOUNT: (v1:spec.serviceAccountName) HOST_IP: (v1:status.hostIP) PROXY_CONFIG: {"discoveryAddress":"istiod.kubeflow.svc:15012","tracing":{"zipkin":{"address":"zipkin.kubeflow:9411"}}} ISTIO_META_POD_PORTS: [ {"name":"user-port","containerPort":8080,"protocol":"TCP"} ,{"name":"http-queueadm","containerPort":8022,"protocol":"TCP"} ,{"name":"http-autometric","containerPort":9090,"protocol":"TCP"} ,{"name":"http-usermetric","containerPort":9091,"protocol":"TCP"} ,{"name":"queue-port","containerPort":8012,"protocol":"TCP"} ,{"name":"https-port","containerPort":8112,"protocol":"TCP"} ] ISTIO_META_APP_CONTAINERS: user-container,queue-proxy ISTIO_META_CLUSTER_ID: Kubernetes ISTIO_META_NODE_NAME: (v1:spec.nodeName) ISTIO_META_INTERCEPTION_MODE: REDIRECT 
ISTIO_META_WORKLOAD_NAME: helloworld-00001-deployment ISTIO_META_OWNER: kubernetes://apis/apps/v1/namespaces/admin/deployments/helloworld-00001-deployment ISTIO_META_MESH_ID: cluster.local TRUST_DOMAIN: cluster.local ISTIO_KUBE_APP_PROBERS: {"/app-health/queue-proxy/readyz":{"httpGet":{"path":"/","port":8012,"scheme":"HTTP","httpHeaders":[{"name":"K-Network-Probe","value":"queue"}]},"timeoutSeconds":1}} Mounts: /etc/istio/pod from istio-podinfo (rw) /etc/istio/proxy from istio-envoy (rw) /var/lib/istio/data from istio-data (rw) /var/run/secrets/credential-uds from credential-socket (rw) /var/run/secrets/istio from istiod-ca-cert (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g264f (ro) /var/run/secrets/tokens from istio-token (rw) /var/run/secrets/workload-spiffe-credentials from workload-certs (rw) /var/run/secrets/workload-spiffe-uds from workload-socket (rw) Conditions: Type Status Initialized True Ready False ContainersReady False PodScheduled True Volumes: workload-socket: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: credential-socket: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: workload-certs: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: istio-envoy: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: Memory SizeLimit: istio-data: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: istio-podinfo: Type: DownwardAPI (a volume populated by information about the pod) Items: metadata.labels -> labels metadata.annotations -> annotations istio-token: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 43200 istiod-ca-cert: Type: ConfigMap (a volume populated by a ConfigMap) Name: istio-ca-root-cert Optional: false kube-api-access-g264f: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: DownwardAPI: true QoS Class: Burstable Node-Selectors: Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning Unhealthy 2m10s (x3 over 2m30s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503 ```

So this fixes the ImagePullBackOff error we were seeing earlier.

I also saw the Image resource being set correctly, so we don't need to patch it:

kubectl get Images -nknative-serving
NAME          IMAGE
queue-proxy   172.17.0.2:5000/knative-releases/knative.dev/serving/cmd/queue:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a

I will be sending a PR to add the queue-sidecar-image to the KnativeServing.yaml.j2 manifest template
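
A rough sketch of the kind of template change being described, assuming the same spec.config.deployment passthrough shown earlier (the Jinja variable name is hypothetical and not taken from the actual PR):

# Excerpt of KnativeServing.yaml.j2 (sketch only)
spec:
  config:
    deployment:
      queue-sidecar-image: {{ queue_proxy_image }}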

NohaIhab commented 3 months ago

After configuring the queue-sidecar-image correctly, the KnativeService is still not Ready as desired. The workload pod is stuck with 2/3 containers Ready. The queue-proxy container is not Ready, but no longer due to ImagePullBackOff. The pod description says:

  Warning  Unhealthy  2m55s (x7 over 4m8s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  Unhealthy  2m54s (x5 over 4m9s)  kubelet            Readiness probe failed: Get "http://10.1.205.185:15020/app-health/queue-proxy/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

The logs of the queue-proxy container are:

kubectl logs -nadmin helloworld-00001-deployment-dbc8767d4-dx72q -c queue-proxy
{"severity":"INFO","timestamp":"2024-06-06T09:34:52.152001373Z","logger":"queueproxy","caller":"sharedmain/main.go:259","message":"Starting queue-proxy","commit":"500756c","knative.dev/key":"admin/helloworld-00001","knative.dev/pod":"helloworld-00001-deployment-dbc8767d4-dx72q"}
{"severity":"INFO","timestamp":"2024-06-06T09:34:52.152352674Z","logger":"queueproxy","caller":"sharedmain/main.go:265","message":"Starting http server metrics:9090","commit":"500756c","knative.dev/key":"admin/helloworld-00001","knative.dev/pod":"helloworld-00001-deployment-dbc8767d4-dx72q"}
{"severity":"INFO","timestamp":"2024-06-06T09:34:52.152348217Z","logger":"queueproxy","caller":"sharedmain/main.go:265","message":"Starting http server main:8012","commit":"500756c","knative.dev/key":"admin/helloworld-00001","knative.dev/pod":"helloworld-00001-deployment-dbc8767d4-dx72q"}
{"severity":"INFO","timestamp":"2024-06-06T09:34:52.15236817Z","logger":"queueproxy","caller":"sharedmain/main.go:265","message":"Starting http server admin:8022","commit":"500756c","knative.dev/key":"admin/helloworld-00001","knative.dev/pod":"helloworld-00001-deployment-dbc8767d4-dx72q"}
aggressive probe error (failed 202 times): dial tcp 127.0.0.1:8080: connect: connection refused
timed out waiting for the condition
aggressive probe error (failed 202 times): dial tcp 127.0.0.1:8080: connect: connection refused
timed out waiting for the condition
aggressive probe error (failed 202 times): dial tcp 127.0.0.1:8080: connect: connection refused
timed out waiting for the condition
aggressive probe error (failed 202 times): dial tcp 127.0.0.1:8080: connect: connection refused
timed out waiting for the condition
aggressive probe error (failed 202 times): dial tcp 127.0.0.1:8080: connect: connection refused
timed out waiting for the condition

I'm not sure yet why this is happening, so I'm looking into it.

NohaIhab commented 3 months ago

I tried running the dummy ksvc from the description in a non-airgapped environment and saw the same error. Therefore, this is unrelated to this issue, and I will explore the ksvc example in https://github.com/canonical/bundle-kubeflow/issues/917
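
One plausible reading (an assumption on my part, not confirmed in this thread): the example container only runs sleep 3600 and never listens on port 8080, so the queue-proxy readiness probe (SERVING_READINESS_PROBE with a tcpSocket check on port 8080, as shown in the pod description above) has nothing to connect to. A sketch of a KnativeService whose container does listen on its declared port, reusing the python image from the example (for airgapped use the image would still need to be mirrored to the local registry):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
  namespace: admin
spec:
  template:
    spec:
      containers:
      # Any image that serves HTTP on the declared containerPort would do; this one is illustrative.
      - image: python:3.11.9-alpine
        command: ["python", "-m", "http.server", "8080"]
        ports:
        - containerPort: 8080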

NohaIhab commented 3 months ago

PR #186 is now open to add the config for the queue image.

NohaIhab commented 3 months ago

Closed by #186 and forward-ported to main in #189.