canonical / knative-operators

Charmed Knative Operators

Test Knative in airgapped CKF #140

Closed: kimwnasptd closed this issue 1 year ago

kimwnasptd commented 1 year ago

Right now we allow users to configure the following images for Knative Serving/Eventing: https://github.com/canonical/knative-operators/blob/5caa2db37c0a7366d45a7b6aaa6add9946eb04d7/charms/knative-serving/config.yaml#L25-L34 https://github.com/canonical/knative-operators/blob/5caa2db37c0a7366d45a7b6aaa6add9946eb04d7/charms/knative-eventing/config.yaml#L11-L15
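
For reference, such image options follow the usual charm config.yaml shape. The sketch below only illustrates that shape; the actual option names, types and defaults live at the line ranges linked above, and the custom_images name is taken from the later comments in this thread:

options:
  custom_images:
    type: string
    default: ""
    description: |
      YAML dictionary mapping Knative image names to the image references
      the charm should use when rendering the Serving/Eventing manifests.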

But once I run microk8s ctr images ls, I see the following relevant Knative images:

gcr.io/knative-releases/knative.dev/eventing/cmd/broker/filter@sha256:33ea8a657b974d7bf3d94c0b601a4fc287c1fb33430b3dda028a1a189e3d9526
gcr.io/knative-releases/knative.dev/eventing/cmd/broker/ingress@sha256:f4a9dfce9eec5272c90a19dbdf791fffc98bc5a6649ee85cb8a29bd5145635b1
gcr.io/knative-releases/knative.dev/eventing/cmd/controller@sha256:cbc452f35842cc8a78240642adc1ebb11a4c4d7c143c8277edb49012f6cfc5d3
gcr.io/knative-releases/knative.dev/eventing/cmd/in_memory/channel_controller@sha256:3ced549336c7ccf3bb2adf23a558eb55bd1aec7be17837062d21c749dfce8ce5
gcr.io/knative-releases/knative.dev/eventing/cmd/in_memory/channel_dispatcher@sha256:e17bbdf951868359424cd0a0465da8ef44c66ba7111292444ce555c83e280f1a
gcr.io/knative-releases/knative.dev/eventing/cmd/mtchannel_broker@sha256:c5d3664780b394f6d3e546eb94c972965fbd9357da5e442c66455db7ca94124c
gcr.io/knative-releases/knative.dev/eventing/cmd/webhook@sha256:c9c582f530155d22c01b43957ae0dba549b1cc903f77ec6cc1acb9ae9085be62
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:2b484d982ef1a5d6ff93c46d3e45f51c2605c2e3ed766e20247d1727eb5ce918
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:59b6a46d3b55a03507c76a3afe8a4ee5f1a38f1130fd3d65c9fe57fff583fa8d
gcr.io/knative-releases/knative.dev/pkg/apiextensions/storageversion/cmd/migrate@sha256:59431cf8337532edcd9a4bcd030591866cc867f13bee875d81757c960a53668d
gcr.io/knative-releases/knative.dev/pkg/apiextensions/storageversion/cmd/migrate@sha256:d0095787bc1687e2d8180b36a66997733a52f8c49c3e7751f067813e3fb54b66
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c3bbf3a96920048869dcab8e133e00f59855670b8a0bbca3d72ced2f512eb5e1
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler-hpa@sha256:7003443f0faabbaca12249aa16b73fa171bddf350abd826dd93b06f5080a146d
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:caae5e34b4cb311ed8551f2778cfca566a77a924a59b775bd516fa8b5e3c1d7f
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:38f9557f4d61ec79cc2cdbe76da8df6c6ae5f978a50a2847c22cc61aa240da95
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:a4ba0076df2efaca2eed561339e21b3a4ca9d90167befd31de882bff69639470
gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:bc13765ba4895c0fa318a065392d05d0adc0e20415c739e0aacb3f56140bf9ae

From the above list of images reported in MicroK8s, it seems a couple of images are not part of the Serving CR. We'll have to make sure that:

  1. We know which images to use in the Serving CR
  2. The Serving CR exposes all the necessary fields for configuring Knative to run in airgapped environments
kimwnasptd commented 1 year ago

As part of this effort we found out that the Knative docs don't explain how to:

  1. Set the domain-mapping Deployment image
  2. Set the domain-mapping-webhook Deployment image

We need to find a way to configure these images in the KnativeServing CR https://knative.dev/docs/install/operator/configuring-serving-cr/#download-images-individually-without-secrets
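
For context, the linked docs configure per-Deployment images through the CR's spec.registry.override map, roughly as in the sketch below (the API version and image references are assumptions based on the Knative 1.8-era docs, not taken from this thread); the problem is that the documented keys do not cover the two Deployments above:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  registry:
    override:
      # Each key is expected to name a workload/container whose image gets replaced.
      activator: registry.example.com/knative/activator:1.8.0
      autoscaler: registry.example.com/knative/autoscaler:1.8.0
      controller: registry.example.com/knative/controller:1.8.0
      webhook: registry.example.com/knative/webhook:1.8.0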

kimwnasptd commented 1 year ago

Looking a little bit into the Knative Operator code, I found out that it works the following way:

  1. It finds all Deployments that have ownerReferences to the KnativeServing CR https://github.com/knative/operator/blob/b46a2d38c7e60edcbead2337db0e2d108ca97f5b/pkg/reconciler/common/images.go#L59
  2. It gets each Deployment's PodSpec
  3. For each container in the PodSpec, it checks whether there's a spec.registry.override key for that container https://github.com/knative/operator/blob/main/pkg/reconciler/common/images.go#L107
  4. If there is, it replaces that container's image with the value of that spec.registry.override key

This means that if a spec.registry.override key of the KnativeServing CR matches the name of a container in any Deployment owned by that CR, then the operator will replace that container's image with the value from the registry override, as sketched below.
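
A minimal sketch of that matching, using domain-mapping as the example (manifests heavily abbreviated; the override value and registry host are placeholders, and the container name is assumed to match the Deployment name, which is what the workaround below relies on):

# Deployment owned by the KnativeServing CR (abbreviated)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: domain-mapping
  namespace: knative-serving
spec:
  template:
    spec:
      containers:
        - name: domain-mapping    # this container name is what the operator looks up
          image: gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:763d648bf1edee2b4471b0e211dbc53ba2d28f92e4dae28ccd39af7185ef2c96

# Entry under the KnativeServing CR's spec that would replace the image above,
# because the key matches the container name:
registry:
  override:
    domain-mapping: registry.example.com/knative/domain-mapping:1.8.0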

kimwnasptd commented 1 year ago

So with the above, we can try using the container names of the domain-mapping and domain-mapping-webhook Deployments as spec.registry.override keys and override their images.

orfeas-k commented 1 year ago

Looks like this is how the custom images feature has been designed to work, thus we can add domain-mapping and domain-mapping-webhook to the charm's custom_images config value (which is a simple dictionary).
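
A sketch of what that dictionary could contain, assuming the charm passes these entries straight through to the registry override (keys and registry host are illustrative, not verified against the charm's code):

domain-mapping: registry.example.com/knative/domain-mapping:1.8.0
domain-mapping-webhook: registry.example.com/knative/domain-mapping-webhook:1.8.0

Setting this as the value of custom_images (e.g. with juju config knative-serving) should then be enough for the operator to pick up the two extra override keys.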

orfeas-k commented 1 year ago

As mentioned in https://github.com/canonical/bundle-kubeflow/issues/680, we ran into https://github.com/canonical/knative-operators/issues/147, so for knative-serving we will be configuring it to use 1.8.0 (knative-eventing already uses 1.8.0).

orfeas-k commented 1 year ago

Deploying the knative charms in an airgapped environment works as expected, apart from the activator deployment in the knative-serving namespace. Although the pod starts running, its container never becomes ready and it constantly logs the following:

{"severity":"ERROR","timestamp":"2023-09-04T08:41:28.454200818Z","logger":"activator","caller":"websocket/connection.go:144","message":"Websocket connection could not be established","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f","error":"dial tcp: lookup autoscaler.knative-serving.svc.cluster.local: i/o timeout","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func1\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:144\nknative.dev/pkg/websocket.(*ManagedConnection).connect.func1\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:225\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:222\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:235\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:228\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:423\nknative.dev/pkg/websocket.(*ManagedConnection).connect\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:222\nknative.dev/pkg/websocket.NewDurableConnection.func2\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:162"}
{"severity":"ERROR","timestamp":"2023-09-04T08:41:28.787749703Z","logger":"activator","caller":"websocket/connection.go:191","message":"Failed to send ping message to ws://autoscaler.knative-serving.svc.cluster.local:8080","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f","error":"connection has not yet been established","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func3\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:191"}
{"severity":"WARNING","timestamp":"2023-09-04T08:41:31.05744278Z","logger":"activator","caller":"handler/healthz_handler.go:36","message":"Healthcheck failed: connection has not yet been established","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f"}

Trying to debug this, we also deployed the above charms in a non-airgapped environment and noticed that the pod has the same logs there too, but its container does become ready. Investigating further inside the airgapped env, we noticed the following in the CoreDNS pod's logs:

[INFO] 10.1.205.153:40339 - 44253 "AAAA IN autoscaler.knative-serving.svc.cluster.local.lxd. udp 66 false 512" - - 0 2.000241772s
[INFO] 10.1.205.153:56166 - 44510 "A IN autoscaler.knative-serving.svc.cluster.local.lxd. udp 66 false 512" - - 0 2.000258023s
[ERROR] plugin/errors: 2 autoscaler.knative-serving.svc.cluster.local.lxd. AAAA: read udp 10.1.205.163:34994->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 autoscaler.knative-serving.svc.cluster.local.lxd. A: read udp 10.1.205.163:34020->8.8.4.4:53: i/o timeout

Looking at the above, we started to believe that this has to do with the way our airgapped environment is set up (more info about the environment here: https://github.com/canonical/bundle-kubeflow/pull/682).

Specifically, we're led to believe that the slow responses to the requests towards 8.8.x.x result in a timeout that blocks the container from becoming READY.

Solution

Configure the airgapped environment to immediately reject requests towards destinations outside the cluster, so that external DNS lookups fail fast instead of timing out.