As part of this effort we found out that the docs from Knative don't expose to us how to configure:

- the `domain-mapping` Deployment image
- the `domain-mapping-webhook` Deployment image

We need to find a way to configure these images in the KnativeServing CR: https://knative.dev/docs/install/operator/configuring-serving-cr/#download-images-individually-without-secrets
Looking a little bit into the Knative Operator code, I found out that it works the following way: if a key under `spec.registry.override` in the KnativeServing CR matches the name of a container in any Deployment owned by that CR, then the operator will replace that container's image with the value of that key.
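A minimal sketch of what that could look like, using the container names discussed in this issue; the registry host (`registry.example.com`) and tags are placeholders:

```yaml
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  registry:
    override:
      # Keys must match container names in Deployments owned by this CR;
      # the images below are placeholders for an airgapped/private registry.
      domain-mapping: registry.example.com/knative/domain-mapping:1.8.0
      domain-mapping-webhook: registry.example.com/knative/domain-mapping-webhook:1.8.0
```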
So with the above we can try setting the container names of the `domain-mapping` and `domain-mapping-webhook` Deployments as override keys and replace their images.
Looks like this is how the custom images feature has been designed to work, thus we can add `domain-mapping` and `domain-mapping-webhook` to the charm's config value `custom_images` (which is a simple dictionary).
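For illustration, the new entries could look like the following sketch; the image references are placeholders, and the keys would need to match the container names above:

```yaml
# Hypothetical additions to the charm's custom_images dictionary.
domain-mapping: registry.example.com/knative/domain-mapping:1.8.0
domain-mapping-webhook: registry.example.com/knative/domain-mapping-webhook:1.8.0
```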
As mentioned in https://github.com/canonical/bundle-kubeflow/issues/680, we ran into https://github.com/canonical/knative-operators/issues/147, so for knative-serving we will be configuring it to use 1.8.0 (knative-eventing already uses 1.8.0).
Deploying the knative charms in an airgapped environment works as expected apart from the `activator` deployment in the `knative-serving` namespace. Although the pod starts running, its container never becomes ready, and it constantly logs the following:
{"severity":"ERROR","timestamp":"2023-09-04T08:41:28.454200818Z","logger":"activator","caller":"websocket/connection.go:144","message":"Websocket connection could not be established","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f","error":"dial tcp: lookup autoscaler.knative-serving.svc.cluster.local: i/o timeout","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func1\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:144\nknative.dev/pkg/websocket.(*ManagedConnection).connect.func1\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:225\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:222\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:235\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:228\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\tk8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:423\nknative.dev/pkg/websocket.(*ManagedConnection).connect\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:222\nknative.dev/pkg/websocket.NewDurableConnection.func2\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:162"}
{"severity":"ERROR","timestamp":"2023-09-04T08:41:28.787749703Z","logger":"activator","caller":"websocket/connection.go:191","message":"Failed to send ping message to ws://autoscaler.knative-serving.svc.cluster.local:8080","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f","error":"connection has not yet been established","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func3\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/websocket/connection.go:191"}
{"severity":"WARNING","timestamp":"2023-09-04T08:41:31.05744278Z","logger":"activator","caller":"handler/healthz_handler.go:36","message":"Healthcheck failed: connection has not yet been established","commit":"e82287d","knative.dev/controller":"activator","knative.dev/pod":"activator-768b674d7c-dzd6f"}
Trying to debug this, we also deployed the above charms in a non-airgapped environment and noticed that the pod has the same logs there too, but its container is able to become ready. Investigating this further, inside the airgapped env we noticed the following in the CoreDNS pod's logs:
```
[INFO] 10.1.205.153:40339 - 44253 "AAAA IN autoscaler.knative-serving.svc.cluster.local.lxd. udp 66 false 512" - - 0 2.000241772s
[INFO] 10.1.205.153:56166 - 44510 "A IN autoscaler.knative-serving.svc.cluster.local.lxd. udp 66 false 512" - - 0 2.000258023s
[ERROR] plugin/errors: 2 autoscaler.knative-serving.svc.cluster.local.lxd. AAAA: read udp 10.1.205.163:34994->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 autoscaler.knative-serving.svc.cluster.local.lxd. A: read udp 10.1.205.163:34020->8.8.4.4:53: i/o timeout
```
Looking at the above, we are starting to believe that this has to do with the way our airgapped environment is set up (more info about the environment here: https://github.com/canonical/bundle-kubeflow/pull/682):

- Kubernetes pods have the `ndots:5` setting in their `/etc/resolv.conf`, meaning that for query names with at least 5 dots in them, the resolver will ignore its `search` list and will try to resolve the name as an absolute domain name. This is probably the reason the above address is being forwarded to `8.8.8.8` or `8.8.4.4` (see the `resolv.conf` sketch below).
- We `exec`ed into a pod and tried to hit `autoscaler.knative-serving.svc.cluster.local.lxd(:8080)` and noticed that, although the request fails, it takes a few seconds before we get a response. In the non-airgapped environment, we get this response right away.
- The `activator` deployment has a `TimeoutThreshold` of 1 second. We tried to manipulate this, but we believe it could be the deployment's Go code that breaks the deployment.

From the above, we are led to believe that the slow response to the request towards `8.8.x.x` results in a timeout that blocks the container from going to `READY`.
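For reference, here is a sketch of what a pod's `/etc/resolv.conf` could look like in such a setup; the nameserver IP and the `lxd` search domain are assumptions inferred from the CoreDNS logs above:

```
# Hypothetical pod resolv.conf in a MicroK8s-on-LXD cluster. Names with
# fewer than 5 dots are expanded through the search list (which can append
# the host's "lxd" domain, yielding the ...cluster.local.lxd. query seen in
# CoreDNS); names with 5+ dots are resolved as absolute domain names.
search knative-serving.svc.cluster.local svc.cluster.local cluster.local lxd
nameserver 10.152.183.10
options ndots:5
```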
Configure the airgapped environment to immediately reject requests going outside the cluster.
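One possible way to get the "immediately reject" behaviour (a sketch, assuming the airgapped host routes egress through an interface named `eth0`): an iptables `REJECT` rule answers right away with an ICMP error, whereas `DROP` would leave clients hanging until the very timeouts described above.

```sh
# Hypothetical rule on the airgapped host: REJECT (not DROP) makes
# outbound connections fail fast instead of waiting for a timeout.
sudo iptables -A FORWARD -o eth0 -j REJECT --reject-with icmp-port-unreachable
```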
Right now we allow users to configure the following images for Knative Serving/Eventing:
https://github.com/canonical/knative-operators/blob/5caa2db37c0a7366d45a7b6aaa6add9946eb04d7/charms/knative-serving/config.yaml#L25-L34
https://github.com/canonical/knative-operators/blob/5caa2db37c0a7366d45a7b6aaa6add9946eb04d7/charms/knative-eventing/config.yaml#L11-L15
But once I run `microk8s ctr images ls`, I see a number of relevant Knative images. From that list of images reported in MicroK8s, it seems a couple of images are not part of the Serving CR. We'll have to make sure those can be configured as well.