emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

Api ext pod errors causing restart #5436

Closed BChancusi closed 11 months ago

BChancusi commented 11 months ago

Describe the bug

Applying Emissary as yaml and waiting on the deployment fails.

deployment.apps/emissary-apiext condition met
Error from server: error when creating "STDIN": conversion webhook for getambassador.io/v2, Kind=Module failed: Post "https://emissary-apiext.emissary-system.svc:443/webhooks/crd-convert?timeout=30s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.46:8443, code 502: 502 Bad Gateway

This is caused by the apiext pod restarting after it has already been flagged as ready and has passed the wait check.

time="2023-11-16 13:12:46.4818" level=info msg="Emissary Ingress apiext (version \"3.9.0\")" func=github.com/emissary-ingress/emissary/v3/cmd/apiext.Main file="/go/cmd/apiext/main.go:16" CMD=apiext PID=1
time="2023-11-16 13:12:46.4827" level=info msg="APIEXT_LOGLEVEL=info" func="github.com/emissary-ingress/emissary/v3/pkg/apiext.(WebhookServer).Run" file="/go/pkg/apiext/server.go:89" CMD=apiext PID=1
time="2023-11-16 13:12:47.1958" level=info msg="Serving HTTPS on port 8443" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.ServeHTTPS file="/go/pkg/apiext/internal/serve.go:88" CMD=apiext PID=1 THREAD=/serve-https
time="2023-11-16 13:12:47.2323" level=info msg="Configuring conversion for \"authservices.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:47.2439" level=info msg="Configuring conversion for \"consulresolvers.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:47.2524" level=info msg="Configuring conversion for \"devportals.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:47.2636" level=info msg="Configuring conversion for \"hosts.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:47.2789" level=info msg="Configuring conversion for \"kubernetesendpointresolvers.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:47.3973" level=info msg="Configuring conversion for \"kubernetesserviceresolvers.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:47.7970" level=info msg="Configuring conversion for \"logservices.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:48.1975" level=info msg="Configuring conversion for \"mappings.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:48.5978" level=info msg="Configuring conversion for \"modules.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:48.9970" level=info msg="Configuring conversion for \"ratelimitservices.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:49.3972" level=info msg="Configuring conversion for \"tcpmappings.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:49.7974" level=info msg="Configuring conversion for \"tlscontexts.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:50.1974" level=info msg="Configuring conversion for \"tracingservices.getambassador.io\"" func=github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.updateCRD file="/go/pkg/apiext/internal/inject.go:137" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:50.2629" level=info msg="GenServerCert(ctx, \"emissary-apiext.emissary-system.svc\") => generating new cert" func="github.com/emissary-ingress/emissary/v3/pkg/apiext/internal.(CA).GenServerCert" file="/go/pkg/apiext/internal/ca.go:221" CMD=apiext PID=1 THREAD=/serve-https
time="2023-11-16 13:12:50.5982" level=error msg="goroutine \"/configure-crds\" exited with error: 13 errors:\n 1. customresourcedefinitions.apiextensions.k8s.io \"authservices.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 2. customresourcedefinitions.apiextensions.k8s.io \"consulresolvers.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 3. customresourcedefinitions.apiextensions.k8s.io \"devportals.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 4. customresourcedefinitions.apiextensions.k8s.io \"hosts.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 5. customresourcedefinitions.apiextensions.k8s.io \"kubernetesendpointresolvers.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 6. customresourcedefinitions.apiextensions.k8s.io \"kubernetesserviceresolvers.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 7. customresourcedefinitions.apiextensions.k8s.io \"logservices.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 8. customresourcedefinitions.apiextensions.k8s.io \"mappings.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 9. customresourcedefinitions.apiextensions.k8s.io \"modules.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 10. customresourcedefinitions.apiextensions.k8s.io \"ratelimitservices.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 11. customresourcedefinitions.apiextensions.k8s.io \"tcpmappings.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 12. customresourcedefinitions.apiextensions.k8s.io \"tlscontexts.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 13. customresourcedefinitions.apiextensions.k8s.io \"tracingservices.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope" func="github.com/datawire/dlib/dgroup.(Group).goWorkerCtx.func1.1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:380" CMD=apiext PID=1 THREAD=/configure-crds
time="2023-11-16 13:12:50.5984" level=info msg="shutting down (gracefully)..." func="github.com/datawire/dlib/dgroup.(Group).launchSupervisors.func1" file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:238" CMD=apiext PID=1 THREAD=":shutdown_logger"
time="2023-11-16 13:12:52.7344" level=info msg=" final goroutine statuses:" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:84" CMD=apiext PID=1 THREAD=":shutdown_status"
time="2023-11-16 13:12:52.7345" level=info msg=" /configure-crds: exited with error" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=apiext PID=1 THREAD=":shutdown_status"
time="2023-11-16 13:12:52.7345" level=info msg=" /serve-http : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=apiext PID=1 THREAD=":shutdown_status"
time="2023-11-16 13:12:52.7345" level=info msg=" /serve-https : exited" func=github.com/datawire/dlib/dgroup.logGoroutineStatuses file="/go/vendor/github.com/datawire/dlib/dgroup/group.go:95" CMD=apiext PID=1 THREAD=":shutdown_status"
time="2023-11-16 13:12:52.7347" level=error msg="shut down with error error: 13 errors:\n 1. customresourcedefinitions.apiextensions.k8s.io \"authservices.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 2. customresourcedefinitions.apiextensions.k8s.io \"consulresolvers.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 3. customresourcedefinitions.apiextensions.k8s.io \"devportals.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 4. customresourcedefinitions.apiextensions.k8s.io \"hosts.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 5. customresourcedefinitions.apiextensions.k8s.io \"kubernetesendpointresolvers.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 6. customresourcedefinitions.apiextensions.k8s.io \"kubernetesserviceresolvers.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 7. customresourcedefinitions.apiextensions.k8s.io \"logservices.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 8. customresourcedefinitions.apiextensions.k8s.io \"mappings.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 9. customresourcedefinitions.apiextensions.k8s.io \"modules.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 10. customresourcedefinitions.apiextensions.k8s.io \"ratelimitservices.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 11. customresourcedefinitions.apiextensions.k8s.io \"tcpmappings.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 12. customresourcedefinitions.apiextensions.k8s.io \"tlscontexts.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope\n 13. customresourcedefinitions.apiextensions.k8s.io \"tracingservices.getambassador.io\" is forbidden: User \"system:serviceaccount:emissary-system:emissary-apiext\" cannot update resource \"customresourcedefinitions/status\" in API group \"apiextensions.k8s.io\" at the cluster scope" func=github.com/emissary-ingress/emissary/v3/pkg/busy.Main file="/go/pkg/busy/busy.go:87" CMD=apiext PID=1

After the restart the pod runs normally.
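The log above repeats the same denial for each of the 13 CRDs, so it can help to collapse it into the distinct permissions being refused. The following is a minimal sketch, assuming the log is available as text; the function name `missing_permissions` and the regular expression are illustrative and not part of Emissary.

```python
import re

# Matches the Kubernetes "forbidden" message as it appears in the apiext log.
# The optional backslashes cover quotes that are escaped inside msg="...".
FORBIDDEN_RE = re.compile(
    r'(?P<kind>[\w.]+) \\?"(?P<name>[\w.-]+)\\?" is forbidden: '
    r'User \\?"(?P<user>[^"\\]+)\\?" cannot (?P<verb>\w+) '
    r'resource \\?"(?P<resource>[^"\\]+)\\?" '
    r'in API group \\?"(?P<group>[^"\\]*)\\?"'
)


def missing_permissions(log_text):
    """Return the distinct (user, verb, resource, api_group) tuples denied in log_text."""
    return {
        (m["user"], m["verb"], m["resource"], m["group"])
        for m in FORBIDDEN_RE.finditer(log_text)
    }
```

Run against the log in this issue, every denial should reduce to the same tuple: the `emissary-apiext` service account being refused `update` on `customresourcedefinitions/status` in `apiextensions.k8s.io`.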

To Reproduce

Steps to reproduce the behavior:

  1. Apply the CRD YAML (which includes the apiext deployment), wait for the precondition, and then apply the main Emissary YAML directly after the wait, for example from a script.

Expected behavior

The apiext pod should not restart, or its ready status should be hardened so that subsequent wait checks are valid.

Versions (please complete the following information):

Additional context

This only occurs on the latest release; the previous release, 3.8.2, works correctly. We are currently working around it by adding a sleep after the wait, to ensure the pod has had time to restart.
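A fixed sleep can be replaced by a wait that only succeeds once readiness has held for a while, so that a restart shortly after the first "ready" resets the clock. The following is a minimal sketch of that idea. `wait_until_stable` and its parameters are hypothetical, and `is_ready` is a caller-supplied probe (for example, one that checks the deployment's ready replicas and the pod's restart count); none of this is part of Emissary or kubectl.

```python
import time


def wait_until_stable(is_ready, stable_for=10.0, timeout=120.0, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Return True once is_ready() has held continuously for stable_for seconds.

    Returns False if that does not happen before timeout seconds have elapsed.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    ready_since = None
    while True:
        now = clock()
        if is_ready():
            if ready_since is None:
                ready_since = now  # start of the current stability window
            if now - ready_since >= stable_for:
                return True
        else:
            ready_since = None  # a restart resets the stability window
        if now >= deadline:
            return False
        sleep(interval)
```

The design choice here is to treat a single "ready" observation as insufficient evidence, which is the failure mode described in this issue: the pod passes the wait check and then restarts.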

bmariesan commented 11 months ago

We're also experiencing the same issues on 3.9.1

cindymullins-dw commented 11 months ago

Can you tell us more about your environment? We're trying to replicate but are not seeing this behavior.

CheyiLin commented 11 months ago

Same issue here on 3.9.1 with EKS 1.27 while trying to rotate the CA by deleting the old CA secret and restarting the apiext deployment.

The apiext deployment keeps restarting again and again until we add the missing permission in #5449 to the cluster role.
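To illustrate why the permission matters: in Kubernetes RBAC, access to a subresource such as `customresourcedefinitions/status` has to be granted explicitly and is not implied by access to `customresourcedefinitions`. The sketch below is a simplified model of rule matching (exact names and the `*` wildcard only). The `before` and `after` rule lists are hypothetical and are not copied from the Emissary manifests or from #5449; only the denied verb, resource, and API group are taken from the log in this issue.

```python
def rule_allows(rule, api_group, resource, verb):
    """Return True if a single ClusterRole-style rule permits the request.

    Simplified model: only exact matches and the "*" wildcard are handled.
    """
    def matches(values, wanted):
        return "*" in values or wanted in values

    return (matches(rule.get("apiGroups", []), api_group)
            and matches(rule.get("resources", []), resource)
            and matches(rule.get("verbs", []), verb))


def role_allows(rules, api_group, resource, verb):
    """Return True if any rule in the list permits the request."""
    return any(rule_allows(rule, api_group, resource, verb) for rule in rules)


# Hypothetical rules for illustration only.
before = [{
    "apiGroups": ["apiextensions.k8s.io"],
    "resources": ["customresourcedefinitions"],
    "verbs": ["list", "watch", "update"],
}]
after = before + [{
    "apiGroups": ["apiextensions.k8s.io"],
    "resources": ["customresourcedefinitions/status"],
    "verbs": ["update"],
}]

# The request that the log shows being denied.
denied = ("apiextensions.k8s.io", "customresourcedefinitions/status", "update")
print(role_allows(before, *denied), role_allows(after, *denied))
```

Under this model the request is refused with the `before` rules and permitted once a rule naming the `/status` subresource is added.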

bmariesan commented 11 months ago

@cindymullins-dw

Deployed using ArgoCD in EKS on a mix of spot and on-demand instances. We clean up everything, including the certificate, and do a fresh install. After a couple of hours (as pods start shifting around), it randomly starts crashing, and any new Emissary instances fail to get past the apiext init container.

bmariesan commented 11 months ago

> Same issue here on 3.9.1 with EKS 1.27 while trying to rotate the CA by deleting the old CA secret and restarting the apiext deployment.
>
> The apiext deployment keeps restarting again and again until we add the missing permission in #5449 to the cluster role.

I looked at the logs, and indeed it seems we've also had occurrences of this error but didn't pay much attention to it. I will update the RBAC tomorrow to see if that fixes it for us. Thanks!

BChancusi commented 11 months ago

The bug is quite insidious: the first run of the pod errors out, which restarts the pod, but the second run seemingly does not hit the failing step again, because the configuration is already in place and is skipped, which allows the pod to run. #5449 adds the missing permission, and I have confirmed that it fixes the issue. However, all PRs currently have build failures due to an unrelated docker login failure.

CheyiLin commented 11 months ago

> the first run of the pod errors out, which restarts the pod, but the second run seemingly does not hit the failing step again, because the configuration is already in place and is skipped, which allows the pod to run.

@BChancusi Yeah, that is exactly the situation we hit while trying to rotate the CA (deleting the CA secret and restarting the apiext deployment) multiple times.