canonical / admission-webhook-operator

Admission Webhook Operator
Apache License 2.0

Admission-webhook failed to start service Error: open /etc/webhook/certs/cert.pem: no such file or directory #126

Closed alekseytivonchik closed 6 months ago

alekseytivonchik commented 6 months ago

Bug Description

When creating a Notebook, I get this error: statefulset/tst-note-t: create Pod tst-note-t-0 in StatefulSet tst-note-t failed error: Internal error occurred: failed calling webhook "admission-webhook.kubeflow.org": failed to call webhook: Post "https://admission-webhook.kubeflow.svc:4443/apply-poddefault?timeout=10s": dial tcp 10.233.59.161:4443: connect: connection refused
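The "connection refused" part of the error suggests that nothing was listening on port 4443 of the webhook service when the API server called it. A minimal sketch of a reachability check is below; the helper name `port_accepts_connections` is hypothetical, and the check is only meaningful from a place that can resolve and route to the service (for example, a pod inside the cluster).

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and DNS failures are all OSError subclasses.
        return False
```

Calling `port_accepts_connections("admission-webhook.kubeflow.svc", 4443)` with the host and port taken from the error message would be expected to return `False` while the webhook service is down.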

Screenshot 2024-02-13 at 15 28 37

To Reproduce

snap install juju
juju bootstrap my-k8s
juju add-model kubeflow
juju deploy kubeflow --trust

Environment

K8s: v1.28.6
OS: Ubuntu 22.04
Juju: 3.3.1-genericlinux-amd64

admission-webhook: 1.8/stable
argo-controller: 3.3.10/stable
dex-auth: 2.36/stable
envoy: 2.0/stable
istio-ingressgateway: 1.17/stable
istio-pilot: 1.17/stable
jupyter-controller: 1.8/stable
jupyter-ui: 1.8/stable
katib-controller: 0.16/stable
katib-db: 8.0/stable
katib-db-manager: 0.16/stable
katib-ui: 0.16/stable
kfp-api: 2.0/stable
kfp-db: 8.0/stable
kfp-metadata-writer: 2.0/stable
kfp-persistence: 2.0/stable
kfp-profile-controller: 2.0/stable
kfp-schedwf: 2.0/stable
kfp-ui: 2.0/stable
kfp-viewer: 2.0/stable
kfp-viz: 2.0/stable
knative-eventing: 1.10/stable
knative-operator: 1.10/stable
knative-serving: 1.10/stable
kserve-controller: 0.11/stable
kubeflow-dashboard: 1.8/stable
kubeflow-profiles: 1.8/stable
kubeflow-roles: 1.8/stable
kubeflow-volumes: 1.8/stable
metacontroller-operator: 3.0/stable
minio: ckf-1.8/stable
mlmd: 1.14/stable
oidc-gatekeeper: ckf-1.8/stable
pvcviewer-operator: 1.8/stable
seldon-controller-manager: 1.17/stable
tensorboard-controller: 1.8/stable
tensorboards-web-app: 1.8/stable
training-operator: 1.7/stable

Relevant Log Output

Unit                          Workload     Agent  Address         Ports          Message
admission-webhook/0*          maintenance  idle   10.233.118.141                 Workload failed health check

$ kubectl logs admission-webhook-0 -n kubeflow
<info messages...>
2024-02-13T11:53:36.729Z [container-agent] 2024-02-13 11:53:36 ERROR juju-log Traceback (most recent call last):
2024-02-13T11:53:36.729Z [container-agent]   File "/var/lib/juju/agents/unit-admission-webhook-0/charm/venv/charmed_kubeflow_chisme/pebble/_update_layer.py", line 31, in update_layer
2024-02-13T11:53:36.729Z [container-agent]     container.replan()
2024-02-13T11:53:36.729Z [container-agent]   File "/var/lib/juju/agents/unit-admission-webhook-0/charm/venv/ops/model.py", line 1915, in replan
2024-02-13T11:53:36.729Z [container-agent]     self._pebble.replan_services()
2024-02-13T11:53:36.729Z [container-agent]   File "/var/lib/juju/agents/unit-admission-webhook-0/charm/venv/ops/pebble.py", line 1680, in replan_services
2024-02-13T11:53:36.729Z [container-agent]     return self._services_action('replan', [], timeout, delay)
2024-02-13T11:53:36.729Z [container-agent]   File "/var/lib/juju/agents/unit-admission-webhook-0/charm/venv/ops/pebble.py", line 1761, in _services_action
2024-02-13T11:53:36.729Z [container-agent]     raise ChangeError(change.err, change)
2024-02-13T11:53:36.729Z [container-agent] ops.pebble.ChangeError: cannot perform the following tasks:
2024-02-13T11:53:36.729Z [container-agent] - Start service "admission-webhook" (cannot start service: exited quickly with code 255)
2024-02-13T11:53:36.729Z [container-agent] ----- Logs from task 0 -----
2024-02-13T11:53:36.729Z [container-agent] 2024-02-13T11:53:36Z INFO Most recent service output:
2024-02-13T11:53:36.729Z [container-agent]     F0213 11:53:36.706006      14 config.go:46] config=main.Config{CertFile:"/etc/webhook/certs/cert.pem", KeyFile:"/etc/webhook/certs/key.pem"} Error: open /etc/webhook/certs/cert.pem: no such file or directory
2024-02-13T11:53:36.729Z [container-agent] 2024-02-13T11:53:36Z ERROR cannot start service: exited quickly with code 255
<...info messages>
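The root cause sits in the "Most recent service output" line of the traceback rather than in the Python frames: the workload exited because it could not open its certificate file. As a rough sketch, a captured log can be scanned for that pattern. The helper `missing_files` is hypothetical, and the regular expression only assumes the `open <path>: no such file or directory` wording seen in the log above.

```python
import re

# Matches errors of the form "open <path>: no such file or directory".
MISSING_FILE = re.compile(r"open (\S+): no such file or directory")


def missing_files(log_text: str) -> list[str]:
    """Return the distinct paths reported as missing, in order of first appearance."""
    seen: list[str] = []
    for path in MISSING_FILE.findall(log_text):
        if path not in seen:
            seen.append(path)
    return seen
```

Run against the output of `kubectl logs admission-webhook-0 -n kubeflow`, this would list `/etc/webhook/certs/cert.pem` for the log shown here.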

Additional Context

No response

syncronize-issues-to-jira[bot] commented 6 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5340.

This message was autogenerated

natalytvinova commented 6 months ago

Hitting the same issue

Barteus commented 6 months ago

I am hitting the same in the customer's environment.

Kubernetes: Charmed Kubeflow (1.29 charms, 1.28 API)
Juju: 3.1.7-genericlinux-amd64
Kubeflow bundle: https://pastebin.canonical.com/p/d82h7ygYQr/

Here is my debug-log for webhook: https://pastebin.canonical.com/p/RSpgx2yD6b/

Pod:

$ kubectl get po admission-webhook-0 -n kubeflow
NAME                  READY   STATUS    RESTARTS   AGE
admission-webhook-0   2/2     Running   0          21h

NohaIhab commented 6 months ago

Can you confirm whether the juju debug-log contains this WARN message: Cannot upload certificates to container, deferring

Barteus commented 6 months ago

I do not see it in my juju debug-log, even when looking at the output for all juju applications.

The SSL cert in my environment is provided and terminated via F5.

NohaIhab commented 6 months ago

Can you also provide:

Barteus commented 6 months ago

Admission-webhook logs: https://pastebin.canonical.com/p/JyHD6Mtctp/
Jupyter-controller logs: https://pastebin.canonical.com/p/FBMRGbptyz/
Services in kubeflow ns:

$ kubectl get po -n kubeflow
NAME                                             READY   STATUS    RESTARTS        AGE
admission-webhook-0                              2/2     Running   0               4h
argo-controller-0                                2/2     Running   0               25h
dex-auth-0                                       2/2     Running   0               25h
envoy-5fbb9ccbf9-7xhhc                           1/1     Running   0               25h
envoy-operator-0                                 1/1     Running   0               25h
grafana-agent-k8s-0                              2/2     Running   0               25h
istio-ingressgateway-0                           1/1     Running   0               25h
istio-ingressgateway-workload-696584c7c5-2tv9s   1/1     Running   0               25h
istio-pilot-0                                    1/1     Running   0               25h
istiod-7f696dd599-ptkrn                          1/1     Running   0               25h
jupyter-controller-0                             2/2     Running   0               25h
jupyter-ui-0                                     2/2     Running   0               25h
katib-controller-5cc6d58bfd-6k7tw                1/1     Running   0               25h
katib-controller-operator-0                      1/1     Running   0               25h
katib-db-0                                       2/2     Running   0               25h
katib-db-manager-0                               2/2     Running   0               25h
katib-ui-0                                       2/2     Running   3 (23h ago)     25h
kfp-api-0                                        2/2     Running   2 (24h ago)     25h
kfp-db-0                                         2/2     Running   0               25h
kfp-metadata-writer-0                            2/2     Running   0               25h
kfp-persistence-0                                2/2     Running   0               25h
kfp-profile-controller-0                         2/2     Running   0               25h
kfp-schedwf-0                                    2/2     Running   0               25h
kfp-ui-0                                         2/2     Running   1 (24h ago)     25h
kfp-viewer-0                                     2/2     Running   0               25h
kfp-viz-0                                        2/2     Running   2 (2d21h ago)   2d22h
knative-eventing-0                               1/1     Running   0               25h
knative-operator-0                               3/3     Running   0               25h
knative-serving-0                                1/1     Running   1 (24h ago)     25h
kserve-controller-0                              3/3     Running   1 (2d21h ago)   2d22h
kubeflow-dashboard-0                             2/2     Running   0               25h
kubeflow-profiles-0                              3/3     Running   7 (26h ago)     2d22h
kubeflow-roles-0                                 1/1     Running   0               25h
kubeflow-volumes-84f58746bf-rxr68                1/1     Running   0               25h
kubeflow-volumes-operator-0                      1/1     Running   0               25h
metacontroller-operator-0                        1/1     Running   0               25h
metacontroller-operator-charm-0                  1/1     Running   0               25h
minio-0                                          1/1     Running   0               2d22h
minio-operator-0                                 1/1     Running   0               25h
mlflow-mysql-0                                   2/2     Running   0               2d22h
mlflow-server-0                                  3/3     Running   0               25h
mlmd-0                                           1/1     Running   0               2d22h
mlmd-operator-0                                  1/1     Running   0               25h
modeloperator-56c9578cfb-nbrq5                   1/1     Running   0               25h
oidc-gatekeeper-0                                2/2     Running   0               2d22h
pvcviewer-operator-0                             3/3     Running   0               25h
resource-dispatcher-0                            2/2     Running   0               25h
seldon-controller-manager-0                      2/2     Running   4 (2d21h ago)   2d22h
tensorboard-controller-0                         2/2     Running   0               25h
tensorboards-web-app-0                           2/2     Running   0               25h
training-operator-0                              2/2     Running   0               25h

NohaIhab commented 6 months ago

@Barteus are you able to get the juju debug log for admission webhook from the very beginning? I need to see which hook failed to execute. The debug logs you've attached are not from when the charm was deployed. In a log that covers the deployment, you should see: juju.worker.uniter unit "admission-webhook/0" started
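One quick way to tell whether a captured debug log goes back far enough is to search it for that startup marker. The sketch below assumes the marker text appears on a single line, as quoted above; the function name `log_covers_unit_start` is hypothetical.

```python
def log_covers_unit_start(log_text: str, unit: str) -> bool:
    """Return True if the log contains the uniter 'started' line for the given unit."""
    # Marker wording is taken from the comment above; the rest of the line
    # (timestamp, log level) is not assumed.
    marker = f'unit "{unit}" started'
    return any(
        "juju.worker.uniter" in line and marker in line
        for line in log_text.splitlines()
    )
```

If `log_covers_unit_start(text, "admission-webhook/0")` returns `False`, the capture most likely begins after the charm was deployed.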

Barteus commented 6 months ago

The status of the charm is:

$  juju status admission-webhook
Model     Controller        Cloud/Region  Version  SLA          Timestamp
kubeflow  foundations-maas  ck8s/default  3.1.7    unsupported  14:12:33Z

SAAS        Status  Store             URL
grafana     active  foundations-maas  admin/cos.grafana
loki        active  foundations-maas  admin/cos.loki
prometheus  active  foundations-maas  admin/cos.prometheus

App                Version  Status   Scale  Charm              Channel     Rev  Address        Exposed  Message
admission-webhook           waiting      1  admission-webhook  1.8/stable  275  172.30.181.94  no       waiting for units to settle down

Unit                  Workload     Agent  Address         Ports  Message
admission-webhook/0*  maintenance  idle   10.128.236.170         Workload failed health check

To me, it looks like all hooks went ok, and the issue occurred later.

Barteus commented 6 months ago

Workaround to unstick the admission-webhook:

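# Open a shell on the admission-webhook unit
juju ssh admission-webhook/0

bash

# Point the pebble client at the workload container's socket
export CONTAINER_NAME=admission-webhook
export PEBBLE_SOCKET=/charm/containers/$CONTAINER_NAME/pebble.socket

alias pebble=/charm/bin/pebble

# Optionally inspect the plan, then list the services and their state
# pebble plan
pebble services

# Restart the service that failed to start
pebble restart admission-webhook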

After a few minutes, all notebooks are running!

Big thanks to @NohaIhab and @ca-scribner for help!

syncronize-issues-to-jira[bot] commented 6 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5392.

This message was autogenerated

NohaIhab commented 6 months ago

The fix (#125) is now merged and promoted to 1.8/stable. You can now refresh your charms; for reference, the new revision is 301. cc: @natalytvinova @Barteus @alekseytivonchik