canonical / kserve-operators

Charmed KServe

kserve-controller in error with hook failed: "ingress-gateway-relation-changed" #242

Closed dparv closed 4 months ago

dparv commented 5 months ago

Bug Description

Fresh deployment:

kserve-controller                                   waiting      1  kserve-controller        0.11/stable      523  10.0.27.132   no       installing agent
...
kserve-controller/0*          error     idle   10.244.0.35                 hook failed: "ingress-gateway-relation-changed"

All the rest of the charms are active/idle:

Model     Controller  Cloud/Region      Version  SLA          Timestamp
kubeflow  manual      k8s-cloud/eastus  3.4.3    unsupported  10:23:04Z

SAAS        Status  Store  URL
grafana     active  local  admin/cos.grafana
prometheus  active  local  admin/cos.prometheus

App                        Version                  Status   Scale  Charm                    Channel          Rev  Address       Exposed  Message
admission-webhook                                   active       1  admission-webhook        1.8/stable       301  10.0.210.198  no       
argo-controller                                     active       1  argo-controller          3.3.10/stable    424  10.0.122.226  no       
dex-auth                                            active       1  dex-auth                 2.36/stable      422  10.0.24.104   no       
envoy                      res:oci-image@cc06b3e    active       1  envoy                    2.0/stable       194  10.0.154.19   no       
grafana-agent-k8s          0.40.4                   active       1  grafana-agent-k8s        latest/edge       80  10.0.152.106  no       logging-consumer: off
istio-ingressgateway                                active       1  istio-gateway            1.17/stable     1000  10.0.153.130  no       
istio-pilot                                         active       1  istio-pilot              1.17/stable      965  10.0.18.244   no       
jupyter-controller                                  active       1  jupyter-controller       1.8/stable       849  10.0.170.164  no       
jupyter-ui                                          active       1  jupyter-ui               1.8/stable       858  10.0.208.36   no       
katib-controller           res:oci-image@b6a6100    active       1  katib-controller         0.16/stable      446  10.0.236.113  no       
katib-db                   8.0.35-0ubuntu0.22.04.1  active       1  mysql-k8s                8.0/stable       127  10.0.212.122  no       
katib-db-manager                                    active       1  katib-db-manager         0.16/stable      411  10.0.244.22   no       
katib-ui                                            active       1  katib-ui                 0.16/stable      422  10.0.108.112  no       
kfp-api                                             active       1  kfp-api                  2.0/stable      1283  10.0.249.254  no       
kfp-db                     8.0.35-0ubuntu0.22.04.1  active       1  mysql-k8s                8.0/stable       127  10.0.60.140   no       
kfp-metadata-writer                                 active       1  kfp-metadata-writer      2.0/stable       334  10.0.186.249  no       
kfp-persistence                                     active       1  kfp-persistence          2.0/stable      1291  10.0.188.121  no       
kfp-profile-controller                              active       1  kfp-profile-controller   2.0/stable      1315  10.0.57.98    no       
kfp-schedwf                                         active       1  kfp-schedwf              2.0/stable      1302  10.0.219.73   no       
kfp-ui                                              active       1  kfp-ui                   2.0/stable      1285  10.0.198.243  no       
kfp-viewer                                          active       1  kfp-viewer               2.0/stable      1317  10.0.81.137   no       
kfp-viz                                             active       1  kfp-viz                  2.0/stable      1235  10.0.136.89   no       
knative-eventing                                    active       1  knative-eventing         1.10/stable      353  10.0.164.25   no       
knative-operator                                    active       1  knative-operator         1.10/stable      328  10.0.127.109  no       
knative-serving                                     active       1  knative-serving          1.10/stable      409  10.0.210.189  no       
kserve-controller                                   waiting      1  kserve-controller        0.11/stable      523  10.0.27.132   no       installing agent
kubeflow-dashboard                                  active       1  kubeflow-dashboard       1.8/stable       454  10.0.76.23    no       
kubeflow-profiles                                   active       1  kubeflow-profiles        1.8/stable       355  10.0.245.245  no       
kubeflow-roles                                      active       1  kubeflow-roles           1.8/stable       187  10.0.220.212  no       
kubeflow-volumes           res:oci-image@2261827    active       1  kubeflow-volumes         1.8/stable       260  10.0.165.37   no       
metacontroller-operator                             active       1  metacontroller-operator  3.0/stable       252  10.0.107.174  no       
minio                      res:oci-image@1755999    active       1  minio                    ckf-1.8/stable   278  10.0.130.152  no       
mlmd                       res:oci-image@44abc5d    active       1  mlmd                     1.14/stable      127  10.0.101.179  no       
oidc-gatekeeper                                     active       1  oidc-gatekeeper          ckf-1.8/stable   350  10.0.1.239    no       
pvcviewer-operator                                  active       1  pvcviewer-operator       1.8/stable        30  10.0.219.103  no       
seldon-controller-manager                           active       1  seldon-core              1.17/stable      664  10.0.213.200  no       
tensorboard-controller                              active       1  tensorboard-controller   1.8/stable       257  10.0.83.123   no       
tensorboards-web-app                                active       1  tensorboards-web-app     1.8/stable       245  10.0.99.136   no       
training-operator                                   active       1  training-operator        1.7/stable       347  10.0.65.139   no       

Unit                          Workload  Agent  Address      Ports          Message
admission-webhook/0*          active    idle   10.244.1.23                 
argo-controller/0*            active    idle   10.244.1.16                 
dex-auth/0*                   active    idle   10.244.1.13                 
envoy/0*                      active    idle   10.244.1.30  9090,9901/TCP  
grafana-agent-k8s/0*          active    idle   10.244.0.17                 logging-consumer: off
istio-ingressgateway/0*       active    idle   10.244.1.8                  
istio-pilot/0*                active    idle   10.244.2.10                 
jupyter-controller/0*         active    idle   10.244.2.11                 
jupyter-ui/0*                 active    idle   10.244.2.9                  
katib-controller/0*           active    idle   10.244.1.28  443,8080/TCP   
katib-db-manager/0*           active    idle   10.244.1.15                 
katib-db/0*                   active    idle   10.244.0.33                 Primary
katib-ui/0*                   active    idle   10.244.1.25                 
kfp-api/0*                    active    idle   10.244.0.21                 
kfp-db/0*                     active    idle   10.244.0.16                 Primary
kfp-metadata-writer/0*        active    idle   10.244.1.26                 
kfp-persistence/0*            active    idle   10.244.0.15                 
kfp-profile-controller/0*     active    idle   10.244.0.23                 
kfp-schedwf/0*                active    idle   10.244.0.19                 
kfp-ui/0*                     active    idle   10.244.1.14                 
kfp-viewer/0*                 active    idle   10.244.2.15                 
kfp-viz/0*                    active    idle   10.244.0.31                 
knative-eventing/0*           active    idle   10.244.1.9                  
knative-operator/0*           active    idle   10.244.1.11                 
knative-serving/0*            active    idle   10.244.1.7                  
kserve-controller/0*          error     idle   10.244.0.35                 hook failed: "ingress-gateway-relation-changed"
kubeflow-dashboard/0*         active    idle   10.244.0.32                 
kubeflow-profiles/0*          active    idle   10.244.1.27                 
kubeflow-roles/0*             active    idle   10.244.2.4                  
kubeflow-volumes/0*           active    idle   10.244.0.37  5000/TCP       
metacontroller-operator/0*    active    idle   10.244.0.18                 
minio/0*                      active    idle   10.244.2.24  9000-9001/TCP  
mlmd/0*                       active    idle   10.244.2.25  8080/TCP       
oidc-gatekeeper/0*            active    idle   10.244.2.21                 
pvcviewer-operator/0*         active    idle   10.244.0.34                 
seldon-controller-manager/0*  active    idle   10.244.0.20                 
tensorboard-controller/0*     active    idle   10.244.1.10                 
tensorboards-web-app/0*       active    idle   10.244.1.24                 
training-operator/0*          active    idle   10.244.0.28                 

To Reproduce

juju deploy kubeflow --channel 1.8/stable

Environment

juju 3.4.3 Azure AKS 1.28.9

Relevant Log Output

unit-kserve-controller-0: 10:15:51 ERROR unit.kserve-controller/0.juju-log ingress-gateway:45: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 702, in <module>
    main(KServeControllerCharm)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 342, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 839, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 928, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 538, in _on_ingress_gateway_relation_changed
    self._on_install(event)
  File "./src/charm.py", line 495, in _on_install
    self._restart_controller_service()
  File "./src/charm.py", line 694, in _restart_controller_service
    self.controller_container.restart(self._controller_container_name)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/model.py", line 2045, in restart
    self._pebble.restart_services(service_names)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/pebble.py", line 1746, in restart_services
    return self._services_action('restart', services, timeout, delay)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/pebble.py", line 1767, in _services_action
    raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "kserve-controller" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2024-06-13T10:15:51Z INFO Service "kserve-controller" has never been started.
----- Logs from task 1 -----
2024-06-13T10:15:51Z INFO Most recent service output:
    {"level":"info","ts":"2024-06-13T10:15:51Z","logger":"entrypoint","msg":"Setting up client for manager"}
    {"level":"info","ts":"2024-06-13T10:15:51Z","logger":"entrypoint","msg":"Setting up manager"}
    {"level":"info","ts":"2024-06-13T10:15:51Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
    {"level":"info","ts":"2024-06-13T10:15:51Z","logger":"entrypoint","msg":"Registering Components."}
    {"level":"info","ts":"2024-06-13T10:15:51Z","logger":"entrypoint","msg":"Setting up KServe v1alpha1 scheme"}
    {"level":"info","ts":"2024-06-13T10:15:51Z","logger":"entrypoint","msg":"Setting up KServe v1beta1 scheme"}
    {"level":"error","ts":"2024-06-13T10:15:51Z","logger":"entrypoint","msg":"unable to get deploy config.","error":"configmaps \"inferenceservice-config\" not found","stacktrace":"main.main\n\t/root/parts/controller/build/cmd/manager/main.go:141\nruntime.main\n\t/snap/go/current/src/runtime/proc.go:250"}
2024-06-13T10:15:51Z ERROR cannot start service: exited quickly with code 1


Additional Context

No response
syncronize-issues-to-jira[bot] commented 5 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5874.

This message was autogenerated

dparv commented 5 months ago

Wrong repo btw, this should be in https://github.com/canonical/kserve-operators/

syncronize-issues-to-jira[bot] commented 5 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5886.

This message was autogenerated

dparv commented 5 months ago

Another run today, with a different error:

unit-kserve-controller-0: 07:26:43 INFO unit.kserve-controller/0.juju-log local-gateway:40: HTTP Request: PATCH https://10.0.0.1/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/inferencegraph.serving.kserve.io?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"
unit-kserve-controller-0: 07:26:43 INFO unit.kserve-controller/0.juju-log local-gateway:40: HTTP Request: PATCH https://10.0.0.1/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/inferenceservice.serving.kserve.io?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"
unit-kserve-controller-0: 07:26:44 INFO unit.kserve-controller/0.juju-log local-gateway:40: HTTP Request: PATCH https://10.0.0.1/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/trainedmodel.serving.kserve.io?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"
unit-kserve-controller-0: 07:26:44 INFO unit.kserve-controller/0.juju-log local-gateway:40: Reconcile completed successfully
unit-kserve-controller-0: 07:26:44 INFO unit.kserve-controller/0.juju-log local-gateway:40: Rendering manifests
unit-kserve-controller-0: 07:26:44 INFO unit.kserve-controller/0.juju-log local-gateway:40: HTTP Request: PATCH https://10.0.0.1/api/v1/namespaces/kubeflow/configmaps/inferenceservice-config?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"
unit-kserve-controller-0: 07:26:44 INFO unit.kserve-controller/0.juju-log local-gateway:40: Reconcile completed successfully
unit-kserve-controller-0: 07:26:44 ERROR unit.kserve-controller/0.juju-log local-gateway:40: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 702, in <module>
    main(KServeControllerCharm)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 342, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 839, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 928, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 543, in _on_local_gateway_relation_changed
    self._on_install(event)
  File "./src/charm.py", line 495, in _on_install
    self._restart_controller_service()
  File "./src/charm.py", line 694, in _restart_controller_service
    self.controller_container.restart(self._controller_container_name)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/model.py", line 2045, in restart
    self._pebble.restart_services(service_names)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/pebble.py", line 1746, in restart_services
    return self._services_action('restart', services, timeout, delay)
  File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/pebble.py", line 1767, in _services_action
    raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "kserve-controller" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2024-06-24T07:26:44Z INFO Service "kserve-controller" has never been started.
----- Logs from task 1 -----
2024-06-24T07:26:44Z INFO Most recent service output:
    {"level":"info","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"Setting up client for manager"}
    {"level":"info","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"Setting up manager"}
    {"level":"info","ts":"2024-06-24T07:26:44Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
    {"level":"info","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"Registering Components."}
    {"level":"info","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"Setting up KServe v1alpha1 scheme"}
    {"level":"info","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"Setting up KServe v1beta1 scheme"}
    {"level":"error","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"unable to get deploy config.","error":"configmaps \"inferenceservice-config\" not found","stacktrace":"main.main\n\t/root/parts/controller/build/cmd/manager/main.go:141\nruntime.main\n\t/snap/go/current/src/runtime/proc.go:250"}
2024-06-24T07:26:44Z ERROR cannot start service: exited quickly with code 1
-----
unit-kserve-controller-0: 07:26:44 ERROR juju.worker.uniter.operation hook "local-gateway-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
unit-kserve-controller-0: 07:26:44 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
unit-kserve-controller-0: 07:25:51 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
unit-kserve-controller-0: 07:26:37 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
orfeas-k commented 5 months ago

This is the same issue described in https://github.com/canonical/kserve-operators/issues/176 and https://github.com/canonical/kserve-operators/issues/229.

orfeas-k commented 4 months ago

Pebble plan doesn't have envvars set

Looking at the upstream code, we see that:

  1. Controller workload fails trying to get the deploy config here using NewDeployConfig function
  2. NewDeployConfig function is looking for the configmap in constants.KServeNamespace
  3. The constant gets its value from POD_NAMESPACE envvar or defaults to kserve.
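
In Python terms, the upstream lookup (a hypothetical equivalent of the Go constant, shown for illustration only) amounts to:

```python
import os

def kserve_namespace() -> str:
    # Mirrors the upstream behaviour: use the POD_NAMESPACE envvar if set,
    # otherwise fall back to the default "kserve" namespace. With the envvar
    # missing from the Pebble plan, the workload looks for the configmap in
    # "kserve" instead of "kubeflow", hence the "not found" error above.
    return os.environ.get("POD_NAMESPACE", "kserve")
```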

This envvar is set by our charm here. However, SSHing into the container, we see that none of the envvars are set there:

root@kserve-controller-0:/var/lib/juju# pebble plan   
services:
    kserve-controller:
        summary: KServe controller manager service
        startup: enabled
        override: replace
        command: /manager

To verify that this is the root cause, I created the configmap in the kserve namespace and the charm moved forward from that error and hit a second error, probably related to the missing secret envvar (logs).

orfeas-k commented 4 months ago

Rockcraft vs Charm pebble layer

After some investigation, it looks like the issue occurs when the rockcraft pebble layer prevails and overwrites the charm's one, which means that no environment variables are set. This is confirmed by looking at a healthy deployment where kserve-controller is working and the Pebble layer applied is the one from the charm:

root@kserve-controller-0:/var/lib/juju# pebble plan
services:
    kserve-controller:
        summary: KServe Controller
        startup: enabled
        override: replace
        command: /manager --metrics-addr=:8080
        environment:
            POD_NAMESPACE: kubeflow
            SECRET_NAME: kserve-webhook-server-cert
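
A charm-side layer producing the healthy plan above would look roughly like this (a sketch, not the actual charm source; function and argument names are assumed):

```python
def controller_layer(namespace: str, secret_name: str) -> dict:
    # Pebble layer dict matching the healthy plan: the /manager command
    # plus the environment variables the workload needs. If this layer is
    # never applied, the rock's bare layer (command only, no environment)
    # is what Pebble runs.
    return {
        "summary": "kserve-controller layer",
        "services": {
            "kserve-controller": {
                "override": "replace",
                "summary": "KServe Controller",
                "startup": "enabled",
                "command": "/manager --metrics-addr=:8080",
                "environment": {
                    "POD_NAMESPACE": namespace,
                    "SECRET_NAME": secret_name,
                },
            }
        },
    }
```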

We are discussing with rockcraft team to understand why and when this is happening.

orfeas-k commented 4 months ago

Root cause

It seems that the above is NOT the root cause of the controller failure, but just a side effect of it. The actual issue is that there is an execution path in the charm code where a specific series of events results in the charm trying to restart the service, and failing, before the update_layer() helper function (which includes add_layer() and replan()) runs. The charm then gets stuck there, since that event's handler doesn't update the layer with the charm's one. An example of such an execution is the following series of events:

1. install
2. ingress-gateway-relation-created
3. leader-elected
4. config-changed
5. start
6. ingress-gateway-relation-changed
7. ingress-gateway-relation-joined
8. ingress-gateway-relation-changed
9. local-gateway-relation-changed

Solution

To solve this, we're moving this charm to a catch-all event handler approach, as was already done for the charm on the main branch: https://github.com/canonical/kserve-operators/pull/197.
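
The catch-all idea can be sketched as follows, with a stand-in container object rather than the real ops Container (see the linked PR for the actual implementation):

```python
class FakeContainer:
    """Minimal stand-in for an ops Container, for illustration only."""

    def __init__(self):
        self.layers = []
        self.restarted = False

    def add_layer(self, name, layer, combine=True):
        self.layers.append((name, layer))

    def replan(self):
        pass

    def restart(self, service):
        # Models the failure mode: in the buggy path the charm called
        # restart() before any add_layer()/replan(), so the rock's layer
        # (no env vars) won and the service exited with code 1.
        if not self.layers:
            raise RuntimeError("service layer not set; would exit code 1")
        self.restarted = True


def on_any_event(container, layer):
    """Catch-all handler: whichever event fired, always (re)apply the
    charm's layer first, then restart the service, so no event ordering
    can reach restart() with the layer missing."""
    container.add_layer("kserve-controller", layer, combine=True)
    container.replan()
    container.restart("kserve-controller")
```

Routing every observed event through one handler like this removes the execution path in which a relation-changed hook restarts the service before the layer is in place.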

Confirming the solution

In order to confirm the solution, we need to reproduce the issue. Note that since the issue is intermittent, it may take many deployments before we hit it. To reproduce:

  1. Deploy CKF to AKS. Here's the guide for it: https://charmed-kubeflow.io/docs/deploy-charmed-kubeflow-to-aks. To facilitate this, you could run the following function from your terminal:
    deploy-ckf-to-aks () {
        az group create --name $1 --location westeurope
        az aks create --resource-group $1 --name $2 --kubernetes-version 1.29 --node-count 2 --node-vm-size Standard_D8s_v3 --node-osdisk-size 100 --node-osdisk-type Managed --os-sku Ubuntu
        az aks get-credentials --resource-group $1 --name $2 --admin
        juju add-k8s aks --client
        juju bootstrap aks $3
        juju add-model kubeflow
        juju deploy kubeflow --channel=$4 --trust
    }
    deploy-ckf-to-aks <ResourceGroupName> <AKSClusterName> aks-controller 1.8/stable
  2. Wait for kserve-controller with
    juju status kserve-controller --watch 1s
  3. If it presents the error above, you could try the following to confirm that juju resolved doesn't help the charm unblock.
    juju resolved kserve-controller/0
  4. Now that we're here, we must not juju refresh the charm, since that would take down the current charm and create a new pod, which doesn't guarantee the issue would be hit again. Instead, we'll use jhack sync for this. Thus:
    # git clone kserve-operators repo and `cd` into it
    git checkout origin/kf-5886-cherry-pick
    cd charms/kserve-controller
    jhack sync kserve-controller/0
    # respond yes to the prompt
  5. Run juju resolved again and the charm should go to active
    juju resolved kserve-controller/0
orfeas-k commented 4 months ago

The issue has been fixed by #246 and the charm was promoted to 0.11/stable.