Closed dparv closed 4 months ago
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5874.
This message was autogenerated
Wrong repo, by the way; this should be filed in https://github.com/canonical/kserve-operators/
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5886.
This message was autogenerated
Another run today, with a different error:
unit-kserve-controller-0: 07:26:43 INFO unit.kserve-controller/0.juju-log local-gateway:40: HTTP Request: PATCH https://10.0.0.1/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/inferencegraph.serving.kserve.io?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"
unit-kserve-controller-0: 07:26:43 INFO unit.kserve-controller/0.juju-log local-gateway:40: HTTP Request: PATCH https://10.0.0.1/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/inferenceservice.serving.kserve.io?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"
unit-kserve-controller-0: 07:26:44 INFO unit.kserve-controller/0.juju-log local-gateway:40: HTTP Request: PATCH https://10.0.0.1/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/trainedmodel.serving.kserve.io?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"
unit-kserve-controller-0: 07:26:44 INFO unit.kserve-controller/0.juju-log local-gateway:40: Reconcile completed successfully
unit-kserve-controller-0: 07:26:44 INFO unit.kserve-controller/0.juju-log local-gateway:40: Rendering manifests
unit-kserve-controller-0: 07:26:44 INFO unit.kserve-controller/0.juju-log local-gateway:40: HTTP Request: PATCH https://10.0.0.1/api/v1/namespaces/kubeflow/configmaps/inferenceservice-config?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"
unit-kserve-controller-0: 07:26:44 INFO unit.kserve-controller/0.juju-log local-gateway:40: Reconcile completed successfully
unit-kserve-controller-0: 07:26:44 ERROR unit.kserve-controller/0.juju-log local-gateway:40: Uncaught exception while in charm code:
Traceback (most recent call last):
File "./src/charm.py", line 702, in <module>
main(KServeControllerCharm)
File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/main.py", line 441, in main
_emit_charm_event(charm, dispatcher.event_name)
File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
event_to_emit.emit(*args, **kwargs)
File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 342, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 839, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 928, in _reemit
custom_handler(event)
File "./src/charm.py", line 543, in _on_local_gateway_relation_changed
self._on_install(event)
File "./src/charm.py", line 495, in _on_install
self._restart_controller_service()
File "./src/charm.py", line 694, in _restart_controller_service
self.controller_container.restart(self._controller_container_name)
File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/model.py", line 2045, in restart
self._pebble.restart_services(service_names)
File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/pebble.py", line 1746, in restart_services
return self._services_action('restart', services, timeout, delay)
File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/pebble.py", line 1767, in _services_action
raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "kserve-controller" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2024-06-24T07:26:44Z INFO Service "kserve-controller" has never been started.
----- Logs from task 1 -----
2024-06-24T07:26:44Z INFO Most recent service output:
{"level":"info","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"Setting up client for manager"}
{"level":"info","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"Setting up manager"}
{"level":"info","ts":"2024-06-24T07:26:44Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"Registering Components."}
{"level":"info","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"Setting up KServe v1alpha1 scheme"}
{"level":"info","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"Setting up KServe v1beta1 scheme"}
{"level":"error","ts":"2024-06-24T07:26:44Z","logger":"entrypoint","msg":"unable to get deploy config.","error":"configmaps \"inferenceservice-config\" not found","stacktrace":"main.main\n\t/root/parts/controller/build/cmd/manager/main.go:141\nruntime.main\n\t/snap/go/current/src/runtime/proc.go:250"}
2024-06-24T07:26:44Z ERROR cannot start service: exited quickly with code 1
-----
unit-kserve-controller-0: 07:26:44 ERROR juju.worker.uniter.operation hook "local-gateway-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
unit-kserve-controller-0: 07:26:44 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
unit-kserve-controller-0: 07:25:51 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
unit-kserve-controller-0: 07:26:37 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
This is the same issue described in https://github.com/canonical/kserve-operators/issues/176 and https://github.com/canonical/kserve-operators/issues/229.
Looking at the upstream code, we see that the `NewDeployConfig` function looks up the configmap in `constants.KServeNamespace`, which is read from the `POD_NAMESPACE` envvar and falls back to `kserve` when it is unset. This envvar is set by our charm here. However, `ssh`ing into the container, we see that none of the envvars are set there:
root@kserve-controller-0:/var/lib/juju# pebble plan
services:
kserve-controller:
summary: KServe controller manager service
startup: enabled
override: replace
command: /manager
To verify that this is the root cause, I created the configmap in the `kserve` namespace, and the charm moved past that error and hit a second one, probably related to the missing secret envvar (logs).
After some investigation, it looks like the issue occurs when the rockcraft pebble layer prevails and overwrites the charm's, which means that no environment variables are set. This is confirmed by looking at a healthy deployment where kserve-controller is working and the pebble layer applied is the charm's:
root@kserve-controller-0:/var/lib/juju# pebble plan
services:
kserve-controller:
summary: KServe Controller
startup: enabled
override: replace
command: /manager --metrics-addr=:8080
environment:
POD_NAMESPACE: kubeflow
SECRET_NAME: kserve-webhook-server-cert
We are discussing this with the rockcraft team to understand why and when it happens.
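For context, the bare plan above is consistent with Pebble's layer-override behavior: a service declared with `override: replace` in a later layer replaces the earlier definition entirely, so if the charm's layer is never added, `pebble plan` shows only the rock's env-less definition. A rough model of this (plain Python; `combine_layers` is an illustrative simplification, not Pebble's actual implementation):

```python
# Simplified model of Pebble layer combination for a service using
# "override: replace": the last layer that defines the service
# replaces earlier definitions wholesale (it does not merge them).

def combine_layers(layers):
    plan = {}
    for layer in layers:  # layers in the order they were added
        for name, svc in layer.items():
            if svc.get("override") == "replace":
                plan[name] = svc            # replace entirely
            else:
                plan.setdefault(name, {}).update(svc)
    return plan

rock_layer = {"kserve-controller": {
    "override": "replace",
    "command": "/manager"}}

charm_layer = {"kserve-controller": {
    "override": "replace",
    "command": "/manager --metrics-addr=:8080",
    "environment": {"POD_NAMESPACE": "kubeflow",
                    "SECRET_NAME": "kserve-webhook-server-cert"}}}

# Healthy unit: the charm's layer was added on top of the rock's,
# so the combined plan carries the environment variables.
healthy = combine_layers([rock_layer, charm_layer])

# Broken unit: the charm's layer was never added, so the rock's
# env-less definition is all that `pebble plan` shows.
broken = combine_layers([rock_layer])
```

This matches the two `pebble plan` outputs above: the broken unit shows only `command: /manager`, while the healthy one carries the charm's command and environment.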
It seems that the above is NOT the root cause of the controller failing, but just a side effect of it. The actual issue is that there is an execution path in the charm code where a specific series of events results in the charm trying to restart the service, and failing, before the `update_layer()` helper function (which calls `add_layer()` and `replan()`) runs. The charm then gets stuck there, since the event's handler never overwrites the layer with the charm's. An example of this execution is the following series of events:
1. install
2. ingress-gateway-relation-created
3. leader-elected
4. config-changed
5. start
6. ingress-gateway-relation-changed
7. ingress-gateway-relation-joined
8. ingress-gateway-relation-changed
9. local-gateway-relation-changed
To solve this, we're moving this charm to a single catch-all event handler, as we did for the charm on the `main` branch in https://github.com/canonical/kserve-operators/pull/197.
In order to confirm the solution, we need to reproduce the issue. Note that since the issue is intermittent, it may take many deployments to hit it. To reproduce:
deploy-ckf-to-aks () {
az group create --name $1 --location westeurope
az aks create --resource-group $1 --name $2 --kubernetes-version 1.29 --node-count 2 --node-vm-size Standard_D8s_v3 --node-osdisk-size 100 --node-osdisk-type Managed --os-sku Ubuntu
az aks get-credentials --resource-group $1 --name $2 --admin
juju add-k8s aks --client
juju bootstrap aks $3
juju add-model kubeflow
juju deploy kubeflow --channel=$4 --trust
}
deploy-ckf-to-aks <ResourceGroupName> <AKSClusterName> aks-controller 1.8/stable
juju status kserve-controller --watch 1s
Note that `juju resolved` doesn't unblock the charm:
juju resolved kserve-controller/0
We also can't just `juju refresh` the charm, since that would take down the current pod and create a new one, with no guarantee of hitting the issue again. Instead, we'll use `jhack sync` for this:
# git clone kserve-operators repo and `cd` into it
git checkout origin/kf-5886-cherry-pick
cd charms/kserve-controller
jhack sync kserve-controller/0
# respond yes to the prompt
juju resolved kserve-controller/0
The issue has been fixed by #246 and the charm was promoted to `0.11/stable`.
Bug Description
Fresh deployment:
All the rest of the charms are active/idle:
To Reproduce
juju deploy kubeflow --channel 1.8/stable
Environment
juju 3.4.3, Azure AKS 1.28.9
Relevant Log Output