canonical / kserve-operators

Charmed KServe

`kserve-controller` fails to start because `inferenceservice-config` is not found #229

Closed motjuste closed 4 months ago

motjuste commented 7 months ago

Bug Description

Several recent Solutions QA runs have failed to deploy Charmed Kubeflow because the kserve-controller charm keeps going into an error state, especially since the release of rev 523 on the 0.11/stable channel.

To Reproduce

Solutions QA uses FCE for its deployments, but the steps are analogous.

  1. Deploy K8s 1.28 on AWS using Juju 2.9/candidate.
  2. juju deploy --trust kubeflow --channel latest/stable

kserve-controller should eventually settle and become active / idle, but it goes into an error state instead.

Environment

Relevant Log Output

2024-03-28T02:57:05.590Z [container-agent] 2024-03-28 02:57:05 ERROR juju-log local-gateway:31: Uncaught exception while in charm code:
2024-03-28T02:57:05.590Z [container-agent] Traceback (most recent call last):
2024-03-28T02:57:05.590Z [container-agent]   File "./src/charm.py", line 702, in <module>
2024-03-28T02:57:05.590Z [container-agent]     main(KServeControllerCharm)
2024-03-28T02:57:05.590Z [container-agent]   File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/main.py", line 441, in main
2024-03-28T02:57:05.590Z [container-agent]     _emit_charm_event(charm, dispatcher.event_name)
2024-03-28T02:57:05.590Z [container-agent]   File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
2024-03-28T02:57:05.590Z [container-agent]     event_to_emit.emit(*args, **kwargs)
2024-03-28T02:57:05.590Z [container-agent]   File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 342, in emit
2024-03-28T02:57:05.590Z [container-agent]     framework._emit(event)
2024-03-28T02:57:05.590Z [container-agent]   File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 839, in _emit
2024-03-28T02:57:05.590Z [container-agent]     self._reemit(event_path)
2024-03-28T02:57:05.590Z [container-agent]   File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/framework.py", line 928, in _reemit
2024-03-28T02:57:05.590Z [container-agent]     custom_handler(event)
2024-03-28T02:57:05.590Z [container-agent]   File "./src/charm.py", line 543, in _on_local_gateway_relation_changed
2024-03-28T02:57:05.590Z [container-agent]     self._on_install(event)
2024-03-28T02:57:05.590Z [container-agent]   File "./src/charm.py", line 495, in _on_install
2024-03-28T02:57:05.590Z [container-agent]     self._restart_controller_service()
2024-03-28T02:57:05.590Z [container-agent]   File "./src/charm.py", line 694, in _restart_controller_service
2024-03-28T02:57:05.590Z [container-agent]     self.controller_container.restart(self._controller_container_name)
2024-03-28T02:57:05.590Z [container-agent]   File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/model.py", line 2045, in restart
2024-03-28T02:57:05.590Z [container-agent]     self._pebble.restart_services(service_names)
2024-03-28T02:57:05.590Z [container-agent]   File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/pebble.py", line 1746, in restart_services
2024-03-28T02:57:05.590Z [container-agent]     return self._services_action('restart', services, timeout, delay)
2024-03-28T02:57:05.590Z [container-agent]   File "/var/lib/juju/agents/unit-kserve-controller-0/charm/venv/ops/pebble.py", line 1767, in _services_action
2024-03-28T02:57:05.590Z [container-agent]     raise ChangeError(change.err, change)
2024-03-28T02:57:05.590Z [container-agent] ops.pebble.ChangeError: cannot perform the following tasks:
2024-03-28T02:57:05.590Z [container-agent] - Start service "kserve-controller" (cannot start service: exited quickly with code 1)
2024-03-28T02:57:05.590Z [container-agent] ----- Logs from task 0 -----
2024-03-28T02:57:05.590Z [container-agent] 2024-03-28T02:57:05Z INFO Service "kserve-controller" has never been started.
2024-03-28T02:57:05.590Z [container-agent] ----- Logs from task 1 -----
2024-03-28T02:57:05.590Z [container-agent] 2024-03-28T02:57:05Z INFO Most recent service output:
2024-03-28T02:57:05.590Z [container-agent]     {"level":"info","ts":"2024-03-28T02:57:05Z","logger":"entrypoint","msg":"Setting up client for manager"}
2024-03-28T02:57:05.590Z [container-agent]     {"level":"info","ts":"2024-03-28T02:57:05Z","logger":"entrypoint","msg":"Setting up manager"}
2024-03-28T02:57:05.590Z [container-agent]     {"level":"info","ts":"2024-03-28T02:57:05Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
2024-03-28T02:57:05.590Z [container-agent]     {"level":"info","ts":"2024-03-28T02:57:05Z","logger":"entrypoint","msg":"Registering Components."}
2024-03-28T02:57:05.590Z [container-agent]     {"level":"info","ts":"2024-03-28T02:57:05Z","logger":"entrypoint","msg":"Setting up KServe v1alpha1 scheme"}
2024-03-28T02:57:05.590Z [container-agent]     {"level":"info","ts":"2024-03-28T02:57:05Z","logger":"entrypoint","msg":"Setting up KServe v1beta1 scheme"}
2024-03-28T02:57:05.590Z [container-agent]     {"level":"error","ts":"2024-03-28T02:57:05Z","logger":"entrypoint","msg":"unable to get deploy config.","error":"configmaps \"inferenceservice-config\" not found","stacktrace":"main.main\n\t/root/parts/controller/build/cmd/manager/main.go:141\nruntime.main\n\t/snap/go/current/src/runtime/proc.go:250"}
2024-03-28T02:57:05.590Z [container-agent] 2024-03-28T02:57:05Z ERROR cannot start service: exited quickly with code 1
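The root cause in the output above sits in a single JSON line emitted by the controller, buried under the charm traceback. As a rough triage aid, the sketch below pulls the error-level JSON entries out of such output. The helper name `error_entries` and the `sample` variable are hypothetical and not part of the charm; the two sample lines are copied from the log above.

```python
import json


def error_entries(service_output: str) -> list[dict]:
    """Return the JSON log entries in the output whose level is "error"."""
    entries = []
    for line in service_output.splitlines():
        # The JSON payload follows the "[container-agent]" prefix, if present.
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            # Not a JSON log line (for example, part of the Python traceback).
            continue
        if isinstance(entry, dict) and entry.get("level") == "error":
            entries.append(entry)
    return entries


# Raw string so the JSON escapes (\" and \n\t) reach json.loads unchanged.
sample = r'''
2024-03-28T02:57:05.590Z [container-agent]     {"level":"info","ts":"2024-03-28T02:57:05Z","logger":"entrypoint","msg":"Setting up KServe v1beta1 scheme"}
2024-03-28T02:57:05.590Z [container-agent]     {"level":"error","ts":"2024-03-28T02:57:05Z","logger":"entrypoint","msg":"unable to get deploy config.","error":"configmaps \"inferenceservice-config\" not found","stacktrace":"main.main\n\t/root/parts/controller/build/cmd/manager/main.go:141\nruntime.main\n\t/snap/go/current/src/runtime/proc.go:250"}
'''

for entry in error_entries(sample):
    print(entry["msg"], entry["error"])
```

For this run, the only error-level entry is the missing `inferenceservice-config` ConfigMap, which is what makes the service exit with code 1.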


Additional Context

Recent test runs at Solutions QA with this as root cause for failure:
- 06bd8cf1-95ed-41eb-8a08-13f9f6efee16
- c39ed7ea-0ed7-4df8-b19a-8c8deb6f20dc

syncronize-issues-to-jira[bot] commented 7 months ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5514.

This message was autogenerated

orfeas-k commented 4 months ago

Issue has been fixed by #246 and the charm was promoted to 0.11/stable.