canonical / kserve-operators

Charmed KServe

Add alert rules to kserve-controller based on the KF093 spec #265

Closed: rgildein closed this 3 months ago

rgildein commented 3 months ago

These alert rules give an overview of the state of every Kubeflow service.

By filtering on the KubeflowServiceDown or KubeflowServiceIsNotStable alert, the user can see the status of all Kubeflow services at a glance.
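
For orientation, the two alerts might be defined roughly as in the following Prometheus rule group. This is a sketch, not the file added by this PR: only the two alert names come from the description above, while the group name, expressions, durations, severities, and annotation text are assumptions.

```yaml
# Hypothetical sketch; only the alert names are taken from this PR.
groups:
  - name: KubeflowKserveControllerServices  # assumed group name
    rules:
      - alert: KubeflowServiceDown
        # Assumed expression: the scrape target has been down for 5 minutes straight.
        expr: up{} < 1
        for: 5m
        labels:
          severity: critical  # assumed severity
        annotations:
          summary: "{{ $labels.juju_charm }} service is down ({{ $labels.juju_model }}/{{ $labels.juju_unit }})"
      - alert: KubeflowServiceIsNotStable
        # Assumed expression: the target was up for less than half of the last 10 minutes.
        expr: avg_over_time(up{}[10m]) < 0.5
        for: 0m
        labels:
          severity: warning  # assumed severity
        annotations:
          summary: "{{ $labels.juju_charm }} service is not stable ({{ $labels.juju_model }}/{{ $labels.juju_unit }})"
```

With rules shaped like this, a service that is stopped outright should trigger KubeflowServiceDown, while one that keeps flapping should trigger KubeflowServiceIsNotStable.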

These changes can be tested by running the following commands:

tox -e integration -- --keep-models
juju -m <model-name> show-unit grafana-agent-k8s/0 --endpoint metrics-endpoint | yq '.[]."relation-info".[]."application-data".alert_rules | fromjson'
# if you have COS deployed
juju consume <controller>:cos.remote-write
juju integrate remote-write grafana-agent-k8s
# open the Grafana UI and check that KubeflowServiceDown and KubeflowServiceIsNotStable are listed
# you can also stop the Pebble service to see the alert rules fire
kubectl exec -it -n <model-name> pod/<pod-name> -c <workload-container> -- /charm/bin/pebble stop <service-name>

part-of: #1026