canonical / notebook-operators

Charmed Jupyter Notebooks
Apache License 2.0

Add alert rules to jupyter-controller based on the KF093 spec #402

Closed by rgildein 3 weeks ago

rgildein commented 1 month ago

These alert rules provide an overview of all service states.

By filtering alerts on KubeflowServiceDown or KubeflowServiceIsNotStable, the user can see the status of all Kubeflow services at a glance.

These changes can be tested by running the following commands:

tox -e integration -- --keep-models
juju -m <model-name> show-unit grafana-agent-k8s/0 --endpoint metrics-endpoint | yq '.[]."relation-info".[]."application-data".alert_rules | fromjson'
# if you have COS deployed
juju consume <controller>:cos.remote-write
juju integrate remote-write grafana-agent-k8s
# open the Grafana UI and check that KubeflowServiceDown and KubeflowServiceIsNotStable are listed
# you can also stop the Pebble service to see the alert rules fire
kubectl exec -it -n <model-name> pod/<pod-name> -c <workload-container> -- /charm/bin/pebble stop <service-name>
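
The manual check of the `alert_rules` output above could also be scripted. The following is a minimal sketch, assuming the decoded `alert_rules` value follows the usual Prometheus rule-file layout (`groups` containing `rules`, each alerting rule carrying an `alert` name); the helper names and the sample group name are hypothetical and not part of this PR.

```python
# Sketch only: assumes a Prometheus-style layout, {"groups": [{"rules": [{"alert": ...}]}]}.
# The two alert names come from the PR description; everything else is illustrative.
EXPECTED_ALERTS = {"KubeflowServiceDown", "KubeflowServiceIsNotStable"}


def alert_names(rules: dict) -> set:
    """Collect the names of all alerting rules in a rules document."""
    names = set()
    for group in rules.get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules have no "alert" key, so they are skipped.
            if "alert" in rule:
                names.add(rule["alert"])
    return names


def missing_alerts(rules: dict, expected: set = EXPECTED_ALERTS) -> set:
    """Return the expected alert names that are absent from the rules document."""
    return set(expected) - alert_names(rules)


# Hypothetical sample standing in for the decoded relation data.
sample = {
    "groups": [
        {"name": "hypothetical-group", "rules": [{"alert": "KubeflowServiceDown"}]}
    ]
}
print(missing_alerts(sample))  # → {'KubeflowServiceIsNotStable'}
```

An empty result from `missing_alerts` would mean both rules reached the relation data; it says nothing about whether the rules actually fire, which still needs the Grafana check.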

part-of: #1026