canonical / kubeflow-tensorboards-operator

Tensorboards Operator
Apache License 2.0

Add metrics relation to tensorboard-controller #129

Closed by rgildein 3 months ago

rgildein commented 3 months ago

Add a metrics-endpoint relation to tensorboard-controller, plus simple alert rules for when the service is down.

How to check metrics

$ cd charms/tensorboard-controller
$ tox -e integration -- --keep-models --model kubeflow
$ juju switch kubeflow
$ juju exec --unit grafana-agent-k8s/0 -- curl localhost:12345/agent/api/v1/metrics/targets | jq '.data[] | select(.labels.juju_charm == "tensorboard-controller") | .endpoint'
"http://10.1.23.251:8080/metrics"
$ juju exec --unit grafana-agent-k8s/0 -- curl http://10.1.23.251:8080/metrics
# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge
controller_runtime_active_workers{controller="tensorboard"} 0
# HELP controller_runtime_max_concurrent_reconciles Maximum number of concurrent reconciles per controller
# TYPE controller_runtime_max_concurrent_reconciles gauge
controller_runtime_max_concurrent_reconciles{controller="tensorboard"} 1

full metrics example
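
To check a single series instead of reading the whole dump, the scrape output can be piped through a small filter. This is a minimal sketch, assuming the endpoint emits plain Prometheus text format without timestamps (as in the output above). `metric_value` is a hypothetical helper name, not something shipped with the charm.

```shell
# metric_value (hypothetical helper): read Prometheus text-format metrics on
# stdin and print the value of one exact series, labels included.
# Assumes no trailing timestamp, so the value is the last field on the line.
metric_value() {
    awk -v series="$1" '
        /^#/ { next }                        # skip HELP/TYPE comment lines
        {
            value = $NF                      # last whitespace-separated field
            name = $0
            sub(/[ \t]+[^ \t]+$/, "", name)  # strip the value, keep name{labels}
            if (name == series) { print value; found = 1 }
        }
        END { exit found ? 0 : 1 }           # non-zero exit if the series is absent
    '
}

# Sample input copied from the output shown above.
metric_value 'controller_runtime_max_concurrent_reconciles{controller="tensorboard"}' <<'EOF'
# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge
controller_runtime_active_workers{controller="tensorboard"} 0
# HELP controller_runtime_max_concurrent_reconciles Maximum number of concurrent reconciles per controller
# TYPE controller_runtime_max_concurrent_reconciles gauge
controller_runtime_max_concurrent_reconciles{controller="tensorboard"} 1
EOF
# prints: 1
```

Against a live model, the function can take the `curl` output from the command above on stdin in place of the sample text.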

How to check alert rule

$ kubectl exec -it -n kubeflow pod/tensorboard-controller-0 -c tensorboard-controller -- /charm/bin/pebble stop tensorboard-controller
$ juju exec --unit grafana-agent-k8s/0 -- curl localhost:12345/agent/api/v1/metrics/targets | jq '.data[] | select(.labels.juju_charm == "tensorboard-controller")'
...
"scrape_error": "Get \"http://10.1.23.251:8080/metrics\": dial tcp 10.1.23.251:8080: connect: connection refused"

After 5 minutes you should see the alert firing (a COS deployment is required).

Screenshot from 2024-08-06 15-26-15
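
Instead of re-running the targets query by hand while waiting, the check can be polled. This is a rough sketch only: `retry_until` and `target_is_down` are hypothetical names, and the attempt count and delay are arbitrary values rather than anything defined by the charm.

```shell
# retry_until (hypothetical helper): re-run a command until it succeeds or the
# attempt budget is used up.
# Usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # no need to sleep after the final failed attempt
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example use against a live model (left commented out because it needs juju).
# Note that the grep matches a refused connection on ANY scrape target, not
# only tensorboard-controller, so treat a match as a hint, not as proof.
#
# target_is_down() {
#     juju exec --unit grafana-agent-k8s/0 -- \
#         curl -s localhost:12345/agent/api/v1/metrics/targets |
#         grep -q 'connection refused'
# }
# retry_until 10 30 target_is_down && echo "scrape is failing"
```

This only confirms that the scrape is failing. Whether the alert itself has fired still has to be checked in the COS deployment.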

Fixes: #122

rgildein commented 3 months ago

Using fork