Add a metrics-endpoint relation to tensorboard-controller and simple alert rules for the service being down.
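For context, the sketch below shows one common way a sidecar charm provides the metrics-endpoint relation via the prometheus_scrape charm library; the class name, scrape port, and job config are illustrative assumptions, not necessarily what this PR implements.

```python
# Illustrative sketch only -- names and the scrape port are assumptions.
from ops.charm import CharmBase
from ops.main import main

# Charm library implementing the provider side of the
# metrics-endpoint (prometheus_scrape) relation interface.
from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointProvider


class TensorboardControllerCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Advertise the controller's metrics endpoint so a related
        # grafana-agent-k8s (or Prometheus) can scrape it. Alert rule files
        # are picked up from ./src/prometheus_alert_rules by default.
        self.metrics_provider = MetricsEndpointProvider(
            self,
            jobs=[{"static_configs": [{"targets": ["*:8080"]}]}],
        )


if __name__ == "__main__":
    main(TensorboardControllerCharm)
```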
How to check metrics

```
$ cd charms/tensorboard-controller
$ tox -e integration -- --keep-models --model kubeflow
$ juju switch kubeflow
$ juju exec --unit grafana-agent-k8s/0 -- curl localhost:12345/agent/api/v1/metrics/targets | jq '.data.[] | select(.labels.juju_charm == "tensorboard-controller") | .endpoint'
"http://10.1.23.251:8080/metrics"
$ juju exec --unit grafana-agent-k8s/0 -- curl http://10.1.23.251:8080/metrics
# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge
controller_runtime_active_workers{controller="tensorboard"} 0
# HELP controller_runtime_max_concurrent_reconciles Maximum number of concurrent reconciles per controller
# TYPE controller_runtime_max_concurrent_reconciles gauge
controller_runtime_max_concurrent_reconciles{controller="tensorboard"} 1
```
full metrics example
How to check alert rule

```
$ kubectl exec -it -n kubeflow pod/tensorboard-controller-0 -c tensorboard-controller -- /charm/bin/pebble stop tensorboard-controller
$ juju exec --unit grafana-agent-k8s/0 -- curl localhost:12345/agent/api/v1/metrics/targets | jq '.data.[] | select(.labels.juju_charm == "tensorboard-controller")'
...
"scrape_error": "Get \"http://10.1.23.251:8080/metrics\": dial tcp 10.1.23.251:8080: connect: connection refused"
```
After 5 minutes you should see the alert firing (a COS deployment is required).
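For reference, a minimal sketch of what such a "service down" alert rule could look like (charms typically ship these under src/prometheus_alert_rules); the alert name, severity, and annotations here are assumptions rather than the exact rule added in this PR. The for: 5m clause is why the alert only fires after roughly 5 minutes of the target being unreachable.

```yaml
# Illustrative sketch only -- the actual rule in this PR may differ.
groups:
  - name: TensorboardControllerUnavailable
    rules:
      - alert: TensorboardControllerServiceDown
        # "up" is set to 0 by the scraper when the metrics target
        # cannot be reached (e.g. the pebble service is stopped).
        expr: up < 1
        # Fire only after the target has been down for 5 minutes.
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "tensorboard-controller metrics endpoint is down"
```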
Fixes #122