fix: correctly configure one scrape job to avoid firing alerts
The metrics endpoint configuration had two scrape jobs, one for the
regular metrics endpoint, and a second one based on a dynamic list of
targets. The latter was causing the Prometheus scraper to try to scrape
metrics from *:80/metrics, which is not a valid endpoint. This was
causing the UnitsUnavailable alert to fire constantly because that job
was reporting back that the endpoint was not available.
This new job was introduced by canonical/seldon-core-operator#94
with no apparent justification. Because the seldon charm has changed
since that PR, and the endpoint it is configuring is not valid, this
commit will remove the extra job.
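
As a rough illustration of the change described above, the scrape jobs before and after might look like the following. This is only a sketch using plain dicts in the shape of Prometheus static_configs: the port 8080, the helper drop_invalid_jobs, and the variable names are assumptions for illustration and are not taken from the charm, and the dynamic target list is simplified to a single static "*:80" entry.

```python
# Hypothetical scrape-job definitions as plain dicts (static_configs shape).
METRICS_PORT = 8080  # assumed fixed port; not taken from the charm

# Before: two jobs, the second one pointing at the invalid *:80 target.
jobs_before = [
    {"static_configs": [{"targets": [f"*:{METRICS_PORT}"]}]},
    {"static_configs": [{"targets": ["*:80"]}]},
]


def drop_invalid_jobs(jobs, invalid_targets):
    """Keep only the jobs that reference none of the invalid targets."""
    kept = []
    for job in jobs:
        targets = [t for cfg in job["static_configs"] for t in cfg["targets"]]
        if not any(t in invalid_targets for t in targets):
            kept.append(job)
    return kept


# After: only the regular metrics endpoint job remains.
jobs_after = drop_invalid_jobs(jobs_before, {"*:80"})
print(jobs_after)
```

With a single job left, no scrape target reports the endpoint as down, so an availability alert has nothing to fire on.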
This commit also refactors the MetricsEndpointProvider instantiation and
removes the metrics-port config option as this value should not change.
Finally, this commit changes the alert rule interval from 0m to 5m, as
this interval is more appropriate for production environments.
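
To show why a non-zero interval is less noisy, here is a small simulation. It assumes the "interval" corresponds to the `for` duration of a Prometheus alert rule, where a rule only fires once its condition has held continuously for that long. The function ever_fires and the sample data are hypothetical and only model that behaviour; they are not part of the charm.

```python
def ever_fires(samples, for_seconds):
    """Return True if the alert would fire at any sample.

    samples: list of (timestamp_seconds, condition_is_true), sorted by time.
    The alert fires once the condition has been continuously true for at
    least for_seconds; any false sample resets the pending timer.
    """
    pending_since = None
    for ts, active in samples:
        if not active:
            pending_since = None  # condition cleared, reset the timer
            continue
        if pending_since is None:
            pending_since = ts  # condition just became true
        if ts - pending_since >= for_seconds:
            return True
    return False


# A one-sample blip, e.g. a unit restarting between two scrapes.
blip = [(0, False), (60, True), (120, False)]
# A condition that stays true for five minutes.
sustained = [(t, True) for t in range(0, 301, 60)]

print(ever_fires(blip, 0), ever_fires(blip, 300))
print(ever_fires(sustained, 0), ever_fires(sustained, 300))
```

Under these assumptions, a 0m duration fires on the brief blip, while 5m fires only on the sustained outage.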
Part of canonical/bundle-kubeflow#564
Testing
Follow the steps to reproduce in this comment, deploying only this app. After deploying this app and cos-lite, along with their relations and dependencies, wait about 10 minutes. None of the alerts for this app should fire.
TODO
[x] refactor integration test case test_prometheus_grafana_integration
track/ckf-1.8