Closed by ties 3 years ago
To be fair, the service is down during that phase. RTR will not work nor will any of the HTTP endpoints produce anything other than a 503 (with the exception of /version, I guess). So, this may actually be a feature: something did go wrong if Routinator restarts and you might want to investigate.
Fair point 👍.
In that case I will differentiate the alerts to accept a longer downtime for routinator.
As a user I monitor routinator with prometheus. While routinator is starting up it returns an HTTP 503 "Initial validation ongoing. Please wait.". This makes prometheus think the service is down, and the alert that checks for a down instance sometimes fires if startup takes too long (or we are unlucky with sampling).
For an instance being down we have an alert on:
For liveliness we check that routinator_last_update_done is not too high:
It would be better for us if routinator returned 200 for /metrics immediately.
This would make it harder to alert for liveliness. For liveliness I would generally prefer a timestamp of the last validation:
This can be initialised to 0 - you can just ignore the alert for a few minutes. This would result in an alert like:
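As a rough illustration of that logic (not a real Prometheus rule, and not a metric Routinator is known to export), the Python sketch below models the alert as "the last validation is older than max_age, and has been for at least hold seconds". The names is_stale, alert_fires, max_age and hold, and the thresholds, are all hypothetical. The hold period is what lets a timestamp initialised to 0 be ignored for a few minutes after startup.

```python
from typing import Iterable, Optional, Tuple


def is_stale(now: float, last_validation: float, max_age: float) -> bool:
    """True when the last completed validation is older than max_age seconds."""
    return now - last_validation > max_age


def alert_fires(
    samples: Iterable[Tuple[float, float]],
    max_age: float = 1800.0,  # assumed threshold: 30 minutes
    hold: float = 600.0,      # assumed grace period: 10 minutes
) -> bool:
    """Decide whether the liveness alert fires at the latest sample.

    samples are (scrape_time, last_validation_timestamp) pairs in time
    order. The alert fires only if every sample has been stale for at
    least `hold` seconds up to the latest scrape, so a timestamp that
    starts at 0 does not alert during a short startup.
    """
    stale_since: Optional[float] = None
    latest: Optional[float] = None
    for now, last_validation in samples:
        latest = now
        if is_stale(now, last_validation, max_age):
            if stale_since is None:
                stale_since = now  # start of the current stale streak
        else:
            stale_since = None     # a fresh validation resets the streak
    if stale_since is None or latest is None:
        return False
    return latest - stale_since >= hold
```

With these assumed thresholds, five minutes of a zero timestamp after a restart does not alert, while fifteen minutes does.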