NLnetLabs / routinator

An RPKI Validator and RTR server written in Rust
https://nlnetlabs.nl/projects/routing/routinator/
BSD 3-Clause "New" or "Revised" License

`/metrics` responds with 503 during initial validation, consider returning an empty response #563

Closed: ties closed this issue 3 years ago

ties commented 3 years ago

As a user, I monitor Routinator with Prometheus. While Routinator is starting up, `/metrics` returns an HTTP 503 ("Initial validation ongoing. Please wait."). Prometheus then treats the target as down, and if startup takes too long (or we are unlucky with scrape timing), our alert for a down instance fires.

For an instance being down we have an alert on:

name: InstanceDownSlow
expr: up{slow="true"} == 0
for: 2m
labels:
  severity: P1
annotations:
  description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for {{ humanizeDuration $value }}."
  summary: Instance {{ $labels.instance }} down

For liveness, we check that `routinator_last_update_done` is not too high:

name: RoutinatorNotUpdating
expr: routinator_last_update_done > 1800
labels:
  severity: P2
annotations:
  description: routinator at {{ $labels.instance }} has not updated for {{ humanizeDuration $value }}.
  summary: routinator at {{ $labels.instance }} has not updated for {{ humanizeDuration $value }}.

It would be better for us if `/metrics` returned an empty response instead of a 503 during initial validation.

This would make it harder to alert on liveness. For liveness, I would generally prefer a timestamp of the last validation:

# from octorpki
# HELP last_validation Timestamp of last validation.
# TYPE last_validation gauge
last_validation 1.621589039e+09

This can be initialised to 0; you can just ignore the alert for the first few minutes. This would result in an alert like:

name: RoutinatorNotUpdating
expr: time() - routinator_last_validation > 1200
for: 10m
labels:
  severity: P2
annotations:
  description: routinator at {{ $labels.instance }} has not validated for 30m.
  summary: routinator at {{ $labels.instance }} has not validated for 30m.

partim commented 3 years ago

To be fair, the service is down during that phase. RTR will not work nor will any of the HTTP endpoints produce anything other than a 503 (with the exception of /version, I guess). So, this may actually be a feature: something did go wrong if Routinator restarts and you might want to investigate.

ties commented 3 years ago

Fair point 👍.

In that case I will differentiate the alerts to accept a longer downtime for routinator.
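
A minimal sketch of what such differentiated rules might look like, written in the standard Prometheus rule-file layout (`groups` / `rules` / `alert`). The group name, the `job="routinator"` label, the `RoutinatorDown` alert name, and the 15m grace period are all assumptions for illustration, not values taken from this thread; the grace period would need to be tuned to how long initial validation takes in a given setup.

```yaml
groups:
  - name: instance-availability   # hypothetical group name
    rules:
      # Generic rule: unchanged, except that Routinator targets are excluded.
      # Assumes Routinator is scraped under job="routinator".
      - alert: InstanceDownSlow
        expr: 'up{slow="true", job!="routinator"} == 0'
        for: 2m
        labels:
          severity: P1
        annotations:
          summary: "Instance {{ $labels.instance }} down"

      # Routinator-specific rule: tolerates the 503 responses served
      # during initial validation by waiting longer before firing.
      # The 15m value is an assumption, not a measured startup time.
      - alert: RoutinatorDown
        expr: 'up{job="routinator"} == 0'
        for: 15m
        labels:
          severity: P1
        annotations:
          summary: "Routinator {{ $labels.instance }} down"
```

With this split, a restart that finishes initial validation within the grace period never leaves the pending state, while a Routinator instance that stays unavailable for longer still pages.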