przemeklal closed this issue 1 year ago
A possible solution would be to include the following in `snapcraft.yaml`'s definition of the service:

```yaml
restart-condition: always
restart-delay: 5s
```
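For context, a minimal sketch of how these keys might sit in the snap's `apps` stanza — the app name `exporter` and the command path are hypothetical, only the `restart-*` keys come from the suggestion above:

```yaml
apps:
  exporter:
    command: bin/exporter        # hypothetical command path
    daemon: simple
    restart-condition: always    # or on-failure, as discussed below
    restart-delay: 5s            # keeps restarts under systemd's rate limit
```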
A service defined like this will keep restarting forever and will come back online once the controller is reachable again. This will likely affect the accuracy of the unit status reported by `juju status`, since there's a high chance of it checking the service status in the window after the service is restarted but before it crashes (this could be partly mitigated by a higher `restart-delay`).
Another impact that needs to be considered is what this does to Prometheus scrapes. In the brief moment between service start and crash, the `/metrics` endpoint returns an empty response (with a `200` code). I'll test what it does to Prometheus metrics and report back.
Note: it's important to include `restart-delay` to prevent the service from reaching `StartLimitBurst` within `StartLimitIntervalSec` under fast restarts. If the burst limit is reached, `systemd` will no longer try to restart the service. There's no way to set these two parameters directly from `snapcraft.yaml`, so our only option is to set `restart-delay` high enough to stay under the default restart limit (5 starts within 10 seconds).
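To make the rate-limit arithmetic concrete, here is a rough sketch (assuming the systemd defaults of `StartLimitBurst=5` within `StartLimitIntervalSec=10`; the worst-case model of an almost-instant crash after each start is my assumption, not from the issue):

```python
# Assumed systemd defaults: 5 starts allowed within a 10-second window.
START_LIMIT_BURST = 5
START_LIMIT_INTERVAL = 10.0  # seconds


def worst_case_starts(restart_delay: float, crash_after: float = 0.0) -> int:
    """Worst-case number of service starts in one rate-limit window,
    assuming the service crashes `crash_after` seconds after each start."""
    cycle = restart_delay + crash_after
    return int(START_LIMIT_INTERVAL // cycle) + 1


# With restart-delay: 5s, at most 3 starts fit in the 10s window -> under the limit.
print(worst_case_starts(5.0))                         # 3
# With near-instant restarts (0.1s cycle) the burst limit is exceeded.
print(worst_case_starts(0.1) > START_LIMIT_BURST)     # True
```

So a 5-second delay leaves a comfortable margin below the default burst limit even if the service crashes immediately on every start.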
I'd use `on-failure` rather than `always` for the restart condition to: 1) further minimize the possibility of hitting the unit start rate limit; 2) follow better practices. Per the systemd documentation:
> Setting this to on-failure is the recommended choice for long-running services, in order to increase reliability by attempting automatic recovery from errors. For services that shall be able to terminate on their own choice (and avoid immediate restarting), on-abnormal is an alternative choice.
I don't mind either way. It doesn't make much difference for this particular service, as the only way to exit "cleanly" is with a KeyboardInterrupt, and that's really more for development purposes; a regular user would not interact with the service this way. If the service is stopped with `systemctl stop`, `systemd` won't attempt a restart even if the restart policy is `always`.
I created PR #41 for this.
I didn't see any unintended impact on the Prometheus scrapes when the response was empty. The results of the query for `juju_machine_state` were simply empty:
```
ubuntu@juju-96f59e-pje-1:~$ curl "http://127.0.0.1:9090/api/v1/query?query=juju_machine_state"
{"status":"success","data":{"resultType":"vector","result":[]}}
```
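A client consuming that query result would see a successful response with an empty result vector rather than an error, so nothing downstream breaks. A small sketch of that handling, using the response body shown above (the parsing code is mine, not from the issue):

```python
import json

# Response body observed while the exporter's /metrics endpoint was empty:
# the query succeeds, but the instant vector contains no samples.
response_body = '{"status":"success","data":{"resultType":"vector","result":[]}}'

parsed = json.loads(response_body)
assert parsed["status"] == "success"          # not a scrape/query error
samples = parsed["data"]["result"]
print(len(samples))                           # 0 -> no juju_machine_state series
```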
As a result of a power outage and network disruption, the Juju controller was unreachable for a while. The exporter never recovered from this and stayed down until it was manually started 2 days later, even though the controllers came back in the meantime.

`snap services` output before the manual start:

Version: