canonical / prometheus-juju-exporter

GNU General Public License v3.0

The exporter never recovers from a single failure to connect to Juju controller #40

Closed przemeklal closed 1 year ago

przemeklal commented 1 year ago

As a result of a power outage and network disruption, the Juju controller was unreachable for a while. The exporter never recovered from this and stayed down until it was manually started 2 days later, even though the controllers came back in the meantime:

2023-03-25T22:35:38Z prometheus-juju-exporter.prometheus-juju-exporter[24678]: ERROR:prometheus_juju_exporter.exporter:Collection job resulted in error: Unable to connect to any endpoint: <redacted_ip>:17070
2023-03-25T22:35:38Z systemd[1]: snap.prometheus-juju-exporter.prometheus-juju-exporter.service: Main process exited, code=exited, status=1/FAILURE
2023-03-25T22:35:38Z systemd[1]: snap.prometheus-juju-exporter.prometheus-juju-exporter.service: Failed with result 'exit-code'.
2023-03-27T07:24:32Z systemd[1]: Started Service for snap application prometheus-juju-exporter.prometheus-juju-exporter.

snap services output before the manual start:

root@juju-1:~# snap services
Service                                            Startup  Current   Notes
prometheus-juju-exporter.prometheus-juju-exporter  enabled  inactive  -

Version:

prometheus-juju-exporter  1.0.1     31     latest/stable  canonical✓  -
mkalcok commented 1 year ago

A possible solution would be to include the following in the snapcraft.yaml definition of the service:

    restart-condition: always
    restart-delay: 5s

A service defined like this will keep restarting indefinitely and will come back online once the controller is reachable again. This will likely affect the accuracy of the unit status reported by juju status, as there's a high chance of the status check landing in the window after the service is restarted but before it crashes again (this could be partly mitigated by a higher restart-delay).
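
For context, a rough sketch of where these keys would sit in snapcraft.yaml (the app name, command path and daemon type below are illustrative, not copied from the actual snapcraft.yaml):

    apps:
      prometheus-juju-exporter:
        command: bin/prometheus-juju-exporter  # illustrative command path
        daemon: simple
        restart-condition: always
        restart-delay: 5s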

Another impact that needs to be considered is what it'll do to Prometheus scrapes. In the brief moment between service start and crash, the /metrics endpoint returns an empty response (with a 200 status code). I'll test what it does to Prometheus metrics and report back.

Note: It's important to include restart-delay to prevent the service from reaching StartLimitBurst within StartLimitIntervalSec due to fast restarts. If the burst limit is reached, systemd will no longer try to restart the service. There's no way to set these two parameters directly from snapcraft.yaml, so our only option is to set restart-delay high enough not to hit the default restart limit (5 restarts within 10 seconds).
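
If useful, the effective limits on the generated unit can be double-checked with systemctl; the values below are the systemd defaults, assuming no overrides:

root@juju-1:~# systemctl show snap.prometheus-juju-exporter.prometheus-juju-exporter.service -p StartLimitBurst -p StartLimitIntervalUSec
StartLimitBurst=5
StartLimitIntervalUSec=10s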

agileshaw commented 1 year ago

I'd use on-failure rather than always for the restart condition to: 1) further minimize the possibility of hitting the unit start rate limit; 2) follow better practices. Per the systemd documentation:

Setting this to on-failure is the recommended choice for long-running services, in order to increase reliability by attempting automatic recovery from errors. For services that shall be able to terminate on their own choice (and avoid immediate restarting), on-abnormal is an alternative choice.
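
For reference, snapcraft's restart-condition and restart-delay should translate into the Restart= and RestartSec= directives of the generated unit, so with on-failure the relevant part of the [Service] section would look roughly like this (a sketch, not the full generated file):

    [Service]
    Restart=on-failure
    RestartSec=5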

mkalcok commented 1 year ago

I don't mind either way. It does not make much difference for this particular service, as the only way to exit "cleanly" is with a KeyboardInterrupt, and that's really more for development purposes; a regular user would not interact with the service this way. If the service is stopped with systemctl stop, systemd won't attempt a restart even if the restart policy is always.

sudeephb commented 1 year ago

I created PR #41 for this. I didn't see any unintended impacts on the Prometheus scrapes when the response was empty. The results of the query for juju_machine_state were just empty:

ubuntu@juju-96f59e-pje-1:~$ curl "http://127.0.0.1:9090/api/v1/query?query=juju_machine_state"
{"status":"success","data":{"resultType":"vector","result":[]}}
