csmith / centauri

TLS-terminating reverse proxy in Go
MIT License

Better behaviour if lego provider fails to be created #99

Open csmith opened 1 week ago

csmith commented 1 week ago

At the minute the default for `certificate-providers` is `lego selfsigned`. If the lego supplier fails to start up[^1], a warning will be logged but everything will carry on as normal, and eventually all certs will be replaced with self-signed ones. The only indication in the logs that anything is wrong is a single warning on startup.

This is the desired behaviour if you aren't using lego (e.g. when using selfsigned, or when using tailscale): you simply don't configure lego and it fails through to self-signed. It's very much not desirable if you do intend to use lego and it fails though.
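For illustration, the fall-through described above could be sketched roughly like this. It is not Centauri's actual code: `Supplier`, `namedSupplier`, and `pickSupplier` are hypothetical names, and the map stands in for "whichever suppliers were successfully created at startup".

```go
package main

// Supplier is a hypothetical stand-in for anything that can issue certificates.
type Supplier interface {
	Name() string
}

// namedSupplier is a trivial Supplier used only for illustration.
type namedSupplier string

func (n namedSupplier) Name() string { return string(n) }

// pickSupplier walks the preference list in order and returns the first
// supplier that actually exists, along with the names that were skipped
// because they were never created.
func pickSupplier(preferred []string, available map[string]Supplier) (Supplier, []string) {
	var skipped []string
	for _, name := range preferred {
		if s, ok := available[name]; ok {
			return s, skipped
		}
		skipped = append(skipped, name)
	}
	return nil, skipped
}
```

With a preference list of `lego selfsigned` and only `selfsigned` present in the map, this returns the self-signed supplier and reports `lego` as skipped. That is the silent fall-through this issue is about: nothing distinguishes "lego was never configured" from "lego was configured but failed to start".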

Some options:

1. Explicitly check on startup whether the acme-email setting is valid and, if it is, bail out when the lego supplier fails.

I don't like this as I'd prefer Centauri to carry on working as much as it can for as long as it can. If it has a bunch of saved certificates valid for 89 days, it seems stupid to refuse to start up if the ACME service is down.

2. Try to create the supplier every time it's needed

This seems like it will end up spamming the ACME service if the problem is with the config, and I'm not confident we can actually tell from the errors we get through lego whether the error is caused by us or not.

It would also make a complete mess of the architecture as the cert manager would need to be able to create suppliers, but that's probably solvable.

3. Change the default setting to just lego and error out if the specified providers don't exist

See option one for arguments against bailing out

4. Add more logging when checking a cert if a preferred supplier isn't available

Probably a good idea, but still hides the problem away in the logs. Also a little spammy if you're not intending on using lego, but left everything as defaults.

5. Do nothing, wait for the healthcheck endpoint (#23) to exist and let that sort it out

Healthcheck endpoint would at least expose the problem in a nicer way than a single warning in the logs, but still means there's some OOB thing you have to look at to figure out everything is going wrong. (But I think that's always going to be the case when I want the behaviour to be "serve as much as possible at all costs"...)
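As a rough sketch of how option 4 could be made less spammy, the warning could be throttled so each missing supplier is reported at most once per interval. Everything here is an assumption for illustration: `warnThrottle`, `newWarnThrottle`, `missingSupplierWarning`, and the message wording are made up and are not part of Centauri.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// warnThrottle (hypothetical) limits a repeated warning to once per
// interval for each supplier name.
type warnThrottle struct {
	mu       sync.Mutex
	interval time.Duration
	last     map[string]time.Time
}

func newWarnThrottle(interval time.Duration) *warnThrottle {
	return &warnThrottle{interval: interval, last: make(map[string]time.Time)}
}

// missingSupplierWarning returns the message to log when a preferred
// supplier is unavailable for a domain. The boolean is false when a warning
// for the same supplier was already issued within the interval.
func (w *warnThrottle) missingSupplierWarning(supplier, domain string) (string, bool) {
	w.mu.Lock()
	defer w.mu.Unlock()

	now := time.Now()
	if last, seen := w.last[supplier]; seen && now.Sub(last) < w.interval {
		return "", false
	}
	w.last[supplier] = now
	return fmt.Sprintf("preferred certificate supplier %q is unavailable; falling back for %s", supplier, domain), true
}
```

The caller would pass any returned message to whatever logger is in use. This only reduces the volume; it doesn't address the concern that the problem is still hidden away in the logs.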

[^1]: which I previously assumed would only happen due to config issues, but can actually happen if there's no network connectivity when Centauri starts (or the ACME provider is down)

ShaneMcC commented 1 week ago

Number 2 seems like the best approach to me. Assuming that means "once per renewal cycle" rather than once per certificate, it would only try every 12 hours, so it shouldn't be too spammy.
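A minimal sketch of what "retry, but at most once per cycle" might look like, assuming a wrapper that sits in front of the real supplier. The names `lazySupplier`, `newLazySupplier`, `Get`, and `flakyFactory` are hypothetical, as is the `Supplier` interface; how this would fit into the cert manager is left open.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// Supplier is a hypothetical stand-in for a certificate supplier.
type Supplier interface {
	Name() string
}

type stubSupplier string

func (s stubSupplier) Name() string { return string(s) }

// lazySupplier creates its supplier on first use. If creation fails, it
// waits at least retryEvery before calling the factory again, returning
// the previous error in the meantime.
type lazySupplier struct {
	mu          sync.Mutex
	create      func() (Supplier, error)
	retryEvery  time.Duration
	supplier    Supplier
	lastAttempt time.Time
	lastErr     error
}

func newLazySupplier(create func() (Supplier, error), retryEvery time.Duration) *lazySupplier {
	return &lazySupplier{create: create, retryEvery: retryEvery}
}

// Get returns the supplier, creating it if needed and allowed by the
// retry interval.
func (l *lazySupplier) Get() (Supplier, error) {
	l.mu.Lock()
	defer l.mu.Unlock()

	if l.supplier != nil {
		return l.supplier, nil
	}
	now := time.Now()
	if l.lastErr != nil && now.Sub(l.lastAttempt) < l.retryEvery {
		return nil, l.lastErr // still backing off; don't hit the ACME service
	}
	l.lastAttempt = now
	s, err := l.create()
	if err != nil {
		l.lastErr = err
		return nil, err
	}
	l.supplier, l.lastErr = s, nil
	return s, nil
}

// flakyFactory simulates startup without network connectivity: it fails
// the given number of times and then succeeds.
func flakyFactory(failures int) func() (Supplier, error) {
	return func() (Supplier, error) {
		if failures > 0 {
			failures--
			return nil, errors.New("ACME directory unreachable")
		}
		return stubSupplier("lego"), nil
	}
}
```

Constructed with something like `newLazySupplier(createLego, 12*time.Hour)` (where `createLego` is a placeholder for the real factory), a bad config would cost one failed attempt per 12 hours rather than one per certificate check, while a temporary outage at startup would recover on the next cycle.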