Obtaining (additional) certificates in the background at startup?

MarcelWaldvogel commented 3 years ago

After adding some domains to DOMAINS, it can take quite some time until the new certificates are available and normal operation starts (or resumes, if you did a restart after changing DOMAINS).

Would it be possible to do the ACME requests while nginx is already serving the remainder of the domains? Where would be a good place to add this?

It seems that just backgrounding this and then running /bin/reconfig or /bin/renew does not work. At least it did not work for me when https-portal started up with one certificate missing because the DNS entry was not yet available: running the cron job did cause the new certificate to be generated, but the nginx configuration did not switch to the Let's Encrypt certificate. (I did not investigate thoroughly at the time, so maybe I just did something wrong.)

SteveLTN commented 3 years ago

It's a complex task. I tried it once but couldn't manage to do it without breaking other things.

One tricky thing is that I would like the Nginx process to be managed by s6-overlay, so that when it crashes, the whole Docker container crashes as well, allowing it to be monitored/restarted. That means that after the setup is done, it needs to stop its Nginx subprocess and let s6-overlay launch another instance of Nginx.

If you look at the code here, the domains are signed one after another in the setup phase, and so are their configuration files. I imagine that if we keep the configuration files in a data volume, they won't be dropped on the next launch. However, errors from previous sessions could live on and prevent Nginx from starting altogether, which could be problematic, especially when we have internal code changes.

Therefore I once decided it was too tricky a thing to do.

MarcelWaldvogel commented 3 years ago

Thanks for the information. I'll try to have a look at it in the upcoming weeks, but can't promise anything. Maybe a good idea pops up…

SteveLTN commented 3 years ago

Thanks. Feel free to communicate with me early if you come up with any good ideas. On this subject I don't want to compromise other aspects such as upgrade-compatibility and bringing down the whole container when there are errors.

MarcelWaldvogel commented 3 years ago

> upgrade-compatibility

I do not think that this will affect backward compatibility; i.e., anything that worked before should continue to work.

> bringing down the whole container when there are errors

Just to be sure: Do you want or not want to bring down the whole container when there are errors?

SteveLTN commented 3 years ago

Yes, I want to bring down the whole container when there are errors.

> I do not think that this will affect backward compatibility; i.e., anything that worked before should continue to work.

With the new setup flow, I imagine HTTPS-PORTAL will first try to run on the old configurations, as it doesn't know which domains will take longer to obtain certs for. Then it loops through the domains; for each domain:

it creates new HTTP configurations for ACME verification
it tries to obtain new certificates if necessary
it creates new HTTPS configurations using either new or old certs
it sends a signal to reload Nginx
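
The per-domain loop above could be sketched roughly as follows. This is only an illustration in Python, not HTTPS-PORTAL's actual code: the five callables (`write_http_config`, `needs_cert`, `obtain_cert`, `write_https_config`, `reload_nginx`) are hypothetical stand-ins for whatever the setup script really does.

```python
from typing import Callable, Iterable


def bring_up_domains(
    domains: Iterable[str],
    write_http_config: Callable[[str], None],
    needs_cert: Callable[[str], bool],
    obtain_cert: Callable[[str], None],
    write_https_config: Callable[[str], None],
    reload_nginx: Callable[[], None],
) -> None:
    """Process domains one by one while Nginx keeps serving the others.

    All callables are hypothetical placeholders for the real setup steps.
    """
    for domain in domains:
        write_http_config(domain)    # port 80 config used for ACME verification
        if needs_cert(domain):
            obtain_cert(domain)      # the slow step: talks to the ACME server
        write_https_config(domain)   # uses the new cert, or the old one
        reload_nginx()               # pick up this domain's new configuration
```

Because the reload happens once per domain, a slow or failing domain only delays the domains after it in the list, not the ones already being served.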

Imagine that, due to some bug, an old version of HTTPS-PORTAL generated wrong configuration files for xxx.example.com. Now we update HTTPS-PORTAL to fix the bug. But since HTTPS-PORTAL will try to launch Nginx with the old configurations before trying to obtain new certificates, it can fail to launch.

I think we need to at least hide this feature behind a flag. By default, HTTPS-PORTAL should just ignore all older configurations and create new ones from scratch. If you want new sites to be added seamlessly, you need to opt in via an environment variable.

MarcelWaldvogel commented 3 years ago

Thanks for the insights.

How about the following workflow? I am trying to get service for as many domains up as fast as possible.

1. Start with a default set of DH parameters (these are only the DH parameters, not keys). The file has a magic string somewhere, indicating that it needs to be replaced (e.g. REPLACE_ME outside the -----BEGIN/END DH PARAMETERS----- block).
2. Create HTTP configurations suitable for both normal operation and ACME verification at the same time. For each domain, create a configuration for:
   - Port 80 for HTTPS redirect plus .well-known handling
   - Port 443 for HTTPS serving, with the Let's Encrypt certificate, if available (fallback: self-signed)
   This is fast: only local operations, no expensive cryptographic operations.
3. Create a self-signed certificate (do we actually need this?)
4. Start nginx
5. Loop through all the domains without a certificate and create them. After each new certificate, tell nginx to reload (there is no domain-specific reload, as far as I can tell).
6. Loop through all the domains whose certificates will expire soon or have already expired, and renew them. Start with the oldest certificate. After each certificate that has already expired or whose expiry is imminent (e.g. within the next hour or so), reload immediately. Reload at the end if certificates have been updated but not reloaded yet.
7. If the default set of DH parameters is still in use (looking for the magic string, as described above), create a local set and reload.
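
As an illustration of the magic-string idea for the DH parameters, here is a minimal Python sketch. It assumes the marker is the literal text `REPLACE_ME` and that it sits somewhere outside the PEM block; the function name is made up for this example and is not part of HTTPS-PORTAL.

```python
BEGIN = "-----BEGIN DH PARAMETERS-----"
END = "-----END DH PARAMETERS-----"
MARKER = "REPLACE_ME"  # assumed marker text, as suggested in the workflow above


def is_placeholder_dhparam(text: str) -> bool:
    """Return True if the DH parameter file still carries the marker.

    Only text outside the BEGIN/END block is inspected. If no complete
    block is found, the whole text is searched instead.
    """
    start = text.find(BEGIN)
    end = text.find(END)
    if start == -1 or end == -1 or end < start:
        return MARKER in text
    outside = text[:start] + text[end + len(END):]
    return MARKER in outside
```

A setup script could call this on the contents of the DH parameter file after all certificates are handled, and only then spend time generating a local set.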

Would that be a good workflow or did I miss anything?

SteveLTN commented 3 years ago

Sounds like a good idea.

I have a few questions:

1) We still need to be able to bring the container down when there are failures. AFAIK, when Nginx is signaled to reload and the reload fails, it just doesn't do anything; it does not crash.

2) Why do we need to replace the first DH parameter?

3) We already have a self-signed certificate, which is used to power the default server block. We can reuse it:

i. for each domain:
   1. check if it has an SSL certificate. If not, copy the self-signed one to the target folder
   2. create Nginx configs for both 80 and 443
ii. start Nginx
iii. for each domain: if its certs are self-signed, about to expire, or invalid, re-sign with Let's Encrypt and send a reload signal
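
The condition in the last step (self-signed, expiring soon, or invalid) can be written down as a small predicate. The sketch below is Python and purely illustrative: `CertInfo`, `needs_signing`, and the 30-day renewal margin are assumptions made for this example, not values taken from HTTPS-PORTAL.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CertInfo:
    """Hypothetical summary of a certificate found on disk."""
    self_signed: bool
    not_after: Optional[datetime]  # None if the certificate could not be parsed


def needs_signing(
    cert: Optional[CertInfo],
    now: datetime,
    renew_margin: timedelta = timedelta(days=30),  # assumed margin
) -> bool:
    """Return True if the domain should be (re-)signed by Let's Encrypt."""
    if cert is None or cert.not_after is None:
        return True                      # missing or invalid certificate
    if cert.self_signed:
        return True                      # placeholder certificate in use
    return cert.not_after - now <= renew_margin  # expired or expiring soon
```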

I don't think we need to start from the oldest. Don't think the benefit would be huge.

SteveLTN commented 3 years ago

Actually, I realize that step 3.iii above is exactly the renew process.

configure Nginx (both HTTP and HTTPS) with either existing certificate or self-signed certificate
start Nginx
immediately trigger a renew process.

I still want renew to bring down the container though. I think maybe we can send a signal to kill Nginx if it fails?
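
One possible shape for "kill Nginx if the reload fails" is to test the configuration first and stop Nginx when the test fails. The Python sketch below uses the standard `nginx -t`, `nginx -s reload`, and `nginx -s stop` commands; the function names and the injectable `runner` are assumptions for illustration. Whether a stopped Nginx actually brings the container down depends on how s6-overlay is configured, which is not shown here.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(cmd).returncode


def reload_or_stop(runner: Runner = run) -> bool:
    """Reload Nginx if its configuration is valid; otherwise stop it.

    `nginx -t` is run first because a failed reload leaves the old
    configuration running instead of crashing.
    """
    if runner(["nginx", "-t"]) == 0 and runner(["nginx", "-s", "reload"]) == 0:
        return True
    # Stop Nginx so that a supervisor can notice the failure.
    runner(["nginx", "-s", "stop"])
    return False
```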

I feel like I need to do this myself; after all, I need to maintain it. I'll do some experiments over the weekend, and if I have something working I'll give you an alpha version to test.

SteveLTN commented 3 years ago

I released a new beta version, 1.18.0-beta.

This version basically does what I said in the previous comments. Do you want to try it out?

I still let the setup process finish obtaining all certificates and then stop its Nginx subprocess. Then s6-overlay takes over and maintains its own instance of Nginx. There is a very short outage when handing over control of Nginx.

One alternative I can think of is to modify setup so that it keeps running all the time and maintains its Nginx subprocess. This way, we wouldn't need the step of stopping Nginx briefly. However, the Ruby process would always be in memory, which I think is not a tradeoff I want to make. What do you think?

MarcelWaldvogel commented 3 years ago

Great, thanks!

I have been trying it on one machine, and it worked there. However, on the other machine, it fails with

nginx: [emerg] SSL_CTX_use_PrivateKey("/var/lib/https-portal/xyzzy.example.ch/production/domain.key") failed (SSL: error:0B080074:x509 certificate routines:X509_check_private_key:key values mismatch)

Will look into it later on a machine where I can test better.

SteveLTN commented 3 years ago

This error message appears because Nginx finds that the key and cert don't match. Can you paste the whole log for this domain? I'd like to know at which stage it went wrong: whether it's using the self-signed key together with real certs, or the other way around.
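
For anyone wanting to check a key/cert pair by hand, a quick way that needs only the Python standard library is to let `ssl.SSLContext.load_cert_chain` try to load the pair; it raises `ssl.SSLError` when the private key does not match the certificate. This is just a diagnostic sketch, and the function name is made up for this example.

```python
import ssl


def key_matches_cert(cert_path: str, key_path: str) -> bool:
    """Return True if the certificate and private key load as a pair.

    Returns False on a key/cert mismatch, and also when either file is
    missing or not valid PEM (ssl.SSLError is a subclass of OSError).
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except OSError:
        return False
    return True
```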

You can set the environment variable DEBUG=true to get slightly more info.

Thanks!

MarcelWaldvogel commented 3 years ago

On a fresh machine, it did not occur in my tests; i.e., under normal operation, everything works.

However, I did find the actual logs in the scrollback buffer and could recreate it: If ACME verification fails, then the new key is already in place, but not yet the certificate (in my case, it was caused by a not-yet propagated DNS entry for that virtual server):

reverse-proxy_1  | Verifying noxyzzy.example.fr...
reverse-proxy_1  | Traceback (most recent call last):
reverse-proxy_1  |   File "/bin/acme_tiny", line 198, in <module>
reverse-proxy_1  |     main(sys.argv[1:])
reverse-proxy_1  |   File "/bin/acme_tiny", line 194, in main
reverse-proxy_1  |     signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca, disable_check=args.disable_check, directory_url=args.directory_url, contact=args.contact)
reverse-proxy_1  |   File "/bin/acme_tiny", line 149, in get_crt
reverse-proxy_1  |     raise ValueError("Challenge did not pass for {0}: {1}".format(domain, authorization))
reverse-proxy_1  | ValueError: Challenge did not pass for noxyzzy.safebits.fr: {u'status': u'invalid', u'challenges': [{u'status': u'invalid', u'url': u'https://acme-staging-v02.api.letsencrypt.org/acme/chall-v3/205390078/DYqvSA', u'token': u'2JAtbCXyoMcaQBjBepLcSzLom_rnVC4ylY-2wPB89ko', u'type': u'http-01', u'error': {u'status': 400, u'type': u'urn:ietf:params:acme:error:dns', u'detail': u'DNS problem: NXDOMAIN looking up A for noxyzzy.example.fr - check that a DNS record exists for this domain'}}], u'identifier': {u'type': u'dns', u'value': u'noxyzzy.safebits.fr'}, u'expires': u'2021-02-15T15:44:21Z'}
reverse-proxy_1  | ================================================================================
reverse-proxy_1  | Failed to sign noxyzzy.example.fr.
reverse-proxy_1  | Make sure you DNS is configured correctly and is propagated to this host 
reverse-proxy_1  | machine. Sometimes that takes a while.
reverse-proxy_1  | ================================================================================
reverse-proxy_1  | Failed to obtain certs for noxyzzy.example.fr
reverse-proxy_1  | [cont-init.d] 20-setup: exited 1.
reverse-proxy_1  | [cont-init.d] 30-set-docker-gen-status: executing... 
reverse-proxy_1  | [cont-init.d] 30-set-docker-gen-status: exited 0.
reverse-proxy_1  | [cont-init.d] done.
reverse-proxy_1  | [services.d] starting services
reverse-proxy_1  | [services.d] done.
reverse-proxy_1  | nginx: [emerg] SSL_CTX_use_PrivateKey("/var/lib/https-portal/noxyzzy.example.fr/staging/domain.key") failed (SSL: error:0B080074:x509 certificate routines:X509_check_private_key:key values mismatch)

I guess the solution would be to store the new private key in a temporary file beside the self-signed one and only activate it after the challenge succeeds.

SteveLTN commented 3 years ago

Oh, now I know. It is because signing failed. Therefore the new private key doesn't match the self-signed certificate, causing Nginx to refuse to start/reload. Yes, your suggestion is absolutely correct. We should keep the ongoing private key separate, and only put it into the right place after the certificate is obtained. I'll find some time to fix it. Worst case, this weekend.
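
The "keep the ongoing key separate" idea can be sketched like this in Python. It is an illustration under assumptions, not the actual fix: `generate_key` and `obtain_cert` are hypothetical callables, and the certificate file names `signed.crt` and `signed.ongoing.crt` are invented for the example. Only `domain.key` comes from the log above.

```python
import os
from pathlib import Path
from typing import Callable


def sign_with_staged_key(
    domain_dir: Path,
    generate_key: Callable[[Path], None],   # hypothetical: writes a new key
    obtain_cert: Callable[[Path], bytes],   # hypothetical: returns PEM, raises on failure
) -> bool:
    """Activate a new key/cert pair only after signing has succeeded."""
    ongoing_key = domain_dir / "domain.ongoing.key"
    ongoing_cert = domain_dir / "signed.ongoing.crt"  # assumed file name
    live_key = domain_dir / "domain.key"
    live_cert = domain_dir / "signed.crt"             # assumed file name

    generate_key(ongoing_key)
    try:
        cert_pem = obtain_cert(ongoing_key)
    except Exception:
        # Signing failed: drop the staged key, leave the live pair untouched.
        ongoing_key.unlink(missing_ok=True)
        return False

    ongoing_cert.write_bytes(cert_pem)
    # Swap both files into place; Nginx should be reloaded only afterwards.
    os.replace(ongoing_key, live_key)
    os.replace(ongoing_cert, live_cert)
    return True
```

With this ordering, a failed ACME challenge leaves the existing key and certificate as they were, so Nginx can still start with the old (possibly self-signed) pair.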

SteveLTN commented 3 years ago

@MarcelWaldvogel I made some updates; it now uses a domain.ongoing.key until signing is successful.

Also, I downgraded s6-overlay to 2.1.0.2, because there is currently a bug that prevents the container from stopping when setup exits with an error.

P.S. I didn't do a release this time. I figured you'd just build it yourself anyway.

MarcelWaldvogel commented 3 years ago

Thanks! And yes, you seem to know me well by now :wink:; I am currently staying on my hand-made builds, but will update to the next release whenever it is out and I make some setup changes for our systems.

SteveLTN commented 3 years ago

Seems we've been testing it long enough. I merged the code into master, and released 1.18.0.

SteveLTN / https-portal

Obtaining (additional) certificates in the background at startup? #261