debops / ansible-pki

Bootstrap and manage internal PKI, Certificate Authorities and OpenSSL/GnuTLS certificates
GNU General Public License v3.0

Remove the pain points when managing ACME certificates #117

Open bfabio opened 6 years ago

bfabio commented 6 years ago

Maybe it's just me, but the whole system seems really brittle and every time it breaks it makes me wish I could just run certbot and be done with it.

/usr/local/lib/pki/pki-realm is a ~2500-line bash script and it's a pain to debug when something goes wrong.

This could be an umbrella bug to improve the whole experience. I think the main points to tackle are:

drybjed commented 6 years ago

I agree, it's a mess. The whole PKI realm concept was created to provide a standardized way to access the X.509 certificates by services - one concept being a support for multiple Certificate Authorities. ACME support was kinda-sorta bolted on there, when it works, it works, but getting it to work might be a pain sometimes.

I think that the whole concept of a PKI realm should be moved out of an Ansible role into its own separate project. Python comes to mind first, but maybe Go would be easier to handle? I'm not sure yet. Having PKI realm as a separate project could help its development - splitting parts of it as separate plugins, one of which would be support for ACME.

bfabio commented 6 years ago

@drybjed I'm wondering if making a generic pluggable system is the right thing to do. I would much prefer a system where Let's Encrypt is the first-class citizen and everything else comes after.

I'm afraid that, while laudable, that approach would penalize the most common setup in favor of a flexibility that a tiny minority of users really need.

Just a thought, I'm not against it, but I think the LE setup should be an unstoppable tank that Just Works(TM) every time.

drybjed commented 6 years ago

@bfabio What about internal networks? Let's Encrypt certificates work well at frontend hosts, webservers, public stuff, but setting up an internal network with LE is unfeasible. Not all hosts are reachable from the public Internet, but they still require encrypted communication. Also keep in mind that the Let's Encrypt CA has rate limits - I destroy and create hosts sometimes multiple times a day, and relying only on LE certificates I would hit its rate limits within hours. During DebOps development I don't use Let's Encrypt certificates at all, because all my hosts are on a private network behind NAT, but I still require X.509 certificates to work correctly, so that when the roles are deployed in a public production environment, they work in the same way.

It seems that Let's Encrypt became an instant hit in the webdev/HTTP community. That's great, but what about other services? SMTP, IMAP, LDAP, MQTT, AMQP, are they not first class citizens? Should they just stick to self-signed certs set up by hand? For me all different CA models supported by debops.pki (ACME, external CA, internal CA, selfsigned) are on equal footing here.

From the point of view of an application that uses X.509 certificates, there's no difference between internal DebOps CA certificates, Let's Encrypt certificates, or any other CA certificates. Currently the debops.pki role supports each of these models in the same way. Some of this is currently crude and unwieldy, like dealing with Let's Encrypt errors, but that's just an implementation detail. The script could definitely handle Let's Encrypt issues better (send an e-mail to the configured admin address, have a better algorithm to handle errors and error.log, hell, change the configuration file to YAML or INI to drop the requirement of bash 4.x so that Mac OS X users don't have to fiddle with a Homebrew bash installation...).

"Unstoppable tank that 'Just Works(TM)'" - as long as the minimum requirements are met (the debops.nginx role has configured nginx on the host, the DNS configuration is propagated, the host has a public IP address reachable from the Internet, a PKI realm with the desired domain is configured), the Let's Encrypt support in debops.pki should 'Just Work'. If you have issues, once you resolve them, you can forget about LE certificates. Check the certs on the https://debops.org/ website - I haven't messed with them since the host was created, about 1 year ago now. It "Just Works (TM)". The host updates the certificates by itself, I don't think about it.

@bfabio, you posted a todo list for the changes you would want to see to make the ACME/Let's Encrypt support better. That's great! I'm currently working on updates to the DebOps mail stack - Postfix, OpenDKIM, SPF, OpenDMARC, perhaps rspamd a bit later. If you want to help with debops.pki development, that's great news to me. Looking forward to some pull requests. :-) If you need some clarification about how the pki-realm script works, let me know.

amette commented 6 years ago

To me the main thing that makes the ACME configuration unnecessarily complicated is the choice of default Subject/CN (Common Name) and SANs (Subject Alternative Names). The domain name is used as CN, imho it should be the host's fqdn. That is the only thing that can with reasonable certainty be assumed to point to the host. Using the domain name as the CN falls apart as soon as there is more than one host in the domain.

To make Let's Encrypt work the way I expect, I usually put the following into ansible/inventory/group_vars/all/pki.yml:

pki_realms:
  - name: '{{ ansible_fqdn }}'
    acme_default_subdomains: [ ]
    acme: True
    acme_ca: 'le-staging'

With this, it "just works" out-of-the-box, no matter if it is a one-server domain or a bigger cluster. I wouldn't actually even be scared to use le-live right away, but better safe than sorry. So once this looks good, I can set acme_ca to le-live in the host-specific inventory file. The resulting certificate can be used for SMTP, IMAP, XMPP, etc. As soon as I have my service working properly with the fqdn, I can point the DNS for the corresponding subdomain (mx., mail., smtp., imap., xmpp., jabber., etc.) to my machine and add it to acme_domains in host_vars.
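
A minimal sketch of that second step, for illustration only. The file path and the mail.example.org name are hypothetical, and it assumes that a host-level pki_realms list replaces the one from group_vars/all, so the whole realm is repeated. Listing the FQDN explicitly in acme_domains is also an assumption; how that parameter interacts with the realm name should be checked against the role documentation.

```yaml
# ansible/inventory/host_vars/<hostname>/pki.yml  (hypothetical path)
# Assumption: this list overrides pki_realms from group_vars/all for this host.
pki_realms:
  - name: '{{ ansible_fqdn }}'
    acme_default_subdomains: [ ]
    acme: True
    # Switched from 'le-staging' once the staging certificate looked good.
    acme_ca: 'le-live'
    # 'mail.example.org' is a placeholder; its DNS record must already
    # point at this host before the certificate is requested.
    acme_domains: [ '{{ ansible_fqdn }}', 'mail.example.org' ]
```

Keeping le-staging in group_vars and le-live only in host_vars means a new host never touches the production rate limits until someone opts it in.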

Also I tried to set up Let's Encrypt within the 'domain' realm for quite a while, which I eventually realised is just not gonna work out. I think the documentation could be a bit clearer about having to use a dedicated realm for Let's Encrypt.

On the other hand: if the ACME integration worked as explained above (use the fqdn as CN and don't assume any sub-domains), one could just configure the realm 'domain' to use ACME and all would work out of the box. I'm not completely sure though what other ramifications this would have, as it would effectively kill the internal CA iiuc.

tl;dr: Changing the default values for CN and SAN should make ACME certificates more straightforward to use. No clue about any potentially associated gremlins though.

drybjed commented 6 years ago

At some point I noticed that adding arbitrary subdomains to ACME certificates by default, namely www., was an issue in cases like this, and I changed the default to not include any subdomains. In other words, you can create a FQDN-based realm with an ACME certificate like this:

pki_realms:
  - name: '{{ ansible_fqdn }}'

By default, if debops.nginx is set up, and a host has a public IP address, the debops.pki role will try and request an ACME certificate for all configured realms, unless disabled.
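
The "unless disabled" case might look like the sketch below. It assumes the per-realm acme parameter shown earlier in this thread also accepts False; the realm name is hypothetical.

```yaml
pki_realms:
  # Hypothetical realm for internal services; ACME is switched off,
  # so no certificate request is sent to Let's Encrypt for it.
  - name: 'internal.example.org'
    acme: False
```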

The domain realm has ACME specifically disabled, mostly due to the above reasons. Not all hosts managed by DebOps have a webserver configured, and not all of them have public IP addresses, but you still would want connections secured with TLS, right? That's why I don't think that the domain PKI should be removed.

And it's best if you don't mess with the domain PKI but create a separate one - the domain certificates, even if clients don't have their respective Root CA installed, can be used by various services in the cluster for secure communication between nodes. Think LDAP, connections to the remote database from applications, what have you. You can set up custom PKI with ACME certificates on publicly-accessible nodes of the cluster and point the services accessed by the clients to them.

Actually, debops.nginx has specific support for this use case. If you create a PKI realm based on the host's FQDN, or the host's domain, the debops.nginx role, during configuration generation, will check if a FQDN or domain-based PKI realm exists, and it will be used automatically. In other words, if you create a FQDN PKI realm and afterwards re-run debops.nginx, it should automatically switch to it if any servers are configured with that FQDN. It can also be any domain name, or host name, of course, not just the values detected by Ansible.

So, the use case you want should already be implemented. Of course for this you need to specifically enable the '{{ ansible_fqdn }}' PKI realm, but due to various rate limits of Let's Encrypt, and other factors mentioned earlier, I don't think that ACME support like this can be enabled by default. Maaaybe, with some more specific logic that enables the FQDN-based PKI in specific situations.

prk0ghy commented 3 years ago

> I agree, it's a mess. The whole PKI realm concept was created to provide a standardized way to access the X.509 certificates by services - one concept being a support for multiple Certificate Authorities. ACME support was kinda-sorta bolted on there, when it works, it works, but getting it to work might be a pain sometimes.
>
> I think that the whole concept of a PKI realm should be moved out of an Ansible role into its own separate project. Python comes to mind first, but maybe Go would be easier to handle? I'm not sure yet. Having PKI realm as a separate project could help its development - splitting parts of it as separate plugins, one of which would be support for ACME.

Thank you for your work on this amazing project. I am currently trying to get it to work for me, but PKI is a huge pain point (at least for me). Maybe redeveloping the PKI in Go is not even necessary, since something like that already exists:

https://github.com/smallstep/certificates

Maybe we can get this integrated into DebOps?

ypid commented 3 years ago

@prk0ghy I support this. We should not implement our own certificate management again. I would say it was a solid way to learn how PKI works, both for @drybjed who implemented it and for me spending one month reviewing it. Now that we do understand it, we can compare other solutions better.

https://awesomeopensource.com/projects/certificate-authority seems to be a good list.

drybjed commented 3 years ago

Looking at my 2017 comment from 2021 brings a totally new perspective to this issue. :-) The problem with the current PKI implementation is that it is "lopsided" and depends entirely on the remote hosts. The environment we can work with on the Ansible Controller is limited, so I did what I could back then and just relied on the remote hosts to provide initial information about the domain(s) we work with, what the CA certificate should include, etc.

Today, while working on re-implementing the debops scripts, I imagine that the internal CA part of the pki role would be redesigned to use the debops pki subcommand to perform its operations. That way we can implement it in Python and we have control over what is executed on the Ansible Controller. And we can add support for other software as well, such as step-ca. I'm currently swamped by other stuff at work, but hopefully I'll have some free time during summer to work on this more.

One problem with this is finding a way to have internal CA management without the debops scripts installed, so that the pki role can still function properly. I guess that we can just provide basic self-signed certificates on the remote hosts and tell the users to install the debops scripts to have a fully-fledged internal CA. The remote side could still work independently, handling self-signed and ACME-based certificates.

ypid commented 3 years ago

"Don’t roll your own crypto". There is still #106. There has to be an existing tool we can use.

prk0ghy commented 3 years ago

Would it be useful to compile a list of features the new pki should have? I think it would be easier to implement if we know exactly what it should be able to do.

drybjed commented 3 years ago

Here are some things I would like to address from the current pki role included in DebOps:

prk0ghy commented 3 years ago

@ypid I went through the list and came up with these candidates, although I think step-certificates and step-cli are the way to go.

https://github.com/NLnetLabs/krill
https://github.com/letsencrypt/boulder
https://github.com/dogtagpki/pki/wiki/Certificate-Authority
https://github.com/cloudflare/cfrpki
https://github.com/fm4dd/webcert
https://github.com/cloudflare/cfssl
https://github.com/smallstep/certificates