debops / ansible-pki

Bootstrap and manage internal PKI, Certificate Authorities and OpenSSL/GnuTLS certificates
GNU General Public License v3.0

Redesign 'debops.pki' role, add external CA and ACME support #33

Closed drybjed closed 8 years ago

drybjed commented 8 years ago

Currently the role does not have all of the planned features and the documentation is missing, but the base is usable and the internal CA should work. More commits will follow before this PR is merged.

drybjed commented 8 years ago

The playbook that will be used to run this role is a little different, something like this:

- name: Manage Public Key Infrastructure
  hosts: [ 'debops_all_hosts', 'debops_service_pki' ]
  become: True

  roles:

    - role: debops.pki/env
      tags: [ 'role::pki', 'role::secret' ]

    - role: debops.secret
      secret_directories:
        - '{{ pki_env_secret_directories }}'
      tags: [ 'role::secret' ]

    - role: debops.pki
      tags: [ 'role::pki' ]

This arrangement is done so that debops.secret can prepare directories needed later by debops.pki.

sread commented 8 years ago

This seems great! I have a question about how ACME certificates and subdomains will be handled: currently I used acme-tiny to generate one certificate for each sub-domain. I'm not tied to this approach but my understanding is that Let's Encrypt doesn't support wildcard certs.

drybjed commented 8 years ago

At the moment Let's Encrypt doesn't support wildcards; last I heard it is planned, but who knows when it will be available.

For ACME integration, I want to use acme-tiny as well. I imagine that the trigger enabling ACME should at least be the debops.nginx role with enabled ACME support - otherwise there's no way for the acme-tiny script to publish the challenge to the server. Another factor should be availability of a public IP address (not a private one, not link-local, something that is reachable by Let's Encrypt servers). If a host is not publicly routable, there's no reason to call Let's Encrypt - firewall could forward the packets to an internal host, but that internal host doesn't know that.

So, let's say that the stars aligned: we have nginx configured and a public IP. In that case, the role initializes ACME support in each PKI realm by installing acme-tiny, creating the acme/account_key.pem private key and requesting the certificate using a separate pki-acme user account, which should write the certificate to acme/cert.pem along with the intermediate and root certificates. After that, the pki-realm script should switch the public certificates to the ACME ones and call hooks (not implemented yet) to notify daemons as needed, so that, for example, nginx can reload its configuration (not sure if that's needed, but probably yes).

I'm not sure if you noticed, but if you use dots in a PKI realm name, it will be treated as a domain and used accordingly. If you want to check what kind of requests can be generated, try this:

pki_realms:
  - name: 'example.com'
    domains: [ 'example.org', 'example.net' ]
    acme_subdomains: [ 'www', 'mail', 'xmpp' ]

This should be enough to generate a valid ACME CSR. I wonder about the scalability of passing thousands of subdomains through the Bash arguments, but for a small scale it should be fine...

drybjed commented 8 years ago

Progress update: ACME is in! The role checks whether a public IP address is available and whether nginx is installed with ACME support configured (debops.nginx sets everything up). If everything is in place, live Let's Encrypt certificates will be requested automatically, and enabled if the request was successful.

By default only the www. subdomain is specified as a SAN for Let's Encrypt certificates; you can specify more using the item.acme_subdomains list in the PKI realm configuration (see above).

If you want, you can switch to Let's Encrypt staging CA by setting:

pki_acme_ca: 'le-staging'

in the Ansible inventory. At the moment ACME certificates are not renewed automatically and services are not restarted on certificate change; I'll implement that later.

Have fun. :-)

sread commented 8 years ago

Okay, I gave it a first crack, but no luck so far. I had to force pki_acme: True in my inventory since for some reason ansible isn't returning my server's public IP in ansible_all_ipv4_addresses. Something to do with the peculiar floating public IP system my VPS provider uses perhaps.

Two things: 1) when I run the playbook suggested above, env/tasks/main.yml tries to reference the pki_env_secret_directories.j2 template, but it looks in env/tasks/lookup instead of env/templates/lookup; 2) I'm not sure if the realm logic is working right: I'm getting 3 pki:known-realms ("example.com", "service", "domain"), and pki:realm is set to "domain" instead of what I would expect, "example.com". I tried removing both the existing pki secrets and pki facts.d but no change.

drybjed commented 8 years ago

@sread Do you have only private IP addresses? In that case you need to force ACME, yeah. Hopefully server can be accessed from the outside.
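
For a host like this, where the public address does not show up in the Ansible facts, the inventory override discussed above might look like the sketch below. Both variable names appear earlier in this thread; combining them, and the file location, are only a suggestion so that a forced test run goes against the staging CA rather than the live one.

```yaml
# Hypothetical location: inventory/host_vars/<hostname>/pki.yml
# Force ACME support even though no public IP address was detected.
pki_acme: True

# Use the Let's Encrypt staging CA while testing.
pki_acme_ca: 'le-staging'
```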

As for the lookup issue, it might be related to a bug in Ansible older than 2.0, where lookup("template") searched for templates in files/ directory instead of templates/. I guess that I could make a symlink... But since Ansible 2.0 has been released as stable, can you check if it works on that?

PKI realms are now split among multiple variables: pki_realms, pki_group_realms, pki_host_realms, pki_default_realms. Depending on which set of variables you use, you can have different realms on different hosts.

The default realm is set by pki_system_realm variable (the corresponding variable from the old version has been renamed to not collide with the new pki_default_realms). If you haven't changed the default setting, the domain realm will be used as the default one.
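
A minimal inventory sketch of that split, assuming a realm that should exist on a single host and become its default. The variable names come from this thread; the file location and the choice of example.net are illustrative only.

```yaml
# Hypothetical location: inventory/host_vars/<hostname>/pki.yml
# Realm defined only for this host.
pki_host_realms:
  - name: 'example.net'

# Use that realm as the default on this host instead of 'domain'.
pki_system_realm: 'example.net'
```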

ansible_local.pki.known_realms contains a list of realms configured on the remote host, this can be used by other roles to check if a specific realm is available and decide (or not) to use it, like this:

when: realm_name in ansible_local.pki.known_realms
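
Put into a complete play, such a check could look like the sketch below. The realm_name value is a placeholder, and the extra "is defined" guards are an assumption about how to stay safe on hosts where the pki facts have not been written yet; they are not taken from the role itself.

```yaml
---
- hosts: all

  vars:
    # Hypothetical realm to look for
    realm_name: 'example.com'

  tasks:
    - name: Report that the PKI realm is available
      debug:
        msg: 'PKI realm {{ realm_name }} is configured on this host'
      # Conditions in a list are combined with a logical AND
      when:
        - ansible_local is defined
        - ansible_local.pki is defined
        - realm_name in ansible_local.pki.known_realms
```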

drybjed commented 8 years ago

@sread Any progress with the new role? Did you make it work?

sread commented 8 years ago

Not yet, but progress I think. I have the pki_host_realms set using the example above, and updated pki_system_realm to point to it. However, the /etc/pki/realms/example.com folder is missing the default.crt and default.key files which the default webserver in the nginx role expects (and so nginx will not load). Perhaps the update_symlinks commands are not working if the file doesn't already exist?

I haven't had a chance to install ansible 2.0 yet, I'll let you know about that when I do. I definitely have a public IP but it's not accessible to ansible as far as I can tell.

sread commented 8 years ago

Okay I bit the bullet on Ansible 2.0, and I can confirm that the lookup bug is not present. So I guess debops.pki will have an Ansible 2.0 requirement :)

drybjed commented 8 years ago

With how it goes right now, I guess the role should be ready to merge (apart from docs) in a few days. I'm not sure if the lookup issue is the only one that requires Ansible 2.0 (if you want, you can try symlinking templates/lookup/ to files/lookup/ and see if the role works in Ansible 1.9), but perhaps since Ansible 2.0 is officially stable, we can move to it full time, what do you think? In the meantime I would like to find and fix bugs in the role.

First of all, do you have domain and service realms set up correctly? They should be based on your host's default domain, so if you use that you should be good to go.

The pki-realm script (the one that works on remote hosts) tries to keep a clean realm, doesn't create missing symlinks, but if you don't have default.key then something's wrong since that one is created on the remote host itself. Make sure that you have current version of the PR checked out. Can you show me output of tree /etc/pki/realms/example.com as well as contents of the config/realm.conf file? That should give some hints about what could be the issue.

drybjed commented 8 years ago

You can use the playbook below to check what IP addresses Ansible will detect as "public". If you have any IP addresses in the output but role still doesn't want to manage ACME automatically, let me know:

---
- hosts: all

  vars:
    ip_list: '{{ (ansible_all_ipv4_addresses +
                  ansible_all_ipv6_addresses) | ipaddr("public") }}'

  tasks:
    - name: Show public IP addresses
      debug:
        msg: 'Public IPs: {{ ip_list | join(", ") }}'

drybjed commented 8 years ago

@sread Here are some more ideas which could help you with debugging. Obviously all of this will be added in the documentation, but since there's none right now, here we go:

What OS do you use on Ansible Controller? The Certificate Authorities are executed there, important pieces of the puzzle are Bash (minimum 4.2.2, I think) and OpenSSL, rather recent version. If you use MacOS X you will probably need to upgrade, recent Debian or Ubuntu should be OK.

debops.pki uses a session token (a random string) to ensure that the certificate request uploaded to the CA is genuine - a random challenge password is added to the request via the config file. However, since you had issues that prevented the role from finishing, I bet that the configuration file and request were generated and sent to the Ansible Controller, and then something failed. The thing is, right now this state is not reset automatically; to perform a "reset" you need to remove the internal/gnutls.conf and internal/request.pem files from the realm directory. I think that I can detect that state and reset the realm automatically, I'll check that tomorrow.

To debug the realms, login to the remote host and on root account run:

bash -x /usr/local/lib/pki/pki-realm run -n realm-name

This should make the script do all the self-checks, create missing files and symlinks, and so on.

To debug the CA, on Ansible Controller cd into the secret/pki/ directory and run:

bash -x ./lib/pki-authority sign-by-host <host-fqdn>

This command should sign all incoming requests and prepare the files to download to the remote hosts. It's not as comprehensive as pki-realm run but should help you find any issues with the CA.

sread commented 8 years ago

Quick Answers: Debian testing with Ansible v2.0.0.1 from experimental

I removed the example.com setting and deleted pki.fact and /etc/pki/realms on the host. Then I ran the pki playbook from above: now I have a domain realm properly configured (all default files seem to be there). nginx now starts, although something is still not quite right.

I think for the acme stuff I was trying to do something weird: I wanted to only get an acme cert for the subdomain (sub.example.com), as the domain (example.com) DNS doesn't point to this server. But I notice in the acme/error.log (which wasn't there before) that it is failing testing for control of example.com (as it should).

I can't change control of example.com to this server yet, but is there a way to test this role on another secondary domain I do control, like example.net? Then I could provide more useful testing.

Also, sorry if this is not an acceptable use-case, but I would like to be able to generate only the subdomain as above.

drybjed commented 8 years ago

OK, so... ACME requests won't be performed unless you remove the acme/error.log file - it's designed that way so repeated failed attempts won't exhaust your limits in Let's Encrypt, so be wary. :)

I'm not sure that Let's Encrypt allows registering only subdomains, I think that the CA checks the main apex domain as well. I guess you could ask on the #letsencrypt IRC channel or their forum to confirm. I can do that tomorrow (2 AM here, going to sleep).

For the example.net domain, you should be able to just create the realm similarly to the example above. By default debops.pki will try to request LE certificates for all realms, but since ACME is not enabled automatically on your host, you can leave it disabled globally and enable it only for specific realms:

pki_host_realms:
  - name: 'example.net'
    acme: True
    acme_subdomains: [ 'www', 'sub' ]

This should create a correct request; you can inspect it with:

openssl req -in acme/request.pem -text -noout

Obviously for ACME requests to pass successfully, Let's Encrypt needs to have access to your host's web directory, this concerns the webserver (which should be covered by debops.nginx) as well as your DNS.

drybjed commented 8 years ago

BTW, you can obviously make realms with more than 2 DNS levels, like so:

pki_host_realms:
  - name: 'internal.example.net'

sread commented 8 years ago

Okay I like that design for acme/error.log. Don't want to waste requests.

I can confirm that sub.example.com requests are allowed without control of example.com as I have already successfully generated a cert for a subdomain previously by manually using acme-tiny.

Okay I can test the host_realm example however I'm not quite sure how to disable acme support for other realms without completely turning acme off (which is what pki_acme seems to do). I'll look.

...We are in very different time-zones!

sread commented 8 years ago

Okay making some real progress now, I think I'm starting to understand the logic behind the pki realms.

Setting up a pki_host_realms for example.net (secondary domain) worked perfectly, even with the main example.com failing/failed. Had to specify pki_realm in the nginx_servers block.
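
The nginx side of that setup could look roughly like the sketch below. The pki_realm key in nginx_servers is the one mentioned above; the name key and its list form are assumptions about the debops.nginx server format and should be checked against that role's documentation.

```yaml
# Hypothetical location: inventory/host_vars/<hostname>/nginx.yml
nginx_servers:

    # 'name' as a list of domains is an assumption about debops.nginx
  - name: [ 'example.net' ]

    # Point this server at the matching PKI realm instead of the default one
    pki_realm: 'example.net'
```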

Setting up another host_realm for sub.example.com, separate from the domain realm seems like the right approach, but I can't seem to set acme_subdomains and acme_default_subdomains to be empty: I don't want www.sub.example.com and it's not active in DNS so the acme challenge obviously fails.

Overall, I think this is a great improvement to the pki system, I'd be happy to help edit/write some documentation, especially something explaining the logic behind the pki realms.

drybjed commented 8 years ago

@sread Great!

In the previous version of the role, the concept of "realms" was to allow for different services to use different sets of certificates without the need to point to the .crt and .key files separately. It was not intuitive, so I changed it so that each realm handles only one key and the set of certificates bound to that key, each certificate from a different CA (internal, external, ACME, could be others in the future).

Now, multiple PKI realms allow you to create different certificates for different domains (just as you now did), and each service can use different realm. The default realms are also valid, they use ansible_domain and ansible_fqdn variables as a base. Additionally, when the new debops.pki role is merged, I would like to add code in debops.nginx that will automatically pick the realm with the same name as the domain the config is using, if it's available.

I saw your attempt at the docs on the wiki, very nice although not complete. For example, take note that when you set up a new host, debops.pki is executed long before debops.nginx is installed, which means that you will need to run the service/pki playbook at least once afterwards if you want ACME certificates.

I've noticed the issue with the default subdomains and no way to "reset" them, but I think that I have a solution to this and I'll try to implement it.

sread commented 8 years ago

Okay, I got sub.example.com without www.sub.example.com working, but only by manually editing acme_subdomains and acme_default_subdomains in config/realm.conf on the host. So I still don't know the proper inventory variable for that configuration.

As a note for others: to successfully request a Let's Encrypt cert after a failed request, you must remove both acme/error.log and acme/request.pem. In addition, if you've changed config/realm.conf, you also have to remove acme/openssl.conf so it is regenerated.
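
The cleanup described in this note can be sketched as a small play. The realm path follows the /etc/pki/realms/example.com layout mentioned earlier in the thread; the realm_dir variable is a hypothetical helper, and removing acme/openssl.conf is only needed when config/realm.conf has changed.

```yaml
---
- hosts: all
  become: True

  vars:
    # Hypothetical helper variable, adjust to the realm being reset
    realm_dir: '/etc/pki/realms/example.com'

  tasks:
    - name: Remove leftover ACME state so the next run retries the request
      file:
        path: '{{ realm_dir }}/acme/{{ item }}'
        state: absent
      with_items:
        - 'error.log'
        - 'request.pem'
        # Only needed if config/realm.conf was changed
        - 'openssl.conf'
```

Since every retry counts against the Let's Encrypt limits, it is probably worth running this only after the cause of the failure has been fixed.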

drybjed commented 8 years ago

@sread You should now be able to disable the default domain and subdomains from the certificates, like this:

pki_realms:
  - name: 'example.com'
    acme_default_subdomains: []
    acme_subdomains: []

At the moment I think that I've finished with the set of arguments, I think now's a good moment to find out which ones could be removed. The issue with that is that scripts have several configuration sources:

Next up, add scheduling for renewing the ACME certificates... And I guess documentation, if nothing else crops up. Do you think that a short document explaining the PKI concepts (how certificates are created, how it's all used, etc.) would be useful to get the general gist of how X.509 works, if someone does not know it, or should I skip it and write only about the implementation in debops.pki?

sread commented 8 years ago

Very nice! I tested with an existing realm and with a new realm with no subdomains. It works, although I find it a bit fragile: the first run I had a few failures (forgot to run the special playbook to setup the private folders instead of service/pki, connection failure) and I was left with a realm.conf that went back to the default www subdomain. However, when I removed the new realm from the host and ran the correct playbook, it worked perfectly first run.

A few other thoughts:

Thanks!

sread commented 8 years ago

Just read the new debops-playbooks "Getting Started" docs, they are fantastic! All the problems I encountered are described well there. You can ignore my above comment about the domain not being set, it's explained sufficiently.

drybjed commented 8 years ago

@sread For the startup variables in the scripts, I try to make them the same as in defaults/main.yml file.

The common.yml playbook will also be updated to include the debops.pki/env "subrole", here's the most likely version: https://gist.github.com/drybjed/00ec651319990567463e And yeah, the debops.pki/env and debops.secret roles need to be executed before debops.pki to create missing directories, otherwise role will error out when it tries to copy files to remote hosts.

The debops.nginx role, and perhaps others, will check if a PKI realm with the same name as the domain you set up is available - if it is, certificates from that will be preferred, otherwise the default realm will be used.

sread commented 8 years ago

That all sounds great to me.

drybjed commented 8 years ago

@sread I've added a bit of the documentation (general explanation of PKI realms, how ACME integration works). You might want to check it out and see if anything else could be added.

sread commented 8 years ago

Had a look over the documentation, it's great. It does a really nice job of explaining the overall pki model and many of the pain points I encountered during testing. I also like the folder trees showing expected layout at different stages of configuration, that is a great troubleshooting resource.

Here are a few points I thought were unclear:

Minor nitpick in pki-realms.rst, should be "its simplified directory structure" not "it's" :) And "To avoid possible conclusion," I think is supposed to be "confusion".

Thanks

drybjed commented 8 years ago

@sread Thanks for the spellchecking :-). It's a good idea to point to different parts of the documentation, I'll look into that.

As for item.acme_default_subdomains and item.acme_subdomains, yes, you still need to override both of them - the problem with these comes from the fact that the script tries to merge two configuration sources together - one from the defaults in the script itself, and one from the Ansible configuration. In the main role defaults you can find pki_acme_default_subdomains, which defines the "base" value which can be overridden by item.acme_default_subdomains if necessary per PKI realm. I imagine that item.acme_subdomains might be used in the inventory more, if you set up specific realms for each website. I suppose it's still kinda complicated... Right now I want to finish the documentation and merge the new role into master, so that more users can try it more easily. I imagine there might be some changes before 2.0 is released, we'll see.
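
As a sketch of how the two layers fit together, assuming the semantics described above (a role-wide base list plus per-realm overrides): the realm below drops the default www. entry and requests one extra subdomain. The value shown for pki_acme_default_subdomains is a guess based on the "only www." default mentioned earlier, not a copy of the role's defaults, and the mail subdomain is just an example.

```yaml
# Role-wide base value (assumed to match the default behaviour)
pki_acme_default_subdomains: [ 'www' ]

pki_realms:
  - name: 'example.com'

    # Override the base list for this realm only: no www.example.com
    acme_default_subdomains: []

    # Additional subdomain to include: mail.example.com
    acme_subdomains: [ 'mail' ]
```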

sread commented 8 years ago

Yes, I'm fine with the behaviour of needing to override both: I think my use-case of needing to drop the www. default is rare enough to handle that way. I only meant to point out that the documentation should reflect that. Right now it suggests that item.acme_subdomains is sufficient to completely override and control the subdomain list and that is not strictly true.

drybjed commented 8 years ago

@sread Good point, I'll update the explanation to reflect that.

drybjed commented 8 years ago

@sread Just a heads up, I've added more documentation and cleaned up default variables. Not everything is documented yet, but I think that the role works fine and is good for merging, so that more people can test it. What do you think? Should I merge it now and add docs later, or is documentation more important before the new code is used? The merging will require updates to debops-playbooks so that needs to be coordinated as well.