18F / api.data.gov

A hosted, shared-service that provides an API key, analytics, and proxy solution for government web services.
https://api.data.gov

Fix auto-SSL domains not getting renewed #380

Closed by GUI 7 years ago

GUI commented 7 years ago

We have a couple of agency subdomains managed by our automatic SSL setup that haven't been renewed on the expected schedule. The certs are still valid, but one will expire in 18 days and the other in 26 days. Both should already have been renewed automatically.
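For reference, a rough way to spot certs that are past their renewal window is to compare each certificate's notAfter date against a threshold. This is only a sketch: the 30-day threshold and the helper names are assumptions for illustration, not values taken from our setup or from lua-resty-auto-ssl.

```python
import socket
import ssl
from datetime import datetime, timezone

# Assumption: a cert counts as overdue once fewer than 30 days remain.
# This threshold is illustrative and is not taken from the issue.
RENEW_THRESHOLD_DAYS = 30


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the notAfter string of the certificate served by `host`."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a notAfter value such as 'Oct 18 00:00:00 2017 GMT'."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def is_overdue(not_after: str, now: datetime) -> bool:
    """True when the cert should already have been renewed under the assumed threshold."""
    return days_until_expiry(not_after, now) < RENEW_THRESHOLD_DAYS
```

In practice this would be called as is_overdue(fetch_not_after(host), datetime.now(timezone.utc)) for each agency subdomain, where host is whatever hostname is being checked.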

Based on the logs, our renewal script has been timing out during the renewal process. I'm not completely certain why the renewal has been timing out, but there have been several fixes in the lua-resty-auto-ssl library since we last upgraded our servers. One of those issues was also related to the timeout not being as long as expected (https://github.com/GUI/lua-resty-auto-ssl/issues/11#issuecomment-265650064). So we should upgrade lua-resty-auto-ssl, which I think will probably address this.

GUI commented 7 years ago

The 2 domains are now renewed, so we should be all set (and we should have gotten a fixed notification from the new monitoring setup in https://github.com/18F/api.data.gov/issues/379).

However, the issue was potentially a bit more nuanced, so in case we see some of these same issues crop up again, here are some more detailed notes:

GUI commented 7 years ago

Re-opening this, since this cropped up again for our USDA.gov subdomains (but this time the issue is consistently happening, so it's a bit easier to track down).

For a brief summary and background of what's happening:

Let's Encrypt has a useful page outlining the ins and outs of these CAA records: https://letsencrypt.org/docs/caa/ They note that timeouts are most likely a problem with the domain's authoritative name servers:

Sometimes CAA queries time out. That is, the authoritative name server never replies with an answer at all, even after multiple retries. Most commonly this happens when your nameserver has a misconfigured firewall in front of it that drops DNS queries with unknown qtypes. File a support ticket with your DNS provider and ask them if they have such a firewall configured.

After some testing, the issue does seem to be that CAA queries against usda.gov's name servers time out for some reason. So while ideally this would be resolved on usda.gov's end, I'm not sure how feasible it is to get that fixed.

But luckily, we now have a way to sidestep this issue, since we can now handle the CAA lookups on our end. As Let's Encrypt mentions, CAA validation follows CNAMEs:

CAA validation follows CNAMEs, like all other DNS requests. If www.community.example.com is a CNAME to web1.example.net, the CA will first request CAA records for www.community.example.com, then seeing that there is a CNAME for that domain name instead of CAA records, will request CAA records for web1.example.net instead.

So this means we can set up CAA records on our end for the domains that each agency CNAMEs to. This wasn't possible until a couple of days ago, when Route 53 added support for CAA records, but luckily the timing of all this worked out: adding CAA records on our end does indeed solve this and allows us to renew SSL certificates for usda.gov subdomains.
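As a rough illustration of what adding one of these records looks like, here's a sketch that builds the change batch Route 53 expects for a CAA record. The record name, TTL, and hosted zone ID are hypothetical placeholders, and this isn't necessarily how we actually manage our DNS.

```python
def build_caa_change_batch(
    record_name: str, ca: str = "letsencrypt.org", ttl: int = 300
) -> dict:
    """Build a Route 53 ChangeBatch that upserts a single CAA record.

    `record_name` is the name an agency subdomain CNAMEs to. The TTL
    default is an arbitrary choice for this sketch.
    """
    return {
        "Comment": f"Allow {ca} to issue certificates for {record_name}",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CAA",
                    "TTL": ttl,
                    # CAA value format: <flags> <tag> "<value>"
                    "ResourceRecords": [{"Value": f'0 issue "{ca}"'}],
                },
            }
        ],
    }


# Applying it requires boto3 and AWS credentials (zone ID is a placeholder):
#
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="ZEXAMPLE",
#       ChangeBatch=build_caa_change_batch("agency.domains.example.net."),
#   )
```

Keeping the batch construction separate from the API call makes it easy to review or print the change before anything is sent to Route 53.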

GUI commented 7 years ago

As noted above, we can now add CAA records in Route 53, so I think this should really be fixed now. We can sidestep any CAA misconfiguration issues on agencies' root domains by defining CAA records for our CNAME domain. I've added documentation on adding these CAA records as part of our subdomain setup, so this should be taken care of moving forward: https://github.com/18F/api.data.gov-ops#setting-up-agency-subdomains As a bonus, defining CAA records for all of our CNAME domains is a good idea anyway (both from a security perspective and in case agencies start adding CAA records of their own).

And one small thing to note mainly for reference is that we need to add CAA records to each of our agency-specific DNS records. We can't just add a single CAA record to something like domains.api.data.gov, since Let's Encrypt does not implement "tree-climbing" to check parent domains for the underlying CNAMEd value:

The CAA RFC specifies an additional behavior called “tree-climbing” that requires CAs to also check the parent domains of the result of CNAME resolution. Let’s Encrypt does not implement tree climbing because it makes expressing certain CAA policies impossible. After discussion on the IETF mailing list, we achieved consensus that tree-climbing in CAA is not ideal, and there’s an erratum for the CAA RFC removing it.
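To make the lookup order concrete, here's a toy model of it. It reflects my reading of the Let's Encrypt docs quoted above (check the requested name and then its parents, follow a CNAME at each step, but never climb the parents of the CNAME target). The hostnames and zone data are hypothetical, and this uses in-memory dictionaries rather than real DNS queries.

```python
from typing import Dict, List, Tuple

# Hypothetical data: an agency subdomain that CNAMEs to one of our names.
CNAMES: Dict[str, str] = {"api.agency.example": "agency.domains.example.net"}
LE_RECORD = ['0 issue "letsencrypt.org"']


def find_caa(
    name: str, cnames: Dict[str, str], caa: Dict[str, List[str]]
) -> Tuple[List[str], List[str]]:
    """Return (CAA records found, names queried in order) for `name`."""
    queried: List[str] = []
    labels = name.split(".")
    # Check the requested name, then each of its parent domains.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        queried.append(candidate)
        # Follow any CNAME chain from this name, guarding against loops.
        target, seen = candidate, {candidate}
        while target in cnames and cnames[target] not in seen:
            target = cnames[target]
            seen.add(target)
            queried.append(target)
        # Only the CNAME target itself is checked, never its parents.
        if caa.get(target):
            return caa[target], queried
    return [], queried


# CAA on the exact CNAME target: found before any agency parent is queried.
per_record = find_caa("api.agency.example", CNAMES, {"agency.domains.example.net": LE_RECORD})

# CAA only on the target's parent: never consulted, so the lookup moves on
# to the agency's own parent domains.
parent_only = find_caa("api.agency.example", CNAMES, {"domains.example.net": LE_RECORD})
```

In the first case the lookup stops at our record, so the agency's root domain is never queried. In the second, the record on the shared parent is ignored and the lookup falls through to the agency's parent domains, which is where CAA queries were timing out for us.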