The 2 domains are now renewed, so we should be all set (and we should have gotten a fixed notification from the new monitoring setup in https://github.com/18F/api.data.gov/issues/379).
Although, the issue was potentially a bit more nuanced, so in case we see some of these same issues crop up again, here are some more detailed notes:
DNS problem: query timed out looking up CAA for usda.gov. Things did just suddenly renew successfully, but I spent some time debugging this, so here are a few notes for reference in case this pops up again:
dig +trace -t type257 usda.gov times out for me), so this might point to some issues with USDA's name servers. I'm not entirely sure about all this, but here was a similar thread about DNS servers that timed out handling CAA records: https://community.letsencrypt.org/t/is-there-any-way-to-pass-validation-if-dns-server-does-not-respond-to-caa/27791 The verdict of that discussion seemed to be that you'd either need to fix the DNS servers or you couldn't use Let's Encrypt. So if we do see this again, we might need to discuss with USDA.
nal.usda.domains.api.data.gov CNAME, since the ancestor subdomains of usda.domains.api.data.gov and domains.api.data.gov didn't have valid DNS entries themselves.
Re-opening this, since this cropped up again for our usda.gov subdomains (but this time the issue is consistently happening, so it's a bit easier to track down).
For a brief summary and background of what's happening:
api.ers.usda.gov, since the CAA lookup that eventually checks the root usda.gov domain times out.
Let's Encrypt has a useful page outlining the ins and outs of these CAA records: https://letsencrypt.org/docs/caa/ They note that timeouts are most likely a problem with the domain's authoritative name servers:
Sometimes CAA queries time out. That is, the authoritative name server never replies with an answer at all, even after multiple retries. Most commonly this happens when your nameserver has a misconfigured firewall in front of it that drops DNS queries with unknown qtypes. File a support ticket with your DNS provider and ask them if they have such a firewall configured.
After some testing, the issue does seem to be that CAA queries against usda.gov's name servers time out for some reason. So while ideally this would be resolved on usda.gov's end, I'm not sure how feasible it is to get that fixed.
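For reference, the kind of check behind that testing can be sketched in Python. This is a minimal sketch using only the standard library, not the exact test that was run: it builds a DNS query for type 257 (the numeric type that dig's "-t type257" refers to) and reports whether a name server fails to answer in time. The function names and the default timeout are assumptions made for illustration.

```python
import socket
import struct

CAA_QTYPE = 257  # CAA record type, the "type257" used with dig
CLASS_IN = 1


def build_caa_query(name: str, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet asking for the CAA records of `name`."""
    # Header: transaction id, flags (0 = plain non-recursive query),
    # 1 question, 0 answer / authority / additional records.
    header = struct.pack("!HHHHHH", txid, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", CAA_QTYPE, CLASS_IN)


def caa_query_times_out(server_ip: str, name: str, timeout: float = 5.0) -> bool:
    """Return True if `server_ip` sends no reply to a CAA query within `timeout` seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_caa_query(name), (server_ip, 53))
        try:
            sock.recvfrom(4096)
        except socket.timeout:
            return True
    return False
```

Calling caa_query_times_out() with the IP address of one of usda.gov's authoritative name servers (not listed here) and the name "usda.gov" would return True if the server silently drops the query, which is the behavior Let's Encrypt describes above.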
But luckily, we now have a way to sidestep this issue, since we can now handle the CAA lookups on our end. As Let's Encrypt mentions, CAA validation follows CNAMEs:
CAA validation follows CNAMEs, like all other DNS requests. If www.community.example.com is a CNAME to web1.example.net, the CA will first request CAA records for www.community.example.com, then seeing that there is a CNAME for that domain name instead of CAA records, will request CAA records for web1.example.net instead.
So this means we can set up CAA records on our end for the domains that each agency CNAMEs to. This wasn't possible until a couple of days ago, when Route 53 added support for CAA records, but luckily the timing of all this worked out, and adding CAA records on our end does indeed solve this and allows us to renew SSL certificates for usda.gov subdomains.
As noted above, we can now add CAA records in Route 53, so I think this should really be fixed now. We can sidestep any CAA misconfiguration issues on agencies' root domains by defining CAA records for our CNAME domains. I've added documentation on adding these CAA records as part of our subdomain setup, so this should be taken care of moving forward: https://github.com/18F/api.data.gov-ops#setting-up-agency-subdomains And as a bonus, defining CAA records for all of our CNAME domains is a good idea anyway (both from a security perspective and in case agencies start adding CAA records of their own).
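As a rough illustration of what adding these records could look like, here is a small Python sketch that builds a Route 53 change batch with one CAA record per CNAME target. The function name, the TTL, and the choice of a single issue "letsencrypt.org" value are assumptions for illustration; the linked setup documentation remains the reference for what we actually configure.

```python
from typing import Dict, Iterable


def caa_change_batch(
    record_names: Iterable[str],
    ca_domain: str = "letsencrypt.org",  # assumed CA to authorize
    ttl: int = 300,                      # assumed TTL, adjust as needed
) -> Dict:
    """Build a Route 53 change batch that upserts a CAA record for each name."""
    # CAA record value format: <flags> <tag> "<value>"
    value = f'0 issue "{ca_domain}"'
    return {
        "Comment": "CAA records for agency CNAME targets",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "CAA",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": value}],
                },
            }
            for name in record_names
        ],
    }


if __name__ == "__main__":
    import json

    # One record per agency-specific name, not one on the shared parent.
    print(json.dumps(caa_change_batch(["nal.usda.domains.api.data.gov"]), indent=2))
```

The resulting dictionary follows the change batch shape that Route 53's ChangeResourceRecordSets operation accepts, so it could be saved as JSON for the AWS CLI or passed as ChangeBatch through an SDK, along with the hosted zone ID.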
And one small thing to note mainly for reference is that we need to add CAA records to each of our agency-specific DNS records. We can't just add a single CAA record to something like domains.api.data.gov, since Let's Encrypt does not implement "tree-climbing" to check parent domains for the underlying CNAMEd value:
The CAA RFC specifies an additional behavior called “tree-climbing” that requires CAs to also check the parent domains of the result of CNAME resolution. Let’s Encrypt does not implement tree climbing because it makes expressing certain CAA policies impossible. After discussion on the IETF mailing list, we achieved consensus that tree-climbing in CAA is not ideal, and there’s an erratum for the CAA RFC removing it.
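The lookup behavior described in the two quoted passages can be modeled with a short Python sketch. This is a simplified model based only on those descriptions, not Let's Encrypt's actual implementation: at each level of the requested name, CNAMEs are followed to their target, but the parents of a CNAME target are never consulted. The CNAME target ers.usda.domains.api.data.gov used in the usage example is a hypothetical name chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple


def follow_cnames(name: str, cnames: Dict[str, str]) -> str:
    """Follow a CNAME chain to its final target (guarding against loops)."""
    seen = set()
    while name in cnames and name not in seen:
        seen.add(name)
        name = cnames[name]
    return name


def caa_lookup_order(name: str, cnames: Dict[str, str]) -> List[str]:
    """Names whose CAA records are consulted, in order.

    Climbs the parents of the *requested* name only. Parents of a CNAME
    target are skipped, i.e. no tree-climbing on the target.
    """
    labels = name.rstrip(".").split(".")
    return [follow_cnames(".".join(labels[i:]), cnames) for i in range(len(labels))]


def first_caa(
    name: str, cnames: Dict[str, str], caa: Dict[str, List[str]]
) -> Optional[Tuple[str, List[str]]]:
    """Return (owner, records) for the first name in the lookup order that has CAA records."""
    for owner in caa_lookup_order(name, cnames):
        if owner in caa:
            return owner, caa[owner]
    return None
```

With cnames = {"api.ers.usda.gov": "ers.usda.domains.api.data.gov"}, a CAA record defined on ers.usda.domains.api.data.gov is found at the first step, so the lookup never reaches usda.gov. A record defined only on domains.api.data.gov is never consulted: first_caa() returns None in this model, and a real CA would go on to query ers.usda.gov and then usda.gov, where the timeout occurs.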
We have a couple agency subdomains that are managed with our automatic SSL setup that haven't been renewed on the expected schedule. The certs are still valid, but one cert will expire in 18 days, the other in 26 days. Both of these should have already been renewed automatically.
After looking at logs, it looks like our renewal script has been timing out during the renewal process. While I'm not completely certain why the renewal has been timing out, there have been several fixes in the lua-resty-auto-ssl library since we last upgraded our servers. One of the issues was also related to the timeout not being as long as expected (https://github.com/GUI/lua-resty-auto-ssl/issues/11#issuecomment-265650064). So we should upgrade lua-resty-auto-ssl, which I think will probably address this.
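To spot certificates that are overdue for renewal like the two above, a check along these lines could be used. This is a minimal Python sketch: the 30-day renewal window is an assumption for illustration rather than a confirmed lua-resty-auto-ssl setting, and the function names are hypothetical. The expiry string is expected in the format Python's ssl module reports as a certificate's notAfter field.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a cert's notAfter timestamp.

    `not_after` uses the format returned by getpeercert()["notAfter"],
    e.g. "Jan 19 00:00:00 2018 GMT".
    """
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def is_overdue(not_after: str, now: float, renew_within_days: float = 30.0) -> bool:
    """True if the cert is inside the (assumed) renewal window but still not renewed."""
    return days_until_expiry(not_after, now) < renew_within_days
```

Under that assumed window, certificates with 18 and 26 days remaining would both be flagged, while a freshly renewed certificate would not.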