18F / api.data.gov

A hosted, shared service that provides an API key, analytics, and proxy solution for government web services.
https://api.data.gov

Allow setting up SSL for agency domains with SNI #269

Closed GUI closed 8 years ago

GUI commented 9 years ago

Currently, when we set up an agency's API subdomain with api.data.gov (e.g., api.agency.gov), we use an Amazon Elastic Load Balancer (ELB) to handle the SSL. We do this because it's the simplest, most compatible way to get SSL set up without acquiring lots of static IPs or running our own load balancer. However, the slight downside of this approach is that each new ELB incurs a fixed cost ($18/month) simply for existing, regardless of traffic. Ideally we wouldn't incur this fixed monthly cost for every new agency subdomain we set up, which is where SNI could potentially help. SNI allows a single server to handle SSL for multiple domain names without needing multiple IPs.

Amazon doesn't currently support SNI directly within the ELB, but it's possible to keep using an ELB by switching it to TCP mode so it passes the HTTPS traffic through to our web servers. Our web servers would then handle the SSL certificates and decryption with nginx. So this would mainly just shift where the SSL certificates get installed, and where the initial decryption happens, from the ELB to our servers.
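As a rough sketch of what SNI-based termination in nginx looks like (the agency domains and certificate paths below are hypothetical), each agency domain gets its own `server` block on the same IP and port, and nginx picks the certificate based on the SNI hostname the client sends:

```nginx
# One IP/port, multiple certificates: nginx matches the client's SNI
# hostname against server_name to select the right certificate.
server {
    listen 443 ssl;
    server_name api.agency-one.gov;

    ssl_certificate     /etc/nginx/ssl/api.agency-one.gov.crt;
    ssl_certificate_key /etc/nginx/ssl/api.agency-one.gov.key;

    location / {
        proxy_pass http://127.0.0.1:80;
    }
}

server {
    listen 443 ssl;
    server_name api.agency-two.gov;

    ssl_certificate     /etc/nginx/ssl/api.agency-two.gov.crt;
    ssl_certificate_key /etc/nginx/ssl/api.agency-two.gov.key;

    location / {
        proxy_pass http://127.0.0.1:80;
    }
}
```

Clients that don't send SNI only ever see the default certificate, which is why client compatibility matters here.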

The downside of SNI is that it still has compatibility issues with some clients. Most web browsers handle SNI fine at this point, but the API client scene isn't quite so hot. One of the big offenders is Python 2, which only gained SNI support last December in version 2.7.9. Java 1.6 also doesn't support SNI, and I think it unfortunately might still be used quite a bit by our users. A variety of other libraries and bots may also have difficulties with SNI. Here's a pretty good list of problem clients, as well as a good strategy for analyzing which clients have failed due to lack of SNI support: https://www.mnot.net/blog/2014/05/09/if_you_can_read_this_youre_sniing

So while I think we still need to analyze our usage analytics in more detail before rolling this out on any existing production sites, I think we should at least start allowing for this configuration (and perhaps even set up new agencies this way by default), since it paves the way for cheaper handling of agency subdomains.

I actually played with getting this set up several months ago. I'm a bit hazy on the details now, but I think the main thing we need to pay attention to is how to set up the ELB with nginx. Since we have to switch the ELB into TCP mode, we also need to switch the ELB to use the PROXY protocol so that we can retain the original IP address information while in TCP mode (we do care about that for rate limiting purposes). This means nginx also needs to be set up to use the PROXY protocol, and I think that's where my last experiments stopped. I wanted to do more sanity checks to ensure that enabling the PROXY protocol on our nginx listeners wouldn't break any of our existing ELBs that use HTTP (and not the PROXY protocol). So basically, we need to ensure we can mix and match TCP mode & PROXY protocol ELBs with our existing HTTP ELBs (or we would need to switch everything over to the PROXY protocol).
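For reference, the nginx side of the PROXY protocol piece is small (the port and CIDR below are placeholders). The key detail is that `proxy_protocol` applies per `listen` directive, so a dedicated listener can speak the PROXY protocol while the plain HTTP listener for the existing ELBs stays untouched:

```nginx
server {
    # Only this listener expects the PROXY protocol header; the plain
    # HTTP listener on port 80 used by the existing ELBs is unaffected.
    listen 9080 proxy_protocol;

    # Recover the original client IP from the PROXY protocol header,
    # trusting only the load balancer's address range (placeholder CIDR).
    set_real_ip_from 10.0.0.0/8;
    real_ip_header   proxy_protocol;

    location / {
        proxy_pass http://127.0.0.1:80;
    }
}
```

A client (or ELB) that sends plain HTTP to a `proxy_protocol` listener will get rejected, which is exactly the mix-and-match hazard described above.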

GUI commented 8 years ago

We now support using SNI, along with automatically registering SSL certificates for any new agency subdomain (using Let's Encrypt). Automatically handling SSL certificate registration wasn't originally part of this issue (the idea is briefly mentioned in https://github.com/18F/api.data.gov/issues/295), but it became the reason for tackling some of this. It should hopefully smooth out one of the primary road bumps we always run into when setting up new agency subdomains, and speed up the process.

I think our basic strategy moving forward will be to set up any new agency with this SNI requirement and automatic SSL handling. Currently that's just api.18f.gov, but that will give us a good opportunity to work out any potential kinks with this new approach. If all of this proves to be sound, then we might begin to encourage existing agencies to switch over to this new setup at some point in the future (I think API clients without SNI support are definitely fading, and the selling point would be that the agency would no longer have to worry about SSL fees, remembering to renew, etc.).

How It All Works

Since this new approach changes several things with our setup, I wanted to quickly document what this new setup looks like in more detail. Parts of the current setup are also somewhat temporary while we prove this concept, so we'll need to revisit some of this.

DNS Changes

As part of standardizing our approach to agency subdomains, we're going to start creating new subdomains off of api.data.gov for agencies to CNAME to. Previously, we would give each agency the direct ELB domain to CNAME to (so, for example, developer.nrel.gov is CNAMEd to api-nrel-221909003.us-east-1.elb.amazonaws.com). Under the new standards, we'll create a new subdomain under *.domains.api.data.gov for each agency's subdomain and then have the agency CNAME to that (so, for example, api.18f.gov is CNAMEd to 18f.domains.api.data.gov).
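In zone-file terms, the new CNAME chain looks roughly like this (the ELB hostname below is a made-up placeholder, not the real one):

```
; In the agency's zone (18f.gov), pointing at our intermediate name:
api.18f.gov.              IN CNAME 18f.domains.api.data.gov.

; In the api.data.gov zone, pointing at the shared SNI ELB:
18f.domains.api.data.gov. IN CNAME api-sni-example.us-east-1.elb.amazonaws.com.
```

Because the agency only ever references the *.domains.api.data.gov name, we can repoint the second CNAME at different infrastructure without the agency having to change anything.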

The reason for this is two-fold:

ELB and Proxying Changes

In terms of how all this fits together, we now have a dedicated ELB that we'll use just for these SNI domains. So while we do now have one more ELB, hopefully this will be the last one (and we can possibly retire the other ones if we eventually migrate existing agencies to this).

The primary change is that now this ELB is no longer terminating our SSL. Instead, we're proxying all the traffic (both HTTP and HTTPS) to our backend servers via TCP and the PROXY protocol (the PROXY protocol allows us to retain IP address information). Since we're using the PROXY protocol, the traffic is routed to separate ports on our backend servers (rather than directly to the normal API Umbrella instance running on port 80). This allows all of our existing ELBs proxying over HTTP:80 to continue to work as-is. We're also proxying to a completely separate nginx process running on the same server (for reasons I'll get into next).

So in rudimentary ASCII art diagram form, the flow might look something like:

[ HTTP:80 / HTTPS:443 ]
       |
[ SNI ELB ]
       |
       | TCP with PROXY protocol
       |
[ PROXY protocol/SSL terminating nginx instance (port 9080) ]
       |
       | localhost HTTP proxy
       |
[ API Umbrella (port 80) ]

The main difference from our current setup is the extra hop to our local SSL terminator (but since it's local, the impact should be minimal).
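Putting the pieces together, the port-9080 terminator's configuration looks roughly like this (the certificate paths and ELB CIDR are placeholders):

```nginx
# SSL-terminating nginx instance sitting behind the SNI ELB (TCP mode).
server {
    # The ELB forwards raw TCP traffic here with the PROXY protocol
    # header prepended, so this listener handles both concerns.
    listen 9080 ssl proxy_protocol;
    server_name api.18f.gov;

    ssl_certificate     /etc/nginx/ssl/api.18f.gov.crt;
    ssl_certificate_key /etc/nginx/ssl/api.18f.gov.key;

    # Restore the client IP carried in the PROXY protocol header
    # (trusting only the ELB's network -- placeholder CIDR).
    set_real_ip_from 10.0.0.0/8;
    real_ip_header   proxy_protocol;

    location / {
        # The extra local hop to the normal API Umbrella instance.
        proxy_pass http://127.0.0.1:80;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```

Since the proxy hop is over localhost, the added latency should be negligible compared to the network hops already involved.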

Automatic SSL Handling

This is where the fun stuff is! There's new functionality in OpenResty that allows us to dynamically handle SSL certificates inside nginx using Lua handlers. I began playing around with this functionality in the lua-resty-auto-ssl project I created, and that's what this setup now leverages.

The overall premise is pretty simple: Whenever we receive an SSL request, we look at the SNI hostname, and if we don't already have an SSL certificate for it, we register one on-the-fly with Let's Encrypt. Since we're in control of the domain, we can easily handle the automatic domain name verifications and respond to Let's Encrypt's verification requests. The very first request might pause for 5-10 seconds while this registration and verification takes place, but then we store and cache the SSL certificate for all future requests.
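The OpenResty wiring for this, roughly following the lua-resty-auto-ssl README (the fallback certificate paths and challenge port are placeholders), looks like:

```nginx
http {
  # Shared storage so certificates registered in one nginx worker are
  # visible to all workers.
  lua_shared_dict auto_ssl 1m;
  lua_shared_dict auto_ssl_settings 64k;

  init_by_lua_block {
    auto_ssl = (require "resty.auto-ssl").new()
    auto_ssl:init()
  }

  init_worker_by_lua_block {
    auto_ssl:init_worker()
  }

  server {
    listen 443 ssl;

    # Look up (or register on the fly) a certificate for the SNI
    # hostname on each handshake.
    ssl_certificate_by_lua_block {
      auto_ssl:ssl_certificate()
    }

    # nginx still requires a static certificate to boot; this fallback
    # is only served when no SNI hostname matches (self-signed is fine).
    ssl_certificate     /etc/ssl/resty-auto-ssl-fallback.crt;
    ssl_certificate_key /etc/ssl/resty-auto-ssl-fallback.key;
  }

  # Internal server that answers Let's Encrypt's HTTP-01 domain
  # verification requests during registration.
  server {
    listen 127.0.0.1:8999;
    location / {
      content_by_lua_block {
        auto_ssl:challenge_server()
      }
    }
  }
}
```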

Currently we're maintaining a hard-coded whitelist of domain names we will allow SSL registration for (mainly to prevent abuse, so we don't request a bunch of certificates for faked domains from Let's Encrypt). So for now, there is still a manual step of adding new agency domains to this list, but I think we can eventually figure out a better way to set this up.
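The whitelist hooks in through lua-resty-auto-ssl's `allow_domain` callback; a hard-coded version might look like this (the domain list is illustrative):

```nginx
init_by_lua_block {
  auto_ssl = (require "resty.auto-ssl").new()

  -- Only issue certificates for explicitly whitelisted agency domains;
  -- anything else is refused before we ever contact Let's Encrypt.
  local allowed = {
    ["api.18f.gov"] = true,
  }
  auto_ssl:set("allow_domain", function(domain)
    return allowed[domain] or false
  end)

  auto_ssl:init()
}
```

A less manual approach could eventually replace the static table with a lookup against our own domain database.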

The primary thing to note for now is that this currently runs a separate version of nginx on our servers, and the setup is a bit of a one-off (currently managed in this recipe). If all of this proves useful, then I'd eventually like to integrate this capability back into API Umbrella to tidy up the deployment of this.

A couple further notes on security: