fabiolb / fabio

Consul Load-Balancing made simple
https://fabiolb.net
MIT License

SSL Certs from Vault #70

Closed far-blue closed 8 years ago

far-blue commented 8 years ago

I love the concept of fabio where you cut out all the middle layers and simply route according to the service records. I'd love to see TCP routing for ssh and mysql services but that's a different issue ;)

What I'd like to suggest here is that SSL certs are fetched from Vault. This would allow services to have auto-generated certs based on Vault's PKI support, which has improved greatly in the last couple of releases. I believe the REST API for Vault is very simple if you are just requesting certs, and then it's just a case of tracking expiry, which can be done in memory because after restarting fabio you can just request fresh certs. You could even fetch certs lazily on the first routing request.

magiconair commented 8 years ago

Yeah, that has been on my wish list for a while (see #27)...

jefferai commented 8 years ago

Vault guy here! Got pointed this way :-)

This would be super cool -- you are probably aware of this, but there is a function that lets you get a certificate based on a client-supplied host name (GetCertificate at https://golang.org/pkg/crypto/tls/#Config) so you could definitely fetch-on-demand.

Looked through the other ticket -- there's definitely room for both LE and Vault, as they really tackle different use-cases. LE is designed to provide certs in an automated way to the Internet infrastructure, but doesn't work well within an organization. Vault's PKI support is designed to provide certs in an automated way within an organization, where you don't need to issue certs acceptable to the wider Internet but you need to issue a large number from a root trusted internally. So for software like fabio, if Internet-facing, you can definitely imagine fetching certs from LE for the front end, and your backend services fetching certs from Vault. If fabio isn't Internet-facing in your setup, it could also fetch from Vault.

magiconair commented 8 years ago

@jefferai I've got a question regarding the Vault integration. Right now I'm polling Vault every couple of seconds (default every 3s, no less than every sec) to get the list of certificates stored under a certain path. I am also renewing the token on every refresh.

I assume the path structure in Vault looks like this:

secret/fabio/certs
secret/fabio/certs/a.com cert=---BEGIN CERTIFICATE --- key=--- BEGIN RSA PRIVATE KEY ---
secret/fabio/certs/b.com cert=---BEGIN CERTIFICATE --- key=--- BEGIN RSA PRIVATE KEY ---
...

I don't care about the leases since I'm always replacing the certificates with whatever I get from Vault and I couldn't see how Vault would tell me when things have changed like Consul does.

Is this in line with how Vault should be used?

jefferai commented 8 years ago

Hi @magiconair

That sounds like much, much more traffic to Vault than should be needed (and many more token refreshes). Can you explain the design a bit more? Is there any reason not to simply key off refreshing from Vault based on the certificate's expected lifetime? (e.g. start checking at halfway until expiration, increasing frequency as you get closer to expiration)
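A minimal sketch of the lifetime-based schedule suggested above: no checks during the first half of the certificate's lifetime, then checks at half of the remaining time, so the frequency rises toward expiration. The function name `nextCheck` and the one-second floor are assumptions made for illustration; neither comes from fabio or Vault.

```go
package main

import (
	"fmt"
	"time"
)

// nextCheck returns how long to wait before asking the cert store again.
// lifetime is NotAfter - NotBefore, remaining is NotAfter - now.
func nextCheck(lifetime, remaining time.Duration) time.Duration {
	const minInterval = time.Second // assumed floor, not taken from the thread
	if remaining <= 0 {
		return minInterval // already expired: retry soon
	}
	half := lifetime / 2
	if remaining > half {
		return remaining - half // sleep until the halfway point
	}
	// Past halfway: wait half of what is left, so checks get more frequent.
	if next := remaining / 2; next > minInterval {
		return next
	}
	return minInterval
}

func main() {
	lifetime := 24 * time.Hour
	for _, remaining := range []time.Duration{24 * time.Hour, 12 * time.Hour, 6 * time.Hour, time.Second} {
		fmt.Println(remaining, "->", nextCheck(lifetime, remaining))
	}
}
```

With this schedule a certificate valid for 24 hours is left alone for 12 hours, then checked after 6 hours, 3 hours, and so on, which is far less traffic than a fixed 3-second poll.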

magiconair commented 8 years ago

@jefferai The problem isn't about cert expiration but about detecting when another cert has been added or removed and how quickly fabio can pick this up without being restarted.

Think about how this would work in consul. You add a cert to the KV store and all watchers would be notified that something has changed. Then fabio can load the new list of certificates and replace the old one. Since Vault does not have such a mechanism for watching for changes I have to revert to polling. Am I missing something or does that make sense to you?

magiconair commented 8 years ago

Hi @jefferai

the design is as follows: a background process fetches the available certificates from a source (file, path, http, consul, vault), checks if there is a difference to the previous value and only then updates them.
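The fetch, compare, and update cycle described above could look roughly like the following sketch. The names `certSet`, `store`, and `refresh` are hypothetical and not fabio's actual types; the fetch function stands in for any of the sources (file, path, http, consul, vault).

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"sync/atomic"
)

// certSet maps a host name to its PEM data (certificate and key).
type certSet map[string][]byte

// fingerprint hashes a set in sorted key order so equal sets hash equally.
func fingerprint(s certSet) [sha256.Size]byte {
	names := make([]string, 0, len(s))
	for n := range s {
		names = append(names, n)
	}
	sort.Strings(names)
	h := sha256.New()
	for _, n := range names {
		// Length prefixes keep adjacent fields from running together.
		fmt.Fprintf(h, "%d:%s%d:", len(n), n, len(s[n]))
		h.Write(s[n])
	}
	var sum [sha256.Size]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

// store holds the set the proxy serves from. Only the background
// goroutine calls refresh; request handlers only call certs.
type store struct {
	current atomic.Value // holds a certSet
	last    [sha256.Size]byte
	loaded  bool
}

// certs returns the current set without blocking on any fetch.
func (st *store) certs() certSet {
	s, _ := st.current.Load().(certSet)
	return s
}

// refresh fetches the set and swaps it in only if it differs from the
// previous one. It reports whether a swap happened.
func (st *store) refresh(fetch func() (certSet, error)) (bool, error) {
	next, err := fetch()
	if err != nil {
		return false, err // keep serving the old set
	}
	sum := fingerprint(next)
	if st.loaded && sum == st.last {
		return false, nil
	}
	st.current.Store(next)
	st.last, st.loaded = sum, true
	return true, nil
}

func main() {
	st := &store{}
	a := certSet{"a.com": []byte("cert-and-key-1")}
	b := certSet{"a.com": []byte("cert-and-key-1"), "b.com": []byte("cert-and-key-2")}
	for _, next := range []certSet{a, a, b} {
		changed, err := st.refresh(func() (certSet, error) { return next, nil })
		fmt.Println(changed, err)
	}
}
```

In a real proxy `refresh` would be driven by a ticker or a change notification in its own goroutine, which is what keeps fetching from blocking the serving path.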

This decouples the fetching of the certs from the serving, i.e. fetching certs cannot block the main proxy.

If I fetched the certs on the first request, I would have to deal with a stampeding herd on startup where thousands of requests would all try to fetch the cert at the same time. I could still funnel this through a lock, but this has the potential of blocking the proxy.

magiconair commented 8 years ago

@jefferai your comment got me thinking. I should be able to keep this decoupled while at the same time fetching certificates only on demand.

jefferai commented 8 years ago

@magiconair In case it got lost, another recommendation for the GetCertificate function in https://golang.org/pkg/crypto/tls/#Config. You could use this to fetch certificates on demand and then simply memoize them.

In fact, this is exactly how we implemented certificate reloading in Vault. That function fetches the certs from disk and stores the parsed objects in memory; when a connection comes in and that function is called it simply returns the cert. However, when a SIGHUP comes in, it forgets that cert and re-parses the file on disk, then memoizes the new value.

This way you don't need to keep hitting Vault looking for new certs -- you can simply return the ones you already have, and maybe check now and again to see if new versions are available.
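As a rough illustration of the memoize-and-forget pattern described here, the sketch below wires a small cache into the `GetCertificate` hook of `tls.Config`. The types `certCache` and `countingLoader` are hypothetical; the loader stands in for a read from Vault or disk and returns an empty certificate purely for the demo. This is not Vault's or fabio's actual code.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"sync"
)

// certCache memoizes certificates by host name.
type certCache struct {
	mu    sync.RWMutex
	certs map[string]*tls.Certificate
	load  func(name string) (*tls.Certificate, error) // hypothetical source
}

func newCertCache(load func(string) (*tls.Certificate, error)) *certCache {
	return &certCache{certs: make(map[string]*tls.Certificate), load: load}
}

// lookup returns the memoized certificate, loading it on first use.
// Two concurrent misses for the same name may both call load; the last
// result wins. This sketch does not deduplicate them.
func (c *certCache) lookup(name string) (*tls.Certificate, error) {
	c.mu.RLock()
	cert, ok := c.certs[name]
	c.mu.RUnlock()
	if ok {
		return cert, nil
	}
	cert, err := c.load(name)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.certs[name] = cert
	c.mu.Unlock()
	return cert, nil
}

// GetCertificate has the signature tls.Config expects and is called
// once per handshake with the client-supplied server name.
func (c *certCache) GetCertificate(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
	return c.lookup(hello.ServerName)
}

// Forget drops every memoized certificate, e.g. on a reload trigger.
func (c *certCache) Forget() {
	c.mu.Lock()
	c.certs = make(map[string]*tls.Certificate)
	c.mu.Unlock()
}

// countingLoader is a demo loader that counts how often it is called.
// It is not safe for concurrent use.
func countingLoader(calls *int) func(string) (*tls.Certificate, error) {
	return func(string) (*tls.Certificate, error) {
		*calls++
		return &tls.Certificate{}, nil
	}
}

func main() {
	calls := 0
	cache := newCertCache(countingLoader(&calls))
	cfg := &tls.Config{GetCertificate: cache.GetCertificate}
	_ = cfg // would be handed to tls.Listen or http.Server.TLSConfig

	cache.lookup("a.com")
	cache.lookup("a.com") // served from memory
	cache.Forget()
	cache.lookup("a.com") // loaded again after the reload
	fmt.Println("loads:", calls)
}
```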

magiconair commented 8 years ago

@jefferai no that didn't get lost and that is the function I'm using for serving the certs and the certs are cached in memory until they change.

The problem is with fetching and when and how to trigger the reload. You rely on a SIGHUP which has to be triggered by someone or something. Also, if you run more than one fabio instance they'd all have to receive the signal more or less at the same time on different machines unless you build a coordination mechanism into vault. If that isn't there then this is a process that someone has to build and maintain which I want to avoid.

Consul offers the option to wait (long poll, waitIndex) for a change. That allows me to update the proxy routing table of all connected fabio instances at the same time without the need for external coordination. I'd like to achieve the same thing with the cert sources but since only consul offers the wait-for-change feature I've reverted to polling where necessary.
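For reference, the wait-for-change behaviour mentioned here is Consul's blocking query: the client passes the last seen index and a maximum wait, and the request returns when the data changes or the wait elapses, with the new index in the `X-Consul-Index` response header. The sketch below shows the idea against the KV HTTP API; the helper names, the `fabio/certs` prefix, and the 5-minute wait are assumptions for illustration, not fabio's implementation.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// watchURL builds a blocking-query URL for everything under a KV prefix.
func watchURL(base, prefix string, index uint64) string {
	q := url.Values{}
	q.Set("recurse", "true")
	q.Set("index", strconv.FormatUint(index, 10))
	q.Set("wait", "5m") // assumed maximum wait
	return base + "/v1/kv/" + prefix + "?" + q.Encode()
}

// waitForChange blocks until the prefix changes or the wait elapses and
// returns the index to use for the next call. The client's timeout must
// be longer than the wait, or the request is cut off early.
func waitForChange(client *http.Client, base, prefix string, index uint64) (uint64, error) {
	resp, err := client.Get(watchURL(base, prefix, index))
	if err != nil {
		return index, err
	}
	defer resp.Body.Close()
	// 404 means the prefix is empty; the index header is still set.
	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusNotFound {
		return index, fmt.Errorf("consul: unexpected status %s", resp.Status)
	}
	next, err := strconv.ParseUint(resp.Header.Get("X-Consul-Index"), 10, 64)
	if err != nil {
		return index, err
	}
	if next < index {
		next = 0 // index went backwards: start over
	}
	return next, nil
}

func main() {
	// Prints the request a watcher would issue; no network access here.
	fmt.Println(watchURL("http://127.0.0.1:8500", "fabio/certs", 42))
}
```

A watcher would call `waitForChange` in a loop and reload the certificate list whenever the returned index differs from the one it passed in.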

jefferai commented 8 years ago

You know when certs expire -- why not just fetch based on time, unless someone manually sends a signal to Fabio, at which point you could remove all certs from memory and treat all as fresh?

I honestly don't see any reason for polling here.

magiconair commented 8 years ago

Everything else in fabio is automatic. There are no signals to be sent and nothing to be configured. That's the design goal of it. Therefore, certificates should be available to fabio as soon as they are added to the store and they should be available to all fabio instances that make up a cluster more or less at the same time - ideally immediately.

So I either try to fetch the cert for an unknown domain on the first request, or I tell fabio to reload the certs manually or fabio checks whether something has changed periodically.

The first option requires some refactoring and has the potential for blocking fabio while the certificates are being fetched. What if I get lots of requests for domains I don't have a cert for? That might kill the cert store.

The second option requires either some manual intervention or some glue code the user has to provide. Neither is in line with fabio's design goals.

The third option is how fabio works now but it requires a database which notifies fabio when something has changed (i.e. consul) or I have to poll for changes.

jefferai commented 8 years ago

> Everything else in fabio is automatic. There are no signals to be sent and nothing to be configured.

Then don't use signals. I just suggested that if you wanted a way for an operator to explicitly tell fabio to reload.

> The first option requires some refactoring and has the potential for blocking fabio while the certificates are being fetched. What if I get lots of requests for domains I don't have a cert for? That might kill the cert store.

It can block fabio while certs are being fetched, but after the first fetch it'll be memoized. Besides, it should only be blocking that single goroutine. You can memoize negative results with a retry timer for certificates that aren't available.

> The second option requires either some manual intervention or some glue code the user has to provide. Neither is in line with fabio's design goals.

I don't see why allowing an administrator to manually expire certificates is a bad thing.

> The third option is how fabio works now but it requires a database which notifies fabio when something has changed (i.e. consul) or I have to poll for changes.

I think this option is fine as long as you poll reasonably. Polling the entire cert store every three seconds is completely wasteful. You're better off using a timer per certificate to control when you next poll, based on certificate lifetime. But that ends up basically looking like option number one.

My strong suggestion is option number one (with flavors of option three). Store a backoff time value, rwmutex, and certificate information in a struct; use a thread-safe data type to look up the appropriate struct for a name, do a read lock, and return the info if valid.

Separately, have a management thread that checks each certificate; if the certificate will be expiring soon, or does not exist, get a write lock and attempt a read from the certificate store. In either case set the backoff time to half of the remaining time until expiration. If a certificate doesn't yet exist, or if it has expired without being refreshed, get a write lock and do a read from the certificate store...if nothing comes back set the backoff time to some near value (say, 3 seconds) and try again later.
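One possible reading of this design, sketched under several assumptions: each name gets an entry guarded by its own RWMutex, the request path only takes a read lock and never talks to the store, and a management routine calls `refresh` per entry. All names here (`entry`, `manager`, `fixedStore`, `epoch`) are hypothetical, the store is faked, and explicit `now` arguments replace the clock so the behaviour is easy to follow.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errNoCert = errors.New("no certificate available")

// entry holds one name's certificate data plus its backoff deadline.
type entry struct {
	mu        sync.RWMutex
	pem       []byte // nil records a negative result
	notAfter  time.Time
	nextCheck time.Time // the store is not asked again before this
}

type manager struct {
	entries sync.Map // name -> *entry
	fetch   func(name string) ([]byte, time.Time, error)
	retry   time.Duration // backoff after a miss, e.g. 3 seconds
}

func (m *manager) entryFor(name string) *entry {
	// Caveat: this registers every requested name, so unknown names
	// would need pruning or a limit in real use.
	v, _ := m.entries.LoadOrStore(name, &entry{})
	return v.(*entry)
}

// get is the request path: read lock only, no store access.
func (m *manager) get(name string, now time.Time) ([]byte, error) {
	e := m.entryFor(name)
	e.mu.RLock()
	defer e.mu.RUnlock()
	if e.pem == nil || !now.Before(e.notAfter) {
		return nil, errNoCert
	}
	return e.pem, nil
}

// refresh is the management step for one name: skip while backing off,
// otherwise take the write lock and read from the store.
func (m *manager) refresh(name string, now time.Time) {
	e := m.entryFor(name)
	e.mu.Lock()
	defer e.mu.Unlock()
	if now.Before(e.nextCheck) {
		return
	}
	pem, notAfter, err := m.fetch(name)
	if err != nil || !notAfter.After(now) {
		e.nextCheck = now.Add(m.retry) // miss: try again shortly
		return
	}
	e.pem, e.notAfter = pem, notAfter
	e.nextCheck = now.Add(notAfter.Sub(now) / 2) // half the remaining time
}

// epoch is an arbitrary fixed start time for the demo.
var epoch = time.Unix(1_000_000, 0)

// fixedStore fakes a cert store that knows a single name and counts calls.
func fixedStore(name string, pem []byte, notAfter time.Time, calls *int) func(string) ([]byte, time.Time, error) {
	return func(n string) ([]byte, time.Time, error) {
		*calls++
		if n != name {
			return nil, time.Time{}, errNoCert
		}
		return pem, notAfter, nil
	}
}

func main() {
	calls := 0
	m := &manager{
		fetch: fixedStore("a.com", []byte("pem"), epoch.Add(100*time.Second), &calls),
		retry: 3 * time.Second,
	}
	m.refresh("a.com", epoch)
	pem, err := m.get("a.com", epoch)
	fmt.Println(string(pem), err, "store reads:", calls)
}
```

Holding the write lock during the store read blocks only requests for that one name, which matches the point above that a fetch should stall a single goroutine rather than the whole proxy.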

jefferai commented 8 years ago

@magiconair BTW, I'll be in Amsterdam next week for HashiConf EU. You should join us at http://www.meetup.com/Software-Circus/events/228747162/ !

magiconair commented 8 years ago

Hi @jefferai

Unfortunately, I'll still be on vacation until Tuesday but we can meet on Wed, 15 Jun since I'm presenting in the afternoon. I'll be at the venue in the morning.

Admin interaction is something I specifically don't want. I'll explain why during the presentation :)

I'll think about this a bit more.

jefferai commented 8 years ago

Wow...I'm embarrassed -- I totally missed that you were talking at HC EU! Blame it on me being busy with releasing, blog posts, talks, training...

Looking forward to talking to you then. BTW -- I have zero issue with you not wanting admin interaction. I truly only brought it up as an alternate method on top of automatic, because direct control gives people warm fuzzies. But as per my post above I think we can get a good automatic solution that isn't resorting to constant polling.