cloudposse / docs

Terraform Reference Architecture for AWS, Datadog and GitHub Actions
https://docs.cloudposse.com

Mitigating Let's Encrypt Rate Limiting Issues #174

Open osterman opened 6 years ago

osterman commented 6 years ago

what

We're concerned about Let's Encrypt rate limiting. Switching our staging environment over to Let's Encrypt's staging environment is fair enough, but I'm still concerned about rate limits in production.

why

It basically means we could be blocked from making changes to our infrastructure if Let's Encrypt rate limits us again, so we need some solution to that. Naively, we could switch to a wildcard cert for *.example.net and make sure all of the servers use DNS names of the form server-123-123.example.net.
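The naming constraint matters because a wildcard SAN only covers a single DNS label. A minimal Python sketch of that matching rule follows. The function name `covered_by` is hypothetical, and the check is a simplification of real TLS hostname verification:

```python
def covered_by(hostname: str, san: str) -> bool:
    """Return True if `hostname` is covered by the certificate name `san`.

    Simplified rule: a leading "*" label matches exactly one DNS label,
    so "*.example.net" covers "a.example.net" but not "a.b.example.net"
    and not the bare "example.net".
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    san_labels = san.lower().rstrip(".").split(".")

    # A wildcard never spans multiple labels, so the label counts must match.
    if len(host_labels) != len(san_labels):
        return False

    if san_labels[0] == "*":
        # The wildcard stands in for one non-empty left-most label.
        return host_labels[0] != "" and host_labels[1:] == san_labels[1:]

    return host_labels == san_labels


if __name__ == "__main__":
    for name in ("server-123-123.example.net",
                 "server.123.123.example.net",
                 "example.net"):
        print(name, covered_by(name, "*.example.net"))
```

This is why the servers would need flat names like server-123-123.example.net: a dotted name such as server.123.123.example.net falls outside *.example.net.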

osterman commented 6 years ago

There are a few options.

option 1

Use an ACM certificate provisioned with terraform and associated with the nginx-ingress.

https://github.com/cloudposse/terraform-aws-acm-request-certificate

Reference implementation here: https://github.com/cloudposse/terraform-root-modules/tree/master/aws/acm

Then set the ingress annotations to use this ACM certificate (e.g. with SANs for *.ourapp.us-west-2.staging.example.net and ourapp.us-west-2.staging.example.net).

AWS Service annotations


These are passed to the Helm chart in the helmfile.yaml https://github.com/cloudposse/geodesic/blob/master/rootfs/conf/kops/helmfile.yaml#L556-L557

option 2

Use a different operational domain for production to reduce sharing across stages. E.g. treat example.net as a staging domain and example.co as the production operations domain. This is what another one of our customers does. They incidentally use ACM certs as well, but only because we started this journey before kube-lego existed.

other considerations

The likelihood of getting rate limited in production is small for a few reasons:

  1. Very few new services are launched
  2. Namespaces are seldom, if ever, destroyed
  3. Certificates are still long-lived, so requests to the API are few and far between. They can be renewed earlier than the 90-day cutoff, and rate limits would have to be in effect for several days for a renewal to ultimately fail or time out.
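The third point can be made concrete with some date arithmetic. The sketch below assumes a renewal attempt begins 30 days before expiry. That figure is an assumption for illustration, not something stated in this thread, and the function names are hypothetical. It reports the first day a renewal could succeed if rate limiting blocks requests for a given number of days:

```python
from datetime import date, timedelta
from typing import Optional, Tuple

CERT_LIFETIME_DAYS = 90   # Let's Encrypt certificates are valid for 90 days
RENEW_BEFORE_DAYS = 30    # assumption: renewal starts 30 days before expiry


def renewal_window(issued: date) -> Tuple[date, date]:
    """Return (first day renewal is attempted, expiry day)."""
    expires = issued + timedelta(days=CERT_LIFETIME_DAYS)
    renew_from = expires - timedelta(days=RENEW_BEFORE_DAYS)
    return renew_from, expires


def first_successful_renewal(issued: date, outage_start: date,
                             outage_days: int) -> Optional[date]:
    """First day a renewal can succeed, or None if the cert expires first.

    `outage_start` and `outage_days` describe a period during which
    every request is rejected because of rate limiting.
    """
    renew_from, expires = renewal_window(issued)
    outage_end = outage_start + timedelta(days=outage_days)

    # If the renewal window opens during the outage, wait until it ends.
    if outage_start <= renew_from < outage_end:
        attempt = outage_end
    else:
        attempt = renew_from

    return attempt if attempt < expires else None


if __name__ == "__main__":
    issued = date(2024, 1, 1)
    print(renewal_window(issued))
    print(first_successful_renewal(issued, date(2024, 3, 1), 7))
    print(first_successful_renewal(issued, date(2024, 3, 1), 30))
```

Under these assumptions, a week of rate limiting only delays the renewal. The certificate lapses only if the block lasts for the whole renewal window.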

The reason you're at elevated risk in staging is the large number of publicly exposed services that result from running "unlimited staging environments". Moving staging to Let's Encrypt's staging environment reduces the risk of inducing rate limits in production. Using an entirely separate domain in production mitigates the impact even further.