brave / star-randsrv

Go wrapper service for the STAR randomness server.
Mozilla Public License 2.0

Safeguard against hitting the Let's Encrypt rate limit #78

Closed: NullHypothesis closed this issue 8 months ago

NullHypothesis commented 1 year ago

Let's Encrypt imposes various rate limits. Perhaps the most important one is:

Renewals are treated specially: they don’t count against your Certificates per Registered Domain limit, but they are subject to a Duplicate Certificate limit of 5 per week. Exceeding the Duplicate Certificate limit is reported with the error message too many certificates already issued for exact set of domains.

I believe we ran into this rate limit the other day. (Unfortunately, I did not save the error message, so I'm not entirely sure.)

We need to put safeguards in place that prevent us from running into this rate limit in production because if we do, there's little we can do (short of using another CA) to get around this.

diracdeltas commented 1 year ago

Let me know if you need me to ask Let's Encrypt for something here. (We are a sponsor.)

NullHypothesis commented 1 year ago

I now hit this rate limit on our production instance:

2023/05/21 16:30:10 http: TLS handshake error from 192.168.127.1:22720: 429 urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates (5) already issued for this exact set of domains in the last 168 hours: star-randsrv.bsg.brave.com, retry after 2023-05-22T17:48:40Z: see https://letsencrypt.org/docs/duplicate-certificate-limit/

And indeed, our deployment requested eight certificates in the last few days: https://crt.sh/?q=star-randsrv.bsg.brave.com Unfortunately, it's easy for us to hit this limit: A single change to Kubernetes may result in a pod restart, which results in a new certificate.

@diracdeltas: Do you think Let's Encrypt is able to make an exception for us here?

NullHypothesis commented 1 year ago

Unfortunately, Let's Encrypt is unable to accommodate a higher rate limit.

I don't have a great solution to this problem. Perhaps we could maintain a backup domain (e.g., star-randsrv-2.bsg.brave.com) that would allow us to continue operations if the main domain cannot retrieve a certificate.

Either way, we need some kind of fix for this.

cc @DJAndries, @rillian

NullHypothesis commented 1 year ago

The AWS Certificate Manager may be another option, but at the cost of dealing with IAM role complexity similar to KMS. While the docs suggest that this only works with NGINX and Apache, it looks like this may also work with Go's built-in Web server that we use.

DJAndries commented 1 year ago

The AWS Certificate Manager may be another option, but at the cost of dealing with IAM role complexity similar to KMS. While the docs suggest that this only works with NGINX and Apache, it looks like this may also work with Go's built-in Web server that we use.

Yes, it looks like any application that supports PKCS11 can leverage ACM. My concern is, can end users trust us if we store our cert/keys in ACM, which could potentially be accessed from outside the enclave?

DJAndries commented 1 year ago

perhaps we could piggyback the cert key when https://github.com/brave/nitriding-daemon/issues/10 is implemented

NullHypothesis commented 1 year ago

The AWS Certificate Manager may be another option, but at the cost of dealing with IAM role complexity similar to KMS. While the docs suggest that this only works with NGINX and Apache, it looks like this may also work with Go's built-in Web server that we use.

Yes, it looks like any application that supports PKCS11 can leverage ACM. My concern is, can end users trust us if we store our cert/keys in ACM, which could potentially be accessed from outside the enclave?

Right, we faced the same problem when we considered using the AWS Key Management Service to synchronize keys among enclaves. IAM policies are complicated and don't lend themselves well to being audited by third parties like our users. Let's pursue other solutions first.

perhaps we could piggyback the cert key when https://github.com/brave/nitriding-daemon/issues/10 is implemented

Yes, I think that's the only practical way to do enclave synchronization for now. It may not help with this issue though. I worry about two things:

  • We (or I, rather) mindlessly debug the Kubernetes config, which results in pod restarts, to a point where we hit the rate limit. That has happened twice so far. Basically, we only get to start the enclave pod five times in seven days. I don't have a great solution to this but I expect it to be less of a problem over time, as the Kubernetes config matures. (cc @hspencer77, in case he has other ideas.)
  • The code (either star-randsrv or nitro-shim) has a bug that results in the occasional pod restart, which hits the rate limit. I assume we could mitigate this issue by disabling auto-restart in Kubernetes.

DJAndries commented 1 year ago

Yes, I think that's the only practical way to do enclave synchronization for now. It may not help with this issue though. I worry about two things:

Apologies if I'm misunderstanding, but wouldn't a key/cert sync directly address the issues that you have mentioned? If a pod replica requires a restart it could get a cert from another replica. Only the leader replica would request a cert via ACME, right? I guess if that leader replica is in a crash loop, we could encounter this issue.

rillian commented 1 year ago

We (or I, rather) mindlessly debug the Kubernetes config, which results in pod restarts, to a point where we hit the rate limit. That has happened twice so far. Basically, we only get to start the enclave pod five times in seven days.

Would it help to use the letsencrypt staging endpoint for the dev deployment and work on config changes there first? That has a limit of 30,000 per week instead of 5. The downside is the CA cert will not be in the default trust set, complicating testing.
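If the service obtains its certificate via golang.org/x/crypto/acme/autocert (an assumption on my part about how star-randsrv is set up), switching the dev deployment to staging would be roughly a one-field change. A minimal sketch:

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"

	"golang.org/x/crypto/acme"
	"golang.org/x/crypto/acme/autocert"
)

func main() {
	mgr := &autocert.Manager{
		Prompt:     autocert.AcceptTOS,
		HostPolicy: autocert.HostWhitelist("star-randsrv.bsg.brave.com"),
		// Staging won't issue browser-trusted certs, but its
		// duplicate-certificate limit is 30,000/week instead of 5/week.
		Client: &acme.Client{
			DirectoryURL: "https://acme-staging-v02.api.letsencrypt.org/directory",
		},
	}
	srv := &http.Server{
		Addr:      ":443",
		TLSConfig: &tls.Config{GetCertificate: mgr.GetCertificate},
	}
	// Empty cert/key paths: certificates come from GetCertificate.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```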

NullHypothesis commented 1 year ago

Would it help to use the letsencrypt staging endpoint for the dev deployment and work on config changes there first? That has a limit of 30,000 per week instead of 5. The downside is the CA cert will not be in the default trust set, complicating testing.

It would not have helped in this particular case because the production environment is set up slightly differently from staging and dev (e.g., we use a different Docker repository for the nitro-shim).

NullHypothesis commented 1 year ago

Yes, I think that's the only practical way to do enclave synchronization for now. It may not help with this issue though. I worry about two things:

Apologies if I'm misunderstanding, but wouldn't a key/cert sync directly address the issues that you have mentioned? If a pod replica requires a restart it could get a cert from another replica. Only the leader replica would request a cert via ACME, right? I guess if that leader replica is in a crash loop, we could encounter this issue.

Right, but an enclave only syncs keys with identical enclaves. If we make a small change to the code, we have to update the entire set of enclaves, including the leader. The current design doesn't allow for certificate continuity: every source code change results in a new Let's Encrypt certificate. We could, however, relax this constraint and, say, allow a leader enclave whose image ID may differ from that of the worker enclaves. The only purpose of this leader enclave would be to obtain and distribute the Let's Encrypt certificate, which would enable certificate continuity. The cost is more complexity.
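To sketch the worker side of that design (hypothetical names; this is not how the code is structured today): a worker would serve the certificate pushed by the leader and only fall back to ACME, which is what burns the rate limit, if nothing has been synced yet.

```go
package enclave

import (
	"crypto/tls"
	"sync/atomic"

	"golang.org/x/crypto/acme/autocert"
)

// syncedCert holds a *tls.Certificate pushed by the leader enclave over
// the attested sync channel (the mechanism brave/nitriding-daemon#10
// would provide).
var syncedCert atomic.Value

// acmeManager is the fallback path: requesting a fresh certificate from
// Let's Encrypt, which counts against the duplicate-certificate limit.
var acmeManager = &autocert.Manager{
	Prompt:     autocert.AcceptTOS,
	HostPolicy: autocert.HostWhitelist("star-randsrv.bsg.brave.com"),
}

// GetCertificate prefers the leader's certificate for continuity and only
// talks to Let's Encrypt if no synced certificate has arrived.
func GetCertificate(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
	if c, ok := syncedCert.Load().(*tls.Certificate); ok {
		return c, nil
	}
	return acmeManager.GetCertificate(hello)
}
```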

NullHypothesis commented 1 year ago

Here's another safeguard: We could create a git commit hook for star-randsrv-ops that checks CT logs for star-randsrv.bsg.brave.com and prints an error message if we're close to the rate limit. Ideally, we would do this as part of Kubernetes but that doesn't seem so easy.

rillian commented 1 year ago

I wrote a quick trial script for this. One issue seems to be that crt.sh rate-limits or is otherwise uncomfortably slow. So maybe a CI job to annotate pull requests would be better than a git hook.
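Roughly, such a check amounts to the following (a sketch; it assumes crt.sh's JSON output and its entry_timestamp field):

```go
// Count certificates logged for our domain in the last seven days and
// warn when we approach Let's Encrypt's duplicate-certificate limit.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// ctEntry models the one field we need from crt.sh's JSON output.
type ctEntry struct {
	EntryTimestamp string `json:"entry_timestamp"`
}

func main() {
	const domain = "star-randsrv.bsg.brave.com"
	resp, err := http.Get("https://crt.sh/?q=" + domain + "&output=json")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var entries []ctEntry
	if err := json.NewDecoder(resp.Body).Decode(&entries); err != nil {
		log.Fatal(err)
	}

	cutoff := time.Now().AddDate(0, 0, -7)
	recent := 0
	for _, e := range entries {
		// Timestamps look like "2023-05-21T16:30:10.123"; Go's parser
		// accepts the fractional seconds even without them in the layout.
		t, err := time.Parse("2006-01-02T15:04:05", e.EntryTimestamp)
		if err != nil {
			continue
		}
		if t.After(cutoff) {
			recent++
		}
	}
	fmt.Printf("%d certificate(s) logged for %s in the last 7 days\n", recent, domain)
	if recent >= 4 {
		fmt.Println("WARNING: close to the duplicate-certificate limit (5/week)")
	}
}
```

One caveat: crt.sh lists precertificate and leaf entries separately, so the raw count can overestimate, which at least errs on the side of warning early.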

hspencer77 commented 1 year ago

The AWS Certificate Manager may be another option, but at the cost of dealing with IAM role complexity similar to KMS. While the docs suggest that this only works with NGINX and Apache, it looks like this may also work with Go's built-in Web server that we use.

Yes, it looks like any application that supports PKCS11 can leverage ACM. My concern is, can end users trust us if we store our cert/keys in ACM, which could potentially be accessed from outside the enclave?

Right, we faced the same problem when we considered using the AWS Key Management Service to synchronize keys among enclaves. IAM policies are complicated and don't lend themselves well to being audited by third parties like our users. Let's pursue other solutions first.

perhaps we could piggyback the cert key when brave/nitriding-daemon#10 is implemented

Yes, I think that's the only practical way to do enclave synchronization for now. It may not help with this issue though. I worry about two things:

  • We (or I, rather) mindlessly debug the Kubernetes config, which results in pod restarts, to a point where we hit the rate limit. That has happened twice so far. Basically, we only get to start the enclave pod five times in seven days. I don't have a great solution to this but I expect it to be less of a problem over time, as the Kubernetes config matures. (cc @hspencer77, in case he has other ideas).
  • The code (either star-randsrv or nitro-shim) has a bug that results in the occasional pod restart, which hits the rate limit. I assume we could mitigate this issue by disabling auto-restart in Kubernetes.

@rillian , @NullHypothesis : with Let's Encrypt, what's the workflow for an end user to validate the following:

  1. the certificate the enclave is using?
  2. that the certificate the enclave uses can be accessed only by that enclave and not by any other enclave?

Forgive me if this was discussed somewhere before, but I just wanted to make sure I get a full understanding of how Let's Encrypt is leveraged here.

rillian commented 1 year ago

what's the workflow for an end user to validate the following:

  1. the certificate the enclave is using?

The user connects to the enclave at the endpoint /enclave/attestation, passing a 20-byte hex nonce parameter to detect replay attacks. The enclave returns an attestation document signed by the AWS Nitro hypervisor. The client verifies:

  1. The passed nonce is returned with a valid signature from AWS.
  2. The hash of the TLS key is returned with a valid signature from AWS.
  3. The key hash matches the cert the enclave provided as part of the TLS session before the request was submitted.

This confirms the enclave has access to the same TLS cert that authenticated the request.
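Concretely, check 3 amounts to something like this on the client (a sketch; extracting the attested hash from the verified document is elided):

```go
package attest

import (
	"bytes"
	"crypto/sha256"
	"crypto/tls"
)

// certMatchesAttestation implements check 3: the SHA-256 fingerprint of the
// leaf certificate that authenticated the TLS session must equal the
// certificate hash carried in the attestation document. attestedCertHash is
// assumed to come from a document whose AWS signature and nonce were
// already verified (checks 1 and 2).
func certMatchesAttestation(state tls.ConnectionState, attestedCertHash []byte) bool {
	if len(state.PeerCertificates) == 0 {
		return false
	}
	fp := sha256.Sum256(state.PeerCertificates[0].Raw)
	return bytes.Equal(fp[:], attestedCertHash)
}
```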

  1. that the certificate the enclave uses can be accessed only by that enclave and not by any other enclave?

This is harder. An auditing user reviews the published source code we claim is running in the enclave and satisfies themselves that it securely obtains exclusive access to the TLS cert, and that the cert and private key are not accessible outside the enclave.

Then they perform a reproducible build, obtaining an enclave image ID. Then they check that the image ID matches the one signed by the AWS Nitro hypervisor in the attestation document obtained as above. This confirms the audited code is actually the code running in the enclave, and has whatever properties they care about, in particular that the TLS cert never leaves the enclave, so data passed to the enclave is protected end-to-end.

Using another cert provider would be fine, assuming users can trust their provisioning protocol to be equally secure.

NullHypothesis commented 1 year ago

@rillian , @NullHypothesis : with Let's Encrypt, what's the workflow for an end user to validate the following:

  1. the certificate the enclave is using?
  2. that the certificate the enclave uses can be accessed only by that enclave and not by any other enclave?

The workflow is:

  1. The user downloads our source code by cloning https://github.com/brave/star-randsrv.
  2. The user audits the source code to make sure that it's secure.
  3. The user compiles the source code to obtain the local image ID.
  4. The user runs the verify-enclave command line tool. This tool verifies that an enclave runs the given source code. Internally, the tool generates a random nonce and asks for an attestation document by talking to https://star-randsrv.bsg.brave.com/enclave/attestation. The attestation document contains the enclave's image ID.
  5. Finally, the verify-enclave tool does a bunch of checks:
    1. Make sure that the enclave's attestation document is signed by Amazon's root certificate.
    2. Make sure that the attestation document contains the nonce that was provided in the previous step.
    3. Make sure that the image ID in the attestation document is identical to the image ID that the user compiled locally.

If all these checks pass, the user has assurance that the enclave is running the code that the user audited in step 2.
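To make step 4 concrete, the request the tool makes looks roughly like this (a sketch; the nonce parameter name and the response handling are assumptions on my part):

```go
package main

import (
	"crypto/rand"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Generate a fresh 20-byte nonce so a recorded attestation document
	// cannot be replayed.
	nonce := make([]byte, 20)
	if _, err := rand.Read(nonce); err != nil {
		log.Fatal(err)
	}

	url := fmt.Sprintf("https://star-randsrv.bsg.brave.com/enclave/attestation?nonce=%x", nonce)
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// The document now goes through checks 5.1-5.3: AWS signature,
	// nonce match, and image ID comparison.
	fmt.Printf("got %d-byte attestation document\n", len(doc))
}
```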

Regarding your point 1: When fetching the enclave's certificate, the user knows that the certificate is authentic because its fingerprint is part of the enclave's attestation document. Remember that the attestation document is our root of trust. It binds an HTTPS connection (specifically: the fingerprint of the enclave's certificate) to Amazon's root certificate (which signed the attestation document).

Regarding your point 2: The user knows that an enclave's certificate cannot leave the enclave because the code (which the user audited) does not allow for that. There's one exception though: when enclaves synchronize with each other, they share their HTTPS certificate.

Forgive me if this was discussed somewhere before, but I just wanted to make sure I get a full understanding of how Let's Encrypt is leveraged here.

No worries, it's complicated!

hspencer77 commented 1 year ago

@rillian , @NullHypothesis : thanks for the details. Really appreciate it. One high-level question: looking at this request here - https://github.com/brave/security/issues/1300 - it seems that we 'trust' Let's Encrypt not to alter the certificate, as opposed to relying on IAM restrictions in ACM (ref: https://github.com/brave/star-randsrv/issues/78#issuecomment-1563503014)?

I ask because if we are tracking Let's Encrypt certs issued for star-randsrv.bsg.brave.com (https://crt.sh/?q=star-randsrv.bsg.brave.com), then we could do the same with ACM (e.g. https://crt.sh/?q=repsys-ip-anon.bsg.brave.com). This would require some additional work, but given how we currently have to deploy tokenizer (1 enclave per EC2 instance), we could leverage ACM for Nitro Enclaves.

rillian commented 1 year ago

@hspencer77 IIRC we're using Let's Encrypt for two reasons:

  1. We were concerned the AWS Certificate Manager service would have the same problems that the Key Management Service does, namely that it's difficult to demonstrate that Brave-the-AWS-customer couldn't access the same cert outside the enclave and subvert user data protection.
  2. Using a cert from a separate org distributes responsibility, raising the barrier to collusion.

The demo code for using ACM in an enclave says the private key stays local, so maybe the first point isn't actually an issue? Does ACM have a rate limit like Let's Encrypt?

hspencer77 commented 1 year ago

@rillian , ACM does have a quota but it can be adjusted (https://docs.aws.amazon.com/general/latest/gr/acm.html#limits_acm). With regard to separation of responsibility, aren't we effectively doing that with AWS already (e.g. we trust that AWS won't allow access to the same cert outside our AWS account, via the shared responsibility model)?

NullHypothesis commented 1 year ago

Some more thoughts on this, which may or may not be useful: We don't actually need Let's Encrypt. A self-signed certificate would work just as well as long as the certificate is referenced in the attestation document. The benefit of Let's Encrypt is that all certificates automatically make it into the Certificate Transparency log, which deters Brave from rolling out malicious enclave images to (a subset of) users: If Brave were to deploy a malicious enclave image, the corresponding certificate would end up in the CT log, i.e., Brave would get caught.

Ideally, we would use self-signed certificates (thus freeing ourselves from Let's Encrypt's rate limit) but we then need another kind of append-only log that documents the evolution of enclave images.
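For concreteness, the enclave-side half of the self-signed approach is simple (a sketch; key type and lifetime are arbitrary choices here):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"log"
	"math/big"
	"time"
)

func main() {
	// Generate an ephemeral key pair inside the enclave; the private key
	// never has to leave it.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		log.Fatal(err)
	}

	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "star-randsrv.bsg.brave.com"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(90 * 24 * time.Hour),
	}
	// Self-signed: template and parent are the same certificate.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		log.Fatal(err)
	}

	// This fingerprint is what the attestation document would reference,
	// replacing the public accountability we currently get from CT logs.
	fp := sha256.Sum256(der)
	fmt.Printf("certificate fingerprint: %x\n", fp)
}
```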

rillian commented 8 months ago

Closing this as it isn't currently an issue with our deployment, and fixes are likely to happen at the nitriding level.