envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Proposal: Envoy mTLS Private Key Protection with HSM #19217

Open giantcroc opened 2 years ago

giantcroc commented 2 years ago

Envoy mTLS Private Key Protection with HSM

Envoy uses BoringSSL for TLS communication, and enabling mTLS protects connections between services from attack and tampering. However, private key management in Envoy is still vulnerable, because the private key is held as clear-text data in memory. This proposal adds an option to enable the PKCS#11 standard in Envoy, together with a Hardware Security Module (HSM), to improve the security of key management within Envoy.

Design

This proposal aims to raise the security level of cryptographic operations and key management inside Envoy. It modifies Envoy so that secrets (the Istio-Agent mTLS private key) are stored securely inside an HSM (SoftHSM as a sample), based on the PKCS#11 standard.

Control plane

The control-plane workflow is shown in the diagram below.

[diagram: controlplane]

The main changes to the workflow are shown by the blue arrows.

1.2, 1.3: The SDS request no longer requires the control plane to generate the key pair and CSR. The SDS response sent to Envoy does not contain the private key directly; it contains the config used to create the HSM private key provider. A possible proto definition of the SDS response is shown below. We define a stage field to distinguish the two SDS responses: stage is set to "init" in the first SDS response to indicate that the HSM private key provider is being created, and to "cert" in the second SDS response to indicate that TLS communication can proceed once the certificate has been received.

message Config {
    string hsm_library = 1;
    string key_label = 2;
    string usr_pin = 3;
    string so_pin = 4;
    string token_label = 5;
    string rsa_key_config = 6;
    string ecdsa_key_config = 7;
    string key_type = 8;
    string stage = 9;
    // csr_config is a string composed of various configuration items
    // provided to envoy, which are used to generate CSR
    string csr_config = 10;
}

1.4, 1.5, 1.6: The HSM private key provider then initializes the HSM context once, and generates the key pair in the HSM and the CSR with the PKCS#11 engine.

1.7: Since Envoy's SDS does not support a CSR resource type, we add a new gRPC that reuses the original SDS channel, such as UDS. The CSR is sent to the control plane in this gRPC request.

1.9, 1.10: After the control plane finishes signing the CSR, the certificate is sent to Envoy in the second SDS response and the HSM private key provider is updated.
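As a rough illustration of the two-stage flow described above, here is a minimal Python sketch. It is not Envoy code: `InMemoryHsm`, `KeyProviderSketch`, and the returned status strings are hypothetical names invented for this example, and only the `key_label`, `stage`, and `csr_config` fields are taken from the proposed Config message. It assumes the "init" stage creates the key in the HSM and the "cert" stage loads it again by label.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Config:
    """Subset of the proposed Config message; field names follow the proto."""
    key_label: str
    stage: str  # "init" or "cert"
    csr_config: str = ""


class InMemoryHsm:
    """Hypothetical stand-in for an HSM: it hands out handles, never key bytes."""

    def __init__(self) -> None:
        self._handles: Dict[str, int] = {}

    def generate_key_pair(self, label: str) -> int:
        handle = len(self._handles) + 1
        self._handles[label] = handle
        return handle

    def find_key(self, label: str) -> int:
        return self._handles[label]  # raises KeyError if no such key exists


class KeyProviderSketch:
    """Hypothetical provider that reacts to the two SDS responses."""

    def __init__(self, hsm: InMemoryHsm) -> None:
        self._hsm = hsm
        self.key_handle: Optional[int] = None
        self.pending_csr: Optional[str] = None
        self.certificate: Optional[str] = None

    def on_sds_response(self, config: Config, certificate: Optional[str] = None) -> str:
        if config.stage == "init":
            # First response: create the key pair in the HSM and prepare a CSR.
            self.key_handle = self._hsm.generate_key_pair(config.key_label)
            self.pending_csr = f"CSR[{config.key_label}; {config.csr_config}]"
            return "csr_ready"
        if config.stage == "cert":
            # Second response: load the existing key by label and store the cert.
            self.key_handle = self._hsm.find_key(config.key_label)
            self.certificate = certificate
            return "tls_ready"
        raise ValueError(f"unknown stage: {config.stage!r}")
```

In both stages the provider only ever holds an integer handle; the placeholder CSR string stands in for a real PKCS#10 request.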

Data plane

The data-plane workflow is shown in the diagram below.

[diagram]

2.3: When BoringSSL performs the TLS handshake for an HTTPS connection, it calls the signing/decryption function in the HSM Private Key Provider.

2.4, 2.5, 2.6: The HSM executes the signing or decryption and returns the result to the HSM Private Key Provider.

2.7: Finally, the HSM Private Key Provider returns the result to BoringSSL.
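The delegation in the data plane can be sketched as follows. This is a toy model under loud assumptions: `FakeSigningHsm` and `HsmPrivateKeyProviderSketch` are hypothetical names, and HMAC-SHA256 is used only as a stand-in for the RSA/ECDSA signature a real HSM would compute through PKCS#11. The point it illustrates is that the provider forwards the operation and holds only a handle.

```python
import hashlib
import hmac
import os
from typing import Dict


class FakeSigningHsm:
    """In-memory stand-in for an HSM; key material never leaves this object."""

    def __init__(self) -> None:
        self._secrets: Dict[int, bytes] = {}

    def generate_key(self) -> int:
        handle = len(self._secrets) + 1
        self._secrets[handle] = os.urandom(32)
        return handle

    def sign(self, handle: int, digest: bytes) -> bytes:
        # HMAC is a placeholder for the asymmetric signature of a real HSM.
        return hmac.new(self._secrets[handle], digest, hashlib.sha256).digest()


class HsmPrivateKeyProviderSketch:
    """Hypothetical provider: called by the TLS stack, delegates to the HSM."""

    def __init__(self, hsm: FakeSigningHsm, handle: int) -> None:
        self._hsm = hsm
        self._handle = handle  # only the handle is stored, not the key

    def sign(self, handshake_data: bytes) -> bytes:
        digest = hashlib.sha256(handshake_data).digest()
        return self._hsm.sign(self._handle, digest)


hsm = FakeSigningHsm()
provider = HsmPrivateKeyProviderSketch(hsm, hsm.generate_key())
signature = provider.sign(b"handshake transcript")
```

A decryption path would follow the same shape: the provider passes the ciphertext and the handle to the HSM and returns whatever comes back.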

Key generation

When Envoy receives the first SDS response, the private key is created in the HSM as part of creating the HSM key provider. When Envoy receives the second SDS response, the private key is loaded from the HSM. In both cases, the output returned is a key handle rather than the plain-text key.

Key retrieval

Following the PKCS#11 standard, id and label are used as the identifiers to fetch the private key. Either one can be used to fetch the same key. The key that is retrieved is a handle rather than the plain-text data block.
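A small sketch of lookup by id or label, assuming a toy in-memory token. In PKCS#11 these identifiers correspond to the `CKA_ID` and `CKA_LABEL` object attributes; the `TokenSketch` and `KeyObject` classes below are hypothetical and do not call any real PKCS#11 library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class KeyObject:
    handle: int
    key_id: bytes  # plays the role of CKA_ID
    label: str     # plays the role of CKA_LABEL


class TokenSketch:
    """Hypothetical token that stores key objects and returns only handles."""

    def __init__(self) -> None:
        self._objects: List[KeyObject] = []

    def add(self, key_id: bytes, label: str) -> int:
        handle = len(self._objects) + 1
        self._objects.append(KeyObject(handle, key_id, label))
        return handle

    def find(self, key_id: Optional[bytes] = None, label: Optional[str] = None) -> int:
        if key_id is None and label is None:
            raise ValueError("provide key_id, label, or both")
        for obj in self._objects:
            if key_id is not None and obj.key_id != key_id:
                continue
            if label is not None and obj.label != label:
                continue
            return obj.handle  # an opaque handle, not key bytes
        raise KeyError("no matching key object")


token = TokenSketch()
created = token.add(b"\x01", "istio-key")
by_id = token.find(key_id=b"\x01")
by_label = token.find(label="istio-key")
```

Both lookups return the same handle, which is what later sign or decrypt calls would take as input.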

We have also raised a proposal in the Istio community: https://github.com/istio/istio/issues/36296. @lizan @ggreenway @asraa @mattklein123

derekguo001 commented 2 years ago

@lizan @ggreenway @asraa @mattklein123 Could you help to review the proposal, please?

derekguo001 commented 2 years ago

As @mattklein123 suggested in https://github.com/envoyproxy/envoy/issues/1106#issuecomment-357450316 , "I would loosely suggest that we offer loadable "HSM" modules that can be plugged in. E.g., a static module could use a local HW HSM, another could call an external API, etc."

Could we do it like this? @htuch @PiotrSikora @ggreenway @ipuustin What do you think about it?

alyssawilk commented 2 years ago

cc @yanavlasov as well

ggreenway commented 2 years ago

Caveat: I've never directly worked with an HSM; I'm making some educated guesses.

Your diagram for the dataplane portion of this looks how I would expect it to work.

For the control plane, does Envoy need to be involved in the flow for generating the key/cert, and getting them signed? That seems like it could be handled entirely out-of-band of Envoy as strictly a control-plane function. I imagine in some setups, the cert/key would be shared among a fleet of envoy instances, so there would be an additional complication of picking 1 envoy to handle this cert/key generation and signing flow; if this were handled 100% in the control plane, it may be simpler in many ways.

yanavlasov commented 2 years ago

I agree with @ggreenway that this proposal assumes tight coupling between the Envoy process and the HSM. This is not going to be true for all cases. I think it is better that the control plane is responsible for the dance of generating and signing the key pair and then just handing Envoy enough information to use the HSM (this will require changes to the SDS response message). For something like SoftHSM, one solution is to implement an SDS proxy that runs on the same machine as Envoy. The SDS proxy will take care of generating and signing the key pair and then tell Envoy via SDS the slot/key to use.

htuch commented 2 years ago

Generally agree with folks above, I'd actually be interested in getting the take from Istio folks on this; they have spent a fair bit of time thinking about how to do CSRs and key issuance with local agents which are then reflected back to Envoy via SDS. @lambdai @howardjohn

howardjohn commented 2 years ago

We have some discussion in https://github.com/istio/istio/issues/36296 regarding this - not sure it's reached resolution though

derekguo001 commented 2 years ago

@howardjohn @htuch We have a solution for that problem. Please refer to https://github.com/istio/istio/issues/36296. About the part in Envoy, could we follow the advice from @ggreenway and @yanavlasov ? Control plane is responsible for generating and signing the key pair and then Envoy uses them.

ggreenway commented 2 years ago

@derekguo001 I have no objection to envoy dataplane support for using an HSM for private-key operations.

derekguo001 commented 2 years ago

Thanks, @ggreenway. When implementing the extension, there are two options:

  1. We implement the HSM private key provider extension as an HSM framework. For each specific type of HSM, there will be a specific HSM plugin as an extension of the HSM framework.
  2. There is no HSM framework. For each specific type of HSM, there will be a specific private key provider extension.

What do you think about it? @ggreenway @yanavlasov

ggreenway commented 2 years ago

For option 1, would there be any meaningful amount of shared code? Or is it just another abstraction layer on top of an abstraction layer?

derekguo001 commented 2 years ago

It's not just another abstraction layer. There could be some meaningful shared code in HSM framework:

In my opinion, option 1 is better than option 2, because we can define a protobuf that includes the PKCS#11-related information and then implement the shared code in the HSM framework. When implementing an HSM plugin, we only need to implement the sign/decrypt interfaces and don't need to care much about how to interact with the control plane.
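To make option 1 concrete, here is a minimal sketch of a framework that owns the shared code (config handling and dispatch) while each plugin implements only sign/decrypt. All names are hypothetical, including the `hsm_type` config key, and the toy plugin performs no real cryptography.

```python
from abc import ABC, abstractmethod
from typing import Dict


class HsmPlugin(ABC):
    """Interface a backend-specific plugin would implement."""

    @abstractmethod
    def sign(self, key_label: str, data: bytes) -> bytes: ...

    @abstractmethod
    def decrypt(self, key_label: str, ciphertext: bytes) -> bytes: ...


class HsmFramework:
    """Shared code: plugin registry plus config-driven dispatch."""

    def __init__(self) -> None:
        self._plugins: Dict[str, HsmPlugin] = {}

    def register(self, name: str, plugin: HsmPlugin) -> None:
        self._plugins[name] = plugin

    def sign(self, config: Dict[str, str], data: bytes) -> bytes:
        # "hsm_type" is an assumed config key used only for this sketch.
        plugin = self._plugins[config["hsm_type"]]
        return plugin.sign(config["key_label"], data)

    def decrypt(self, config: Dict[str, str], ciphertext: bytes) -> bytes:
        plugin = self._plugins[config["hsm_type"]]
        return plugin.decrypt(config["key_label"], ciphertext)


class ToyPlugin(HsmPlugin):
    """Placeholder backend; a real plugin would talk to an actual HSM."""

    def sign(self, key_label: str, data: bytes) -> bytes:
        return key_label.encode() + b":" + data

    def decrypt(self, key_label: str, ciphertext: bytes) -> bytes:
        return ciphertext[::-1]


framework = HsmFramework()
framework.register("toy", ToyPlugin())
result = framework.sign({"hsm_type": "toy", "key_label": "k"}, b"d")
```

Under option 2, each backend would instead carry its own copy of the registry and config logic, which is the duplication the framework is meant to avoid.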

What do you think? @ggreenway

ggreenway commented 2 years ago

That sounds reasonable to me. I haven't looked closely at the details, but from what you describe, I agree that it makes sense to have some shared code here.

The other question is whether we'll end up with multiple backends. My guess is that we will; probably something to interface with the provided services in the various large public cloud providers.

yanavlasov commented 2 years ago

The big question for me is signing of the CSR. As I understand the proposal it will be SDS server responsibility to do this signing? Is this the best approach? Should it be responsibility of the HSM provider to do the signing and communicate with the service that signs CSR, with its own protocol, not tied to SDS?

derekguo001 commented 2 years ago

> That sounds reasonable to me. I haven't looked closely at the details, but from what you describe, I agree that it makes sense to have some shared code here.
>
> The other question is whether we'll end up with multiple backends. My guess is that we will; probably something to interface with the provided services in the various large public cloud providers.

Thanks, @ggreenway.

We will support multiple backends. For example, if there are three kinds of HSM providers, we can implement three HSM plugins.

derekguo001 commented 2 years ago

> The big question for me is signing of the CSR. As I understand the proposal it will be SDS server responsibility to do this signing? Is this the best approach? Should it be responsibility of the HSM provider to do the signing and communicate with the service that signs CSR, with its own protocol, not tied to SDS?

Yes, @yanavlasov. SDS server will use the key pair in HSM to generate the CSR and then send it to CA and get the signed certificate.

The HSM provider can also do this. The problem is that not every HSM provider has this feature. It is not in PKCS#11 standard to send the CSR to CA and get the certificate. It might be better to let SDS server do it.

derekguo001 commented 2 years ago

Hi, @ggreenway @yanavlasov Do you have any more concerns or comments about this?

derekguo001 commented 2 years ago

We had some discussions about the SDS server in the Istio community (https://github.com/istio/istio/issues/36296). There will be an HSM SDS Server which will be responsible for generating and signing the key pair. It's the same as we discussed earlier.

derekguo001 commented 2 years ago

If there is no objection, we will proceed as we discussed. Next we will make a detailed design proposal about HSM framework. If anyone else is interested, let's do it together.

@ggreenway @yanavlasov WDYT?

derekguo001 commented 2 years ago

Hi @htuch @ggreenway @yanavlasov @PiotrSikora for a more detailed explanation of this design proposal, we wrote a design doc, please take a look when you are available https://docs.google.com/document/d/1e5ZjD7kKbDeXLgevu3LkYsAiXtlizEjjTsfnjTxS5Z4

BTW, regarding the HSM in Istio, we have reached an agreement with the Istio community. Here are the issue and the design proposal in Istio:

qiming-007 commented 2 years ago

BTW, this proposal will be based on the SPIRE SDS server, which has been merged in Istio. @htuch @ggreenway @yanavlasov @PiotrSikora

derekguo001 commented 2 years ago

Hi, @htuch @ggreenway @yanavlasov @lizan We have simplified the design proposal and moved all the modifications into extensions. This is the new design proposal: https://docs.google.com/document/d/1Nk171M5hGdxMfMTf573HC2Slp4zPBViZNnk18OTqlRc

After the simplification, there will only be two new extensions: a bootstrap extension and a private key provider plugin, which are proposed as contrib extensions. So we don't need to change any envoy core code path.

yanavlasov commented 2 years ago

@derekguo001 This new proposal makes sense to me. It addresses the main concern that API maintainers had: the xDS protocol, which is pub/sub, was being used to handle the CSR, which does not fit the pub/sub model. The new design makes this transparent to the SDS protocol and as such addresses the only concern.

Let me know if there is anything else we can help with.

derekguo001 commented 2 years ago

@yanavlasov Exactly! It does not fit into pub/sub model. The problem had troubled us for a long time. At last we thought we could use an out-of-band channel.

mattklein123 commented 2 years ago

Yeah this makes sense to me also. Note that there is some overlap with https://github.com/envoyproxy/envoy/issues/18928 (specifically my comment here https://github.com/envoyproxy/envoy/issues/18928#issuecomment-1119720609) in terms of having an API that can generate certs, so it would be nice to see if there is any functionality/design that can be shared.

derekguo001 commented 2 years ago

Hi, @mattklein123 Thanks for your advice.

We discussed the two proposals and the overlap with the owner of https://github.com/envoyproxy/envoy/issues/18928 @LuyaoZhong .

In the TCP Bumping proposal, they plan to use tls_certificate_provider_instance to generate the key/cert. However, as SDS API is the standard implementation for Secret, in current proposal we follow the SDS API to get the key/cert.

After discussion, they will investigate whether it is feasible to use the SDS API in TCP Bumping. If they could use SDS API, it means that they can reuse all the functionality of the current SGX proposal.

yanavlasov commented 2 years ago

@derekguo001 I think the idea is to make it possible to use API for generating and signing the certificate generic enough such that it can be used from other contexts, not just SDS.

Also I have commented on the proposal doc. I think the machinery of establishing two connections to the same SDS service will be difficult and in some cases impossible to implement. Even if you are using the same gRPC client object it may not work. I think that part needs to be redesigned.

I think you should relax the requirement that SDS always provides the cert and make it such that instead of providing the cert, SDS service "tells" Envoy to generate and sign the cert using provided config. In this way you do not need the out of band channel to be connected to the exact same process that handled the initial SDS request.

derekguo001 commented 2 years ago

Hi, @yanavlasov . Please let me explain in detail why we designed it like this and the factors we have considered.

There are some reasons we want SDS Server to provide the key/cert.

  1. The key/cert pairs may be generated automatically. They may also be created by users (for example, when Envoy works as a gateway). We want to handle both cases in the same way. If we let Envoy itself generate the key/cert, we won't be able to use SGX to protect the key/cert from users. In that case, to protect the key/cert from users, we would need to introduce another process which is very similar to the current one in this design proposal.
  2. If we let Envoy generate key/cert, in order to sign the CSR and get the cert, Envoy needs to create another out-of-band channel to CA. This will introduce the similar complication. If Envoy doesn't talk with CA, it would need to send CSR to SDS Server. And then SDS Server would send the CSR to CA. In such case there would still be an out-of-band channel between Envoy and SDS Server.

About the out-of-band channel between Envoy and SDS Server.

As SDS Server is out of scope of Envoy, I didn't explain in detail about the SDS Server in this proposal. In fact, in this proposal the SGX SDS Server is per-node. When Envoy works with Kubernetes, the SGX SDS Server may work as a daemonset. The SGX SDS Server and all the Envoy instances on the same server will share a UDS file which will be used as the out-of-band channel.

In this way, there won't be many UDS files or channels for Envoy instances and SDS Server. There is also a similar proposal about Spiffe/Spire in Istio community which is approved. https://docs.google.com/document/d/1zJP6QJukLzckTbdY42ZMLkulGXz4gWzH9SwOh4xoe0A In that proposal the SDS Server is also per-node. We can use a similar approach in current case. For our case, the proposal about SDS Server is also accepted by Istio community.
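For readers unfamiliar with the idea, the request/response exchange over a local socket can be sketched as below. This is only an illustration under assumptions: `socket.socketpair()` stands in for the shared UDS file, the byte strings are placeholders for a real CSR and certificate, and nothing here reflects the actual gRPC protocol between Envoy and the SDS Server.

```python
import socket


def exchange_csr_for_cert(csr: bytes) -> bytes:
    """Simulate one CSR/certificate round trip over a local socket pair."""
    envoy_end, server_end = socket.socketpair()
    with envoy_end, server_end:
        # Envoy side: send the CSR over the out-of-band channel.
        envoy_end.sendall(csr)

        # SDS Server side: read the CSR and reply with a placeholder "cert".
        # A single recv is enough here only because the message is tiny.
        received = server_end.recv(4096)
        server_end.sendall(b"SIGNED(" + received + b")")

        # Envoy side: read the reply.
        return envoy_end.recv(4096)


cert = exchange_csr_for_cert(b"csr-bytes")
```

Unlike the pub/sub SDS stream, this is a plain request followed by a response, which is why a separate channel was proposed for it.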

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

zhanghw0354 commented 2 years ago

I'm from CMCC (China Mobile Communications Corporation). CMCC provides customers with a full-stack cloud-native product family, helping customers to apply agile, business intelligence, security and trustworthiness. Details can be found at https://ecloud.10086.cn/home/support/cloudnative (sorry, the web page is in Chinese only). Security is vital to us and our customers, and we have been looking for security solutions in the cloud native/service mesh domain. This feature, which leverages Intel SGX technology, can be used to secure private keys in a service mesh. We are very interested in the feature and willing to be the sponsor.

mattklein123 commented 2 years ago

Thanks @zhanghw0354 sounds good to me!