Guest Identity Brainstorming

Over the years, we have had a number of conversations about introducing an identity mechanism to Confidential Containers. Identity is one of the most subtle parts of confidential computing, so these conversations have mainly resulted in confusion. Identity is an overloaded term, so it's not exactly clear what we are aiming for (if anything). I think the first step to implementing something will be agreeing on what we actually want in terms of identity. Let me try to summarize a few of the things that have been mentioned so far.

SPIFFE/SPIRE

We have talked about SPIFFE/SPIRE support since even before we were a CNCF project. SPIFFE/SPIRE are widely used and at some level seem like a great fit for Confidential Containers. It could make sense for pods/containers or even individual guest components to have a SPIFFE identity. Unfortunately, the technical architecture of SPIRE is not very easy to line up with Confidential Containers.

Workload Type/Class

Another way we've talked about workload identity is as a field that would be used to tell the KBS which resources to provision. This is really more of a workload class than a workload identity although it could be scoped to an individual workload. This kind of field comes up when we are trying to push resources into the guest from the KBS. Currently we just identify every resource by its own individual resource URI. This is fine, but it could be a good idea to map certain resources onto a workload. This would simplify the provisioning of the guest (the client wouldn't need to request a bunch of individual resources) and would allow the KBS to more easily limit access to resources to fixed configurations.
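To make the workload class idea a bit more concrete, here is a minimal Python sketch of a KBS-side table that maps a class name onto a fixed set of resource URIs. `WorkloadClassRegistry`, the class name `inference`, and the example URIs are all made up for illustration; nothing like this exists in the KBS today.

```python
from dataclasses import dataclass, field


@dataclass
class WorkloadClassRegistry:
    """Hypothetical table mapping a workload class to the resources it may receive."""

    classes: dict[str, list[str]] = field(default_factory=dict)

    def register(self, workload_class: str, resource_uris: list[str]) -> None:
        # Store a copy so later changes to the caller's list do not leak in.
        self.classes[workload_class] = list(resource_uris)

    def resources_for(self, workload_class: str) -> list[str]:
        # The guest asks once per class instead of once per resource.
        return list(self.classes.get(workload_class, []))

    def is_allowed(self, workload_class: str, resource_uri: str) -> bool:
        # Access is limited to the fixed configuration registered for the class.
        return resource_uri in self.classes.get(workload_class, [])


registry = WorkloadClassRegistry()
registry.register(
    "inference",  # illustrative class name
    ["kbs:///default/image-key/1", "kbs:///default/model-key/1"],  # illustrative URIs
)
print(registry.resources_for("inference"))
```

A guest that presents the class `inference` would get both resources in one request, and any URI outside that list would be refused.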

VM Instance

Sometimes people have an instinct that they should track a certain instance of a VM. For example, they might want to make sure that the VM they request from a CSP is actually the one that they are accessing. We have long maintained that this doesn't actually matter or is at least not a question of confidentiality. As long as you get a VM that is attested correctly, you don't need to worry about "which one" it is. Furthermore we are mainly trying to efface the VM abstraction rather than expose it. I think this still stands, but maybe with peer pods, where podvms are rented from the cloud (which is potentially expensive), we actually do care more about keeping track of the VMs.

Evidence Factory Attack Prevention

We've realized that having generic evidence makes us more susceptible to so-called evidence factory attacks where the valid evidence of a podvm is used to authenticate with a KBS that does not belong to the workload provider. There are a number of different ways to mitigate this, but many proposals involve differentiating the evidence in some way. The evidence could be scoped to a particular KBS, to a particular workload, or to a particular instance.
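One way to picture "differentiating the evidence" is binding a scope into the report data that the TEE signs. The sketch below is only an illustration in Python: `scoped_report_data` and `kbs_accepts` are hypothetical helpers, the hardware signature check is left out entirely, and the scope string could stand for a KBS, a workload, or an instance.

```python
import hashlib
import hmac


def scoped_report_data(nonce: bytes, kbs_public_key: bytes, scope_id: str) -> bytes:
    """Hash that the guest would place in the report-data field of its evidence."""
    digest = hashlib.sha256()
    for part in (nonce, kbs_public_key, scope_id.encode()):
        # Length-prefix each part so different splits cannot collide.
        digest.update(len(part).to_bytes(4, "big"))
        digest.update(part)
    return digest.digest()


def kbs_accepts(report_data: bytes, nonce: bytes, own_public_key: bytes, scope_id: str) -> bool:
    """A KBS only accepts evidence whose report data was scoped to itself."""
    expected = scoped_report_data(nonce, own_public_key, scope_id)
    return hmac.compare_digest(report_data, expected)


nonce = b"fresh-nonce"
evidence = scoped_report_data(nonce, b"kbs-A-public-key", "workload-1")
print(kbs_accepts(evidence, nonce, b"kbs-A-public-key", "workload-1"))  # True
print(kbs_accepts(evidence, nonce, b"kbs-B-public-key", "workload-1"))  # False
```

Evidence produced for KBS A is useless at KBS B because the recomputed hash no longer matches, which is the property the scoping proposals are after.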

Hardware Identity

Some platforms have IDs for individual VMs or ways of deriving unique keys inside of a guest. In some ways it might seem valuable to have an id derived from the hw root of trust, but the design of these key derivation APIs can be a little bit limited imo, and it's hard to know how to reconcile differences between platforms.

It's not clear to me how these different concepts might fit together. Three recurring questions come to mind. First, what should the scope of the ID be? Some of these ideas are scoped to a VM, others to a pod or a container. Second, how should the id relate to the root of trust? Should it be measured? Third, where does the id come from? Is it pushed from the control plane or generated inside a guest or does it come from a KBS or something else?

What other ideas about identity are there? Does one of these models seem most significant?

@fitzthum Thanks very much for bringing this up and for the context! I want to share some points of view of my own. Let me try to put them in simple words.

Background: Confidential Containers Threat Model

We have some good documents about CoCo's threat model.

About the personas: https://github.com/confidential-containers/confidential-containers/blob/main/trust_model_personas.md
Threat vectors raised by @magowan, which I highly agree with: https://github.com/confidential-containers/confidential-containers/issues/118

To discuss the guest identity problem, we must first agree on a threat model.

Let's simplify the roles in CoCo:

untrusted host: includes orchestration (like Kubernetes) and infrastructure (like the CSP)
workload provider: users who define the k8s pod description YAML. They are the direct users of CoCo who aim to deploy containers (workloads) protected by a TEE. To keep it simple, here we do not distinguish between the container image provider and the workload provider.
data owner: users who provide the input data to the workload. If we say that workload providers provide the "code" to the TEE, then data owners provide the "data" for the TEE to compute on.

From my point of view, CoCo can ONLY help

the workload provider believe that the untrusted host cannot call any unexpected software control APIs into the TEE to read/write data inside the TEE. This is done by remote attestation (evidence shows hardware/software info about the TEE, and reference values show what non-malicious hardware/software should be).

CoCo CANNOT help

block unexpected reads/writes from the untrusted host via hardware vulnerabilities.
help the data owner build trust in the workload provider. All CoCo can do here is provide the data owner with evidence that includes the measured workload. At that point all a data owner knows is a measurement of the workload. Whether the data owner trusts the workload is not covered by CoCo. More concretely, the workload provider might publish golden values of its workloads on its official web page, etc.

It is very important to make the threat model clear to avoid some overthinking.

Now I think we can agree that after remote attestation, the workload provider can trust everything inside the guest, which relies on the fact that everything inside the guest is covered by the remote attestation evidence.

Thus, it is safe for the workload provider to store confidential resources inside the guest. This is also what we are doing.

Related attacks

container escaping

Regarding the Evidence Factory Attack, one attack vector that has been described is that an escaped container can read/write things inside the guest. It can also perform another remote attestation to access the KBS for confidential resources. IMO this is not even an attack. As the containers running inside the guest are defined/provided by the workload provider, an escaped container can only access the workload provider's injected confidential resources and the state of the guest, which has been attested.

Another scenario is that a container provided by the untrusted host runs inside the guest and escapes. This can also be filtered by a container policy.json. Currently the container launch order prevents such an attack.

Evidence Factory Attack

This is a core attack related to guest identity. If no identity is introduced, there will be a bunch of guests with the same evidence and thus the same privilege to access the KBS. What we want is that even if the base TCBs are the same, the guests can be distinguished by the workloads defined to run inside them.

A sketch of an overall design

First, let's recap the remote attestation.

CoCo's core aim is to protect the confidentiality and integrity of data/code. Therefore there must be a time when some confidential data or code is injected into the guest. This is very important. If no confidential data/code were injected into the guest, we would not even need CoCo.

Therefore there must be a time when the workload provider/data owner establishes that the guest environment can be trusted. This is done by remote attestation. A very important point is that remote attestation gives the guest the privilege to access the confidential data (or, put the other way around, approves the guest to have confidential data/code injected into it; only the subject and object of the description are swapped).

Let's summarize the process: the guest wants to access confidential code/data from the workload provider/data owner. The guest does RA to convince the workload provider/data owner. The guest then has the privilege to access the confidential code/data.

We can see that remote attestation and resource access authorization are strongly bound, and the identity of the guest determines what resources can be accessed. Or we can say the identity of the guest determines the privilege of the guest.

Then, let's start to design. First, some principles:

There are mainly two types of subjects that want to access the confidential resources: the privileged components in the guest, like kata-agent and the confidential data hub, and the workloads, such as application containers. They are different subjects and SHOULD have different privileges. Thus a layered authentication scheme like SPIFFE/SPIRE should be used.
The identity must be included in the attestation evidence. The reason is as mentioned: remote attestation and resource access authorization are strongly bound. If identity is not included in the evidence, the evidence can be abused to mount an evidence factory attack. There are different ways to include identity in the evidence: i. include it in a HOSTDATA-like field of the TEE. This applies to TDX/SNP/SGX, but not to SEV. ii. include it via a vTPM's evidence. At first glance this applies to all architectures.
If we use a KBS, the KBS public key cert must be provisioned in the guest before remote attestation occurs. This prevents the guest from connecting to a fake KBS. The KBS cert should also be measured. This prevents the untrusted host from injecting a malicious KBS cert that the guest then connects to. Most importantly, the KBS must have some audit mechanism to know whether a specific guest has connected.
The KBS needs some refactoring. As we can see, the KBS has different functionalities: resource keeper and access authorization via RA. We should decouple the access authorization part. Let's call the access authorization part the "authorization service".
It is recommended that the guest id be provided by the workload provider. In the context of confidential container images, it is the workload provider who authorizes the guest to access the image. Thus logically the id should: i. either be defined by the workload provider, so that when RA occurs the workload provider's authorization service can authorize the guest based on the id; ii. or be defined by the untrusted host, in which case there must be a way for the workload provider to learn the id before RA occurs. From the point of view of convenience, option i seems better.
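As a rough illustration of these principles, the Python sketch below shows an authorization service that is separate from resource storage, looks up a tenant-defined guest id carried in a HOSTDATA-like field, and keeps an audit log of guests that connected. `Evidence`, `AuthorizationService` and all field names are hypothetical, and verification of the hardware signature over the evidence is assumed to have happened already.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    launch_measurement: str  # measurement of the base TCB
    host_data: bytes         # HOSTDATA-like field set at launch, fixed afterwards


class AuthorizationService:
    """Hypothetical service split out of the KBS: decides access, stores nothing else."""

    def __init__(self, reference_measurement: str) -> None:
        self.reference_measurement = reference_measurement
        self.grants: dict[bytes, set[str]] = {}
        self.audit_log: list[bytes] = []

    def register_guest(self, guest_id: str, resources: set[str]) -> None:
        # The workload provider defines the id before the guest is launched.
        self.grants[hashlib.sha256(guest_id.encode()).digest()] = set(resources)

    def authorize(self, evidence: Evidence, resource: str) -> bool:
        if evidence.launch_measurement != self.reference_measurement:
            return False  # the base TCB is not what the tenant expects
        allowed = self.grants.get(evidence.host_data)
        if allowed is None:
            return False  # generic or unknown identity: no privileges
        self.audit_log.append(evidence.host_data)  # record that this guest connected
        return resource in allowed


service = AuthorizationService(reference_measurement="golden")
service.register_guest("tenant-a/pod-1", {"image-key"})
good = Evidence("golden", hashlib.sha256(b"tenant-a/pod-1").digest())
anonymous = Evidence("golden", bytes(32))
print(service.authorize(good, "image-key"), service.authorize(anonymous, "image-key"))
```

Two guests with the same base TCB now end up with different privileges, because only the one whose evidence carries a registered id is granted anything.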

Then let's look at a draft of the process. Some names and components are new, but I think that having read this far they can be understood.

One thing omitted is that if the privileged components inside the guest, like kata-agent and CDH, want to access resources, the guest identity document/SVID should be used.

As for the authentication protocol based on remote attestation, I think the KBS protocol can be a good base. The SVID can be a cert, which can be used both for resource access (see https://github.com/confidential-containers/kbs/issues/143) and for signing container SVIDs.

Undoubtedly, attestation in confidential computing can be used to help address these issues. Compared to authentication, attestation allows the Verifier to check the execution environment of the party sending the Evidence. Authentication, on the other hand, can only verify the credentials of the sending party to determine its identity, even if the sending party should never have possessed those credentials (e.g., credentials stolen by an attacker). Ideally, attestation should assist authentication, but currently CoCo lacks an approach for establishing a unique TEE instance-level identity.

From the perspective of the Root of Trust for Reporting (RTR), TEE attestation does not offer much more than TPM attestation. Let's see how a TPM addresses the identity of a TPM instance. Before leaving the factory, the TPM is provisioned in a secure environment with its unique identity information, the Endorsement Key (EK) certificate. During the provisioning phase, the TPM generates an Attestation Key (AK) to sign TPM Quotes (aka the TPM's Evidence) and initiates a registration request to the Registrar service. Then, through the classic tpm2_makecredential and tpm2_activatecredential processes, the TPM proves to the Registrar service that it indeed possesses the EK. Finally, the Registrar service issues the certified AK certificate to the TPM. When the TPM sends a TPM Quote signed by the AK to the Attestation Service (AS), the AS can thoroughly verify the TPM Quote as long as it has legitimate EK and AK certificates obtained from the Registrar service as endorsements.

Of course, the significant difference between TPM and TEE attestation lies in the fact that TPM is a device while TEE is an instance. But this is precisely the crux of the problem: we need to address the identity of the TEE instance, not the identity of the TEE platform. Therefore, we can draw inspiration from TPM's solution to address the identity problem at the TEE instance level.

To solve this problem, it is necessary to ensure that the entire lifecycle of the TEE instance includes the following three processes:

Identity provisioning process: Requires the ability to securely and reliably provision a unique identity identifier to an anonymous TEE instance.
Identity registration process: Requires TEE instances with unique identity identifiers to submit identity registration information to a trusted registration service.
Identity authentication process: Requires verification of the identity of the TEE instance represented by the Attester through a trusted Attestation Service.

Currently, CoCo still lacks the first two processes.

The following diagram provides a brief explanation of the first and third processes, which I previously designed for TDX: [image: Screenshot 2023-09-04 10 11 02]

The identity registration process involves submitting registration requests to the Registrar service, which can be implemented using existing CoCo Attestation, such as adding Registrar functionality on the KBS side.

A few comments. I think these are both reasonable (and fairly similar) approaches. I'm still not sure exactly what our target should be. One thing that might help us focus is if we can figure out some concrete use cases where some kind of identity is required.

A few things to note. First of all, right now we do have a confidential identity mechanism but it isn't formalized. Basically, the confidential identity of the guest is defined as the set of secrets/resources that the KBS sends to the guest. You could argue that there is also a non-confidential guest identity defined by the workload yaml.

@Xynnn007 mentions the relationship between the workload provider and the data owner. I think this should maybe be left to a separate discussion. So far we have mainly been thinking in terms of host/guest. This is natural given that TEEs really only provide for two contexts. Nonetheless we should probably think about how multiple entities can establish trust relationships inside the guest (if it is possible). Maybe identity could help with this in some way, but I think we should keep the discussions separate for now.

One thing to note is that we can resolve evidence factory attacks without adding any identity mechanism. Generic guests are not inherently non-secure. The key is that generic evidence must only be available in a privileged context. Our project doesn't really meet this requirement because the attack surface of the container->guest API is very large. As I've mentioned in the past though, poisoning the evidence before containers start could resolve this problem (with some tradeoffs).
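To show what poisoning the evidence could look like, here is a small Python sketch of an extend-only measurement register. `MeasurementRegister` is a stand-in for a runtime measurement register or a vTPM PCR, and the event string is invented; it is a sketch of the idea, not of any existing CoCo code.

```python
import hashlib


class MeasurementRegister:
    """Extend-only register: the value can be folded forward but never reset."""

    def __init__(self) -> None:
        self.value = bytes(48)  # all zeroes at boot

    def extend(self, event: bytes) -> None:
        # new value = SHA-384(old value || SHA-384(event))
        event_digest = hashlib.sha384(event).digest()
        self.value = hashlib.sha384(self.value + event_digest).digest()


register = MeasurementRegister()
generic_value = register.value            # what privileged guest components attest with
register.extend(b"containers-starting")   # poison right before containers are launched
poisoned_value = register.value           # all a container can obtain after a breakout
print(generic_value != poisoned_value)    # True
```

Because there is no operation that undoes an extend, evidence gathered after the poisoning step can never again match the generic value that the KBS expects from the privileged context. The tradeoff is that nothing in the guest can re-attest generically after that point either.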

Another thing I would push back on is that @Xynnn007 mentions that the public key of the KBS should be measured. I think it's worth digging into what that would actually guarantee us. Measuring the KBS identity would scope the evidence to the KBS, which could be valuable, but that just prevents someone from reusing evidence (like in the case of a container breakout) with multiple KBSes. It wouldn't fundamentally guarantee that a workload is ever connected to the correct KBS. Why not? Because the KBS/AS is the one that checks the measurement. This is circular. There are some extra complications, but I think this is a minor part of the proposal and we have probably discussed before so I will just leave it there.

I think the comparison that @jiazhang0 makes with the identity of a TPM is very interesting. I will need to think about this a bit more. In my experience it can actually be quite cumbersome to manage the fixed identities of TPM devices. I have also been working a bit on attestation for confidential vTPMs lately. Provisioning an identity to a vTPM or confidential vTPM is also a challenge.

Let me use the simple term "tenant".

"As I've mentioned in the past though, poisoning the evidence before containers start could resolve this problem (with some tradeoffs)."

Agreed. What I want to express is that we can poison the evidence by using the init data injection mechanism. My first thought is to reuse and extend the runtime policy API SET_POLICY.

But logically we must be consistent: how the evidence is poisoned should be known and determined by the tenant, or the host can fake it. If we agree on this, it follows that we need a component to administer and check whether evidence has been poisoned in a pre-defined way. Let us call that component the authorization service.

About measuring the KBS cert

Yes, agreed that the verifier of the measurement is the KBS/AS, which leaves a circular situation.

Maybe we should revisit the needs of the tenant. The tenant mainly wants to check whether the guest is launched as expected.

if the tenant wants to know whether the guest is launched as expected by way of attestation, there should be a timer in the authorization service. If no attestation request including a specific id arrives within a given time, the tenant knows that the guest was not launched as expected.
if the tenant wants to know whether the guest is launched as expected without attestation, the tenant can access its own service and check.
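The timer idea from the first option can be sketched in a few lines of Python. `LaunchTracker` is a hypothetical piece of the authorization service, and time is passed in explicitly so that the logic is easy to follow and test.

```python
class LaunchTracker:
    """Tracks guest ids that are expected to attest before a deadline."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._deadlines: dict[str, float] = {}

    def expect(self, guest_id: str, now: float) -> None:
        # The tenant registers the id when it asks the host to launch the guest.
        self._deadlines[guest_id] = now + self.timeout_seconds

    def record_attestation(self, guest_id: str, now: float) -> bool:
        # Accept only ids that are expected and still within their deadline.
        deadline = self._deadlines.get(guest_id)
        if deadline is None or now > deadline:
            return False
        del self._deadlines[guest_id]
        return True

    def overdue(self, now: float) -> list[str]:
        # Ids that never attested in time: the guest was not launched as expected.
        return sorted(g for g, d in self._deadlines.items() if now > d)


tracker = LaunchTracker(timeout_seconds=60)
tracker.expect("pod-a", now=0)
tracker.expect("pod-b", now=0)
print(tracker.record_attestation("pod-a", now=30))  # True
print(tracker.overdue(now=61))                      # ['pod-b']
```

The tenant would poll `overdue` (or be notified) to learn that `pod-b` never showed up with the expected id.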

@fitzthum I think the essential problem we are trying to address is still the evidence factory attack, because a TEE instance/Pod is anonymous at the beginning. Hardware identity cannot resolve this issue because it identifies a TEE platform rather than a single TEE instance. As for VM instances, people will have (though they do not currently) a similar requirement to resolve the evidence factory attack. KBS Type/Class is a separate topic about fine-grained resource authorization and multi-tenancy support, which are general security design patterns not specific to confidential computing. With TEE instance identity support, the workload identity can be built on top of it. TPM usage is just an example for understanding the process of identity provisioning. Confidential vTPM is another use case, but I admit it has a similar problem with secure identity provisioning.

So I suggest we draw inspiration from SPIFFE identity to implement a TEE instance/Pod level identity for CoCo as the initial target. It can establish trust in the anonymous TEE instance/Pod using attestation, registration and certification at runtime rather than factory provisioning as with a TPM. The conceptual workflow is as follows:

1) The guest owner generates a CSR with an OTP (One-Time Passcode) and anything else necessary, and then deploys it via a K8s manifest.
2) The VMM injects the CSR into the TEE and then launches it.
3) The AA launches an RA to report the CSR and other claims for verification, in particular ensuring that the OTP is valid and then revoked.
4) The Registrar service certifies the TEE instance according to the verification result for the CSR.
5) The TEE presents its TEE instance certificate in each resource request to uniquely identify itself.
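The registrar side of this workflow could be sketched roughly as below. This is only a toy in Python: `Registrar` is hypothetical, an HMAC tag stands in for a real X.509 certificate, the attestation result is reduced to a boolean, and having the registrar itself issue the OTP is my assumption, since the steps do not say who generates it.

```python
import hashlib
import hmac
import secrets
from typing import Optional


class Registrar:
    """Toy registrar: hands out one-time passcodes and certifies TEE instances."""

    def __init__(self) -> None:
        self._signing_key = secrets.token_bytes(32)
        self._valid_otps: set[str] = set()

    def issue_otp(self) -> str:
        # Step 1: the guest owner obtains an OTP to deploy next to the CSR.
        otp = secrets.token_hex(16)
        self._valid_otps.add(otp)
        return otp

    def certify(self, csr_subject: str, otp: str, evidence_ok: bool) -> Optional[bytes]:
        # Steps 3-4: certify only when attestation passed and the OTP is unused.
        if not evidence_ok or otp not in self._valid_otps:
            return None
        self._valid_otps.discard(otp)  # revoke so the OTP cannot be replayed
        return self._sign(csr_subject)

    def verify(self, csr_subject: str, certificate: bytes) -> bool:
        # Step 5: check the instance certificate presented with a resource request.
        return hmac.compare_digest(certificate, self._sign(csr_subject))

    def _sign(self, csr_subject: str) -> bytes:
        # An HMAC tag stands in for a real certificate signature in this sketch.
        return hmac.new(self._signing_key, csr_subject.encode(), hashlib.sha256).digest()


registrar = Registrar()
otp = registrar.issue_otp()
certificate = registrar.certify("tee-instance-1", otp, evidence_ok=True)
replay = registrar.certify("tee-instance-2", otp, evidence_ok=True)
print(certificate is not None, replay is None)  # True True
```

The second `certify` call fails because the OTP was revoked on first use, so a second guest with identical evidence cannot obtain an instance identity from the same manifest.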

One interesting thing about these proposals is that they are creating a more complex relationship between the host and the KBS. The host and the KBS have to coordinate the identity. This might be inevitable to get some properties, but it has some downsides. I think one fundamental question is whether identity should be driven more by the host or by the KBS. The first approach might be more in line with traditional Kubernetes ideas because it leans on the control plane, kind of like the new sealed secrets feature. Driving things from the KBS side might be simpler, though.


Coming from a discussion on slack: https://cloud-native.slack.com/archives/C039JSH0807/p1728241478038709

I raised the need for "user authentication" in the KBS, to protect against unauthorized TEEs. As the above conversation is quite complicated and nuanced, I'd like to make our scenario / use case clear:

We are the "container image provider". We have a publicly reachable KBS server deployed on our servers. Our encrypted Docker image is publicly available. We now want to control which "workload providers" can execute the workload on their TEE and which can't. We also want to be able to change that in the future (removing access, given that the workload provider has a copy of the Docker image). We do not want to give blanket policy allowances but we want to whitelist individual workload providers. We do not want to build a custom Docker image per workload provider. All should get the same image. We "simply" want to hand out authentication credentials to individual people or groups of people so that we can manage the access to our Docker image, even in the scenario where we handed the image out to an external host.
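To make the ask concrete, here is a minimal Python sketch of the access control we have in mind on the KBS side. `ImageKeyPolicy` is hypothetical, the attestation result is reduced to a boolean, and how the credential reaches the KBS together with the attestation request is deliberately left open.

```python
import hashlib
import secrets


class ImageKeyPolicy:
    """Per-provider allowlist for releasing the key of one shared encrypted image."""

    def __init__(self) -> None:
        self._providers: dict[str, str] = {}  # credential fingerprint -> provider name

    @staticmethod
    def _fingerprint(token: str) -> str:
        # Store only a hash so the KBS does not keep credentials in the clear.
        return hashlib.sha256(token.encode()).hexdigest()

    def grant(self, provider: str) -> str:
        token = secrets.token_urlsafe(32)
        self._providers[self._fingerprint(token)] = provider
        return token  # handed to the workload provider out of band

    def revoke(self, provider: str) -> None:
        # Removing access later works even though the provider keeps the image.
        self._providers = {fp: p for fp, p in self._providers.items() if p != provider}

    def may_release_key(self, token: str, evidence_ok: bool) -> bool:
        # Both conditions must hold: a valid TEE and an allowed provider.
        return evidence_ok and self._fingerprint(token) in self._providers


policy = ImageKeyPolicy()
token_a = policy.grant("provider-a")
print(policy.may_release_key(token_a, evidence_ok=True))   # True
policy.revoke("provider-a")
print(policy.may_release_key(token_a, evidence_ok=True))   # False
```

Every provider runs the same image. Only the credential differs, and the decryption key is withheld once the credential is revoked.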

A (valid) suggestion that came up was: "Implement the authentication yourself from within your container". Technically, this is a seemingly easy solution and it keeps the KBS from having to implement such an authentication mechanism. The issue is that a common constraint we have with our customers and partners is that our Docker container cannot have any internet access (no outgoing traffic). This includes any such authentication communication. Having this as part of the KBS moves it to a third party. We can deliver our algorithm/image as expected and still benefit from control over authentication.

Does that fit into your discussion?

@Spenhouet

Currently there are a couple ways to differentiate pods that use the same encrypted container.

First, we have init-data, which allows the host to assign configuration values to a guest. These values are part of the attestation report and cannot be modified from inside the guest. While you don't trust the host, you could use it to provision a UUID or some other value, which is then validated by the resource policy in the KBS. Since there are a lot of different thoughts about identity, we don't yet implement one identity mechanism. Instead, init-data offers a generic way to send any sort of measured data to the KBS policy. Like I said, you could put some kind of unique identity here, or a workload class, or a workload owner id, or really anything. The bad news is that this isn't upstream in Kata yet.
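Here is a rough Python sketch of how a policy could use such a measured value. It does not reflect the actual init-data format or the KBS policy language: the JSON encoding, the claim name `init_data_digest`, the `uuid` key and both helper functions are assumptions made up for this example.

```python
import hashlib
import json


def initdata_digest(initdata: dict) -> str:
    """Digest of the init-data document, as it would be bound into the evidence."""
    canonical = json.dumps(initdata, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()


def resource_policy(claims: dict, initdata: dict, allowed_uuids: set[str]) -> bool:
    """Release a resource only to guests whose measured init-data names a known UUID."""
    # The init-data sent alongside the request must match what was measured.
    if claims.get("init_data_digest") != initdata_digest(initdata):
        return False
    # The policy is free to key on any field: a UUID, a class, an owner id, ...
    return initdata.get("uuid") in allowed_uuids


initdata = {"uuid": "1b4e28ba-2fa1-11d2-883f-0016d3cca427", "workload_class": "inference"}
claims = {"init_data_digest": initdata_digest(initdata)}  # would come from the TEE evidence
allowed = {"1b4e28ba-2fa1-11d2-883f-0016d3cca427"}
print(resource_policy(claims, initdata, allowed))  # True
```

If the host swaps in a different UUID after launch, the digest in the evidence no longer matches and the policy refuses the request.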

Another thing you might want to look into is the sealed secret feature. On the surface sealed secrets are just a way to store wrapped secrets in an untrusted control plane, but there are also some deeper implications. Sealed secrets can be unwrapped with the help of an HSM, which the CDH connects to from inside the guest. This allows us to decouple secret delivery from the KBS to some degree. For example, you could have the KBS be owned by party A and the HSM by party B. Party B would give party A credentials to access the KBS. To unseal a secret the guest would request the credentials from the KBS of party A and then use those credentials to connect to the HSM of party B, which would be able to do access control. This doesn't quite map onto your scenario with the encrypted image, but it's something to think about.
