confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach
Apache License 2.0

azure: document how to use on confidential VMs with `cc_kbc_az_snp_vtpm` #1000

Open katexochen opened 1 year ago

katexochen commented 1 year ago

Would be nice if there was a follow-up document for more advanced use cases that could be linked at the end of the current azure README. I'm still struggling to set this up correctly.

bpradipt commented 1 year ago

@katexochen currently I'm doing the following to use vTPM attestation. It's a bit raw, but I'm sharing it here in case it's helpful before we have proper documentation.

My use case requires using two KBCs:

  1. offline_fs_kbc - for downloading container images inside the podvm from an authenticated registry. This secret is shared with all the nodes in the cluster, so I just inject it via the config drive during pod VM start
  2. cc_kbc_az_snp_vtpm - for downloading workload secrets from KBS.

So I build two attestation-agent binaries from the latest AA code, and my podvm image has the following:

  1. /usr/local/bin/attestation-agent : this uses offline_fs_kbc and is executed by kata-agent/image-rs. This AA is built with `LIBC=gnu ttrpc=true KBC=offline_fs_kbc make`.
  2. /usr/local/bin/attestation-agent-vtpm : this uses cc_kbc_az_snp_vtpm and is started via attestation-agent.service in the podns namespace so that the workload container can access it (a rough sketch of such a unit follows below). This AA is built with `LIBC=gnu KBC=cc_kbc_az_snp_vtpm make`.
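
For reference, a minimal sketch of what such a unit could look like. This is an assumption on my side: the exact ExecStart arguments, paths and ordering depend on the AA build and on how/when the podns namespace is created.

# Hypothetical attestation-agent.service for the vtpm AA
[Unit]
Description=Attestation Agent (cc_kbc_az_snp_vtpm) inside the podns network namespace

[Service]
# Run in podns so the workload container can reach the AA gRPC listeners
# (presumably the 127.0.0.1:50000/50001 listeners shown in the netstat output below)
ExecStart=/usr/sbin/ip netns exec podns /usr/local/bin/attestation-agent-vtpm
Restart=on-failure

[Install]
WantedBy=multi-user.target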

I built a new binaries container image that includes the two AA binaries and used this custom binaries container image to generate the podvm image.

It might be possible to use a single AA binary to support multiple KBCs. @mkulke did mention this earlier, but I've yet to test it.

bpradipt commented 1 year ago

Output from my test pod:

bash-5.1# netstat -nltp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:50000         0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:50001         0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      1/nginx: master pro
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      1/nginx: master pro

bpradipt commented 1 year ago

cc @surajssd @mkulke

katexochen commented 1 year ago

@bpradipt Thanks for sharing. I'm not sure if I get the problem you are solving with those two KBCs. Is there any reason you couldn't put the registry secret into the cc_kbc_az_snp_vtpm AA?

bpradipt commented 1 year ago

@bpradipt Thanks for sharing. I'm not sure if I get the problem you are solving with those two KBCs. Is there any reason you couldn't put the registry secret into the cc_kbc_az_snp_vtpm AA?

I think it should be possible. But I need to find out a working combination.

surajssd commented 1 year ago

@bpradipt Thanks for sharing. I'm not sure if I get the problem you are solving with those two KBCs. Is there any reason you couldn't put the registry secret into the cc_kbc_az_snp_vtpm AA?

What @bpradipt is doing: he uses offline_fs_kbc to download the secret that the local container runtime uses to authenticate with the container registry and pull the container images (which are not encrypted), while cc_kbc_az_snp_vtpm is used to download secrets consumed by the workload.

For him to use cc_kbc_az_snp_vtpm for images behind an authenticated registry, he would have to encrypt the image, push it to a registry that allows anonymous pulls, and then use cc_kbc_az_snp_vtpm for image pulling.
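
For the record, that encrypt-and-push step is typically done with skopeo and an ocicrypt keyprovider. A very rough sketch, not from this thread: the keyprovider name, config path, key id format and registry names below are all placeholders and depend on how the keyprovider is configured.

# Illustrative only: encrypt the image via an ocicrypt keyprovider and push it
# to a registry that allows anonymous pulls (all names/ids here are placeholders)
export OCICRYPT_KEYPROVIDER_CONFIG=/path/to/ocicrypt.conf
skopeo copy --insecure-policy \
  --encryption-key provider:attestation-agent:keyid=kbs:///default/image-key/1 \
  docker://private.example.com/app:latest \
  docker://public.example.com/app:encrypted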

katexochen commented 1 year ago

@bpradipt I tried to set this up as you described. Did you also encounter the error I describe in #1037 when requesting resources from the AA in the podns?

bpradipt commented 1 year ago

@katexochen we faced the same issue. The reason is that all traffic from podns is routed via the worker node (over the vxlan tunnel), which is why the IMDS request fails with a 500. The IMDS traffic needs to go out via the pod VM itself.

Sometime back we were having slack discussions on the same thing - https://cloud-native.slack.com/archives/C04A2EJ70BX/p1683052146802139

There are two solutions which we are exploring:

  1. Create the podVM with two NICs and ensure only traffic on the pod network goes via the vxlan tunnel; all other traffic goes via the podVM
  2. Use NAT to route IMDS traffic via the podVM - this was initially suggested by @surajssd and is simple

The following setup on the podVM will fix the IMDS issue. I haven't yet figured out the best way to automate this. One thought is to include it under https://github.com/confidential-containers/cloud-api-adaptor/blob/staging/podvm/qcow2/misc-settings.sh

# Set up a veth pair
ip link add veth2 type veth peer name veth1

# Move the veth endpoint to pod namespace
ip link set veth2 netns podns

# Assign an IP to the host endpoint of the pair and bring it up
ip addr add 192.168.100.1/30 dev veth1
ip link set up dev veth1

# Similarly assign IP on the pod network ns endpoint of the pair and bring it up
ip netns exec podns ip addr add 192.168.100.2/30 dev veth2
ip netns exec podns ip link set up dev veth2

# Enable IP forwarding so the host (pod VM root namespace) can route traffic
sysctl -w net.ipv4.ip_forward=1

# Enable NATing on the host
iptables -t nat -A POSTROUTING -s 192.168.100.0/30 -o eth0 -j MASQUERADE

# Inside the pod net ns, create a route for this particular host traffic to travel over the veth pair
ip netns exec podns ip route add 169.254.169.254/32 via 192.168.100.1 dev veth2
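
A quick way to check this from inside podns (hedged example; the Azure IMDS endpoint requires the Metadata header):

# Verify IMDS is reachable from podns via the veth/NAT path
ip netns exec podns curl -s -H "Metadata: true" \
  "http://169.254.169.254/metadata/instance?api-version=2021-02-01"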

katexochen commented 1 year ago

Thanks @bpradipt, the mentioned code works great. :blush:

I haven't yet figured out the best way to automate this. One thought is to include it under https://github.com/confidential-containers/cloud-api-adaptor/blob/staging/podvm/qcow2/misc-settings.sh

As far as I understand, that file is executed through packer at podvm build time, right? However, the podns is only created at runtime through a systemd unit, so I guess we can't set this up in misc-settings.sh.

My idea would be to add this as a oneshot systemd unit, as it is only required for Azure (I think?), and include custom systemd unit files during the build. We can add a files dir to the azure directory and copy the files in copy-files.sh. Something like the sketch below.
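
A minimal sketch of what that could look like, added e.g. via azure/files and copy-files.sh. The unit name, script path, and ordering against whatever unit creates podns are assumptions and would need checking.

# Hypothetical oneshot unit; setup-imds-nat.sh would hold the veth/NAT commands from above
cat <<'EOF' > /etc/systemd/system/setup-imds-nat.service
[Unit]
Description=Route IMDS traffic from podns via the pod VM (veth pair + NAT)
Before=kata-agent.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/setup-imds-nat.sh

[Install]
WantedBy=multi-user.target
EOF
systemctl enable setup-imds-nat.service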

katexochen commented 1 year ago

Not sure if this should always happen, as it might not be required for other AAs. Maybe we would need a mechanism to build different podvm images based on which AA should be included, like you already mentioned in https://github.com/confidential-containers/cloud-api-adaptor/issues/1036#issuecomment-1573807962.

mkulke commented 1 year ago

I think an Azure-specific oneshot unit for NAT would be a good provisional setup. It might require a bit of ceremony, like the suggested ./azure/files folder, but that's tolerable IMO. Eventually a solution for secure key release at runtime and a new approach for layer decryption will emerge from the kbs + image projects, and we can switch to that later.

bpradipt commented 1 year ago

How about using logic similar to the existing one in misc-settings.sh https://github.com/confidential-containers/cloud-api-adaptor/blob/staging/podvm/qcow2/misc-settings.sh#L19 and creating a oneshot systemd unit, or putting the instructions in ExecStartPre of kata-agent.service?
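
If the ExecStartPre route is preferred, a drop-in could look roughly like this (file names are assumed; setup-imds-nat.sh would contain the veth/NAT commands from the earlier comment):

# Hypothetical drop-in for kata-agent.service
mkdir -p /etc/systemd/system/kata-agent.service.d
cat <<'EOF' > /etc/systemd/system/kata-agent.service.d/10-imds-nat.conf
[Service]
ExecStartPre=/usr/local/bin/setup-imds-nat.sh
EOF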

This change is also required for AWS to access IMDS from within the pod.

I think this is important for the 0.6.0 release so I'll push a PR quickly. Let me know if you all think otherwise.

However eventually this issue needs to be resolved to avoid duplication - https://github.com/confidential-containers/cloud-api-adaptor/issues/899

mkulke commented 1 year ago

It's probably worth looking at the IMDS topic comprehensively. A peer pod accessing its k8s node's metadata endpoint might be desired behaviour. A pod might inherit IAM privileges this way.

The VCEK can also be retrieved from AMD's KDS (providing the chip id from the SNP report as a parameter). The KDS is just subject to stricter rate limits, which is why the Attestation Agent currently retrieves the VCEK from the metadata service. There are many other conceivable ways to buffer the VCEK; maybe it's stored as a file on podvm startup, ...
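
For context, a KDS lookup looks roughly like this; the product name and the SPL/chip id values below are placeholders that would come from the SNP report:

# Illustrative KDS request: CHIP_ID_HEX is the hex-encoded CHIP_ID from the SNP report,
# and the SPL query parameters are the reported TCB values
curl -s -o vcek.der \
  "https://kdsintf.amd.com/vcek/v1/Milan/${CHIP_ID_HEX}?blSPL=3&teeSPL=0&snpSPL=8&ucodeSPL=115"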

katexochen commented 1 year ago

logic similar to the existing one in misc-settings.sh

I'd say we can do this for a quick fix, but for the long run, using a big script with if/case statements isn't a great solution IMO.

However eventually this issue needs to be resolved to avoid duplication - https://github.com/confidential-containers/cloud-api-adaptor/issues/899

IMO we can keep this open until we have a long term solution.

bpradipt commented 1 year ago

logic similar to the existing one in misc-settings.sh

I'd say we can do this for a quick fix, but for the long run, using a big script with if/case statements isn't a great solution IMO.

Agreed, it already looks unmaintainable ;) - https://github.com/bpradipt/cloud-api-adaptor/commit/f599faf53af270aded8fe92a0d9e78f64f34a19d

I'm ok to drop the above approach if you prefer an alternative and want to send a PR.

However eventually this issue needs to be resolved to avoid duplication - #899

IMO we can keep this open until we have a long term solution.

katexochen commented 1 year ago

I'm ok to drop the above approach if you prefer an alternative and want to send a PR.

I'm okay with merging it like this for now. :)