edgelesssys / contrast

Deploy and manage confidential containers on Kubernetes
https://docs.edgeless.systems/contrast
GNU Affero General Public License v3.0

emojivoto-demo: vote-bot deployment not coming up, image takes very long to pull #588

Closed blenessy closed 1 week ago

blenessy commented 1 week ago

Using Contrast v0.7.0, I tried the emojivoto example again. Everything went smoothly (as described in the documentation), except that the vote-bot pod never started (confirmed with k9s).

A quick check in emojivoto-demo.yaml (attached) reveals that the io.katacontainers.config.agent.policy annotation is missing from the vote-bot Deployment. Is the vote-bot Deployment supposed to be part of the TCB or not? In the documentation it is rendered in a different color (yellow), which suggests that it is NOT supposed to be trusted. Either way, it is not starting properly.

emojivoto-demo.yml.gz

katexochen commented 1 week ago

The vote-bot in the demo simulates a user of the public service, and is not part of the confidential deployment. As such, it doesn't need a policy, and it also doesn't have a runtimeClass set.

Before v0.7.0, the policy generation tool couldn't filter by runtime class when annotating policies, so it added a policy to the bot even though the bot didn't have the Contrast runtimeClass. This is now fixed.
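To illustrate the filtering idea, here is a minimal sketch in Python. It is not Contrast's actual implementation: the helper names `pod_spec` and `needs_policy`, the runtime class prefix `contrast-cc`, and the two trimmed-down manifests are assumptions made for illustration. Only the annotation key is taken from this thread.

```python
# Annotation key as mentioned in this issue.
POLICY_ANNOTATION = "io.katacontainers.config.agent.policy"


def pod_spec(manifest):
    """Return the pod spec of a Pod, or of a workload with a pod template."""
    if manifest.get("kind") == "Pod":
        return manifest.get("spec", {})
    return manifest.get("spec", {}).get("template", {}).get("spec", {})


def needs_policy(manifest, runtime_class_prefix="contrast-cc"):
    """True only if the workload requests a runtime class with the given prefix.

    The prefix is an assumption; adjust it to the runtime class in your cluster.
    """
    runtime_class = pod_spec(manifest).get("runtimeClassName") or ""
    return runtime_class.startswith(runtime_class_prefix)


# Hypothetical, heavily trimmed manifests.
web = {
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {"template": {"spec": {"runtimeClassName": "contrast-cc"}}},
}
vote_bot = {
    "kind": "Deployment",
    "metadata": {"name": "vote-bot"},
    "spec": {"template": {"spec": {}}},  # no runtimeClassName set
}

for manifest in (web, vote_bot):
    action = "annotate" if needs_policy(manifest) else "skip"
    print(f'{manifest["metadata"]["name"]}: {action} {POLICY_ANNOTATION}')
```

With a filter like this, a workload without a runtimeClassName, such as the vote-bot, is skipped and receives no policy annotation.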

I'm now trying to understand why the bot isn't starting for you. Can you provide the pod logs or the pod description?

blenessy commented 1 week ago

Please hold off, @katexochen. It seems to have started now. It was lagging (by minutes) behind the other nodes. I'll try to investigate why the lag occurred ...

blenessy commented 1 week ago

Managed to dump all logs.

logs.tgz

For some reason, kubectl describe did not include this event. Fortunately, I noted it before:

Name:             vote-bot-64598fd555-z4h4l
...
Events:
  Type    Reason   Age   From     Message
  ----    ------   ----  ----     -------
  Normal  Pulled   51m   kubelet  Successfully pulled image "ghcr.io/3u13r/emojivoto-web:coco-1" in 20m23.926s (20m23.926s including waiting)
  Normal  Created  51m   kubelet  Created container vote-bot
  Normal  Started  51m   kubelet  Started container vote-bot

It looks like it took ~20m to pull ghcr.io/3u13r/emojivoto-web:coco-1, which definitely explains the problem, @katexochen.
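For anyone who wants to spot slow pulls like this in saved event output, here is a rough Python sketch. It assumes the kubelet "Pulled" message has the shape shown above. The helper names and the 5-minute threshold are made up for illustration, and only the h, m, s and ms duration units are handled.

```python
import re

# Matches the kubelet "Pulled" event message shown above.
PULLED_RE = re.compile(
    r'Successfully pulled image "(?P<image>[^"]+)" in (?P<duration>\S+)'
)

# Seconds per unit; "ms" comes before "m" in the regex below so it wins.
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3}


def parse_duration(text):
    """Convert a duration such as '20m23.926s' into seconds."""
    total = 0.0
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)(ms|h|m|s)", text):
        total += float(value) * UNIT_SECONDS[unit]
    return total


def slow_pulls(event_messages, threshold_seconds=300.0):
    """Return (image, seconds) for every pull slower than the threshold."""
    slow = []
    for message in event_messages:
        match = PULLED_RE.search(message)
        if match is None:
            continue
        seconds = parse_duration(match.group("duration"))
        if seconds > threshold_seconds:
            slow.append((match.group("image"), seconds))
    return slow


messages = [
    'Successfully pulled image "ghcr.io/3u13r/emojivoto-web:coco-1" '
    "in 20m23.926s (20m23.926s including waiting)",
    "Created container vote-bot",
]

for image, seconds in slow_pulls(messages):
    print(f"{image}: {seconds:.0f}s")
```

Running this over the event above flags the emojivoto-web image with a pull time of roughly 1224 seconds.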

Quickly checked GitHub Status and there were no issues at the time.

My best guess is that it might be related to the CentralIndia Azure region, which I'm currently testing in.

blenessy commented 1 week ago

When trying to reproduce in a fresh AKS cluster, the vote-bot got deployed in nodegroup2 - like the rest of the services and it worked immediately. In my first attempt it got deployed to nodegroup1.

katexochen commented 1 week ago

> When trying to reproduce in a fresh AKS cluster, the vote-bot got deployed in nodegroup2 - like the rest of the services and it worked immediately. In my first attempt it got deployed to nodegroup1.

As the pod does not have any specific requirements, it's fine for Kubernetes to schedule it on either node group 1 or 2.

blenessy commented 1 week ago

As nothing points to a problem in Contrast, I'm closing this issue.

katexochen commented 1 week ago

Thanks for the report anyway @blenessy. :)