confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach
Apache License 2.0

Deploy webhook by default and enable e2e #2066

Closed bpradipt closed 1 month ago

bpradipt commented 1 month ago

@wainersm @stevenhorsman this is an attempt to deploy the webhook as part of the installation so that the number of peer pods that can be created is constrained by the per-node limit. A docker provider-specific test is added as well. I can enable the test for other providers, but wanted to get your feedback on the approach before doing so.
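As a rough illustration of the mechanism (assuming, as in the peer-pods design, that the webhook mutates kata pods to request the kata.peerpods.io/vm extended resource so that scheduling is bounded by each node's advertised capacity), the limit and its usage can be inspected along these lines:

# Check the per-node peer-pod capacity advertised on a node
# (resource name assumed; <node-name> is a placeholder):
kubectl describe node <node-name> | grep -i "kata.peerpods.io/vm"

# A pod mutated by the webhook should carry the extended-resource request:
kubectl get pod <kata-pod> -o jsonpath='{.spec.containers[0].resources.requests}'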

stevenhorsman commented 1 month ago

Do you think this is needed before 0.10.0 release, or can it wait until after?

bpradipt commented 1 month ago

@stevenhorsman this can wait.

bpradipt commented 1 month ago

@stevenhorsman @wainersm I added a commit to use consts for some test images and to use test images from quay to avoid the rate limiting. I can create a separate PR as well, but thought of including it here since this is already being reviewed. Let me know if you want me to create a separate PR for the above change.

stevenhorsman commented 1 month ago

> @stevenhorsman @wainersm I added a commit to use consts for some test images and to use test images from quay to avoid the rate limiting. I can create a separate PR as well, but thought of including it here since this is already being reviewed. Let me know if you want me to create a separate PR for the above change.

I'm not against it, but I'm pretty disappointed that I made a similar change back in July and you didn't approve it, as you wanted them moved out to a separate file, and then when I worked on that and hit problems I was ignored 🤷

bpradipt commented 1 month ago

> > @stevenhorsman @wainersm I added a commit to use consts for some test images and to use test images from quay to avoid the rate limiting. I can create a separate PR as well, but thought of including it here since this is already being reviewed. Let me know if you want me to create a separate PR for the above change.
>
> I'm not against it, but I'm pretty disappointed that I made a similar change back in July and you didn't approve it, as you wanted them moved out to a separate file, and then when I worked on that and hit problems I was ignored 🤷

Oh, sorry about that. I faced so many issues with the docker rate limit that I had to manually create and push the image to quay. I'll remove the changes.
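For reference, the manual mirroring referred to above is typically something like the following sketch (the image name and quay org are placeholders, not the ones actually used):

# Pull from Docker Hub once, then retag and push to quay.io so CI
# pulls stop hitting the Docker Hub rate limit:
docker pull docker.io/library/nginx:latest
docker tag docker.io/library/nginx:latest quay.io/<org>/nginx:latest
docker push quay.io/<org>/nginx:latest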

stevenhorsman commented 1 month ago

> Oh, sorry about that. I faced so many issues with the docker rate limit that I had to manually create and push the image to quay. I'll remove the changes.

For what it's worth, I think the image updates should go in, and I think they're better than our current code, which is why I made a similar PR. I regret that it was blocked earlier, not that you've done it here.

bpradipt commented 1 month ago

> > > @stevenhorsman @wainersm I added a commit to use consts for some test images and to use test images from quay to avoid the rate limiting. I can create a separate PR as well, but thought of including it here since this is already being reviewed. Let me know if you want me to create a separate PR for the above change.
> >
> > I'm not against it, but I'm pretty disappointed that I made a similar change back in July and you didn't approve it, as you wanted them moved out to a separate file, and then when I worked on that and hit problems I was ignored 🤷
>
> Oh, sorry about that. I faced so many issues with the docker rate limit that I had to manually create and push the image to quay. I'll remove the changes.

I have removed the commit, and apologise again. I may have thought about moving the image names to a separate file and instead added them as constants to the same file.

bpradipt commented 1 month ago

> afaiu, there is no CI-enabled libvirt test that would tell us whether the changes work as expected? If we install the webhook as a default part of the provisioner, would it make sense to execute a simple test in the libvirt e2e suite?

Yeah, makes sense. I'll add a test for libvirt.

mkulke commented 1 month ago

there seems to be a problem with the cert-manager installation, see test run

stevenhorsman commented 1 month ago

> there seems to be a problem with the cert-manager installation, see test run

I tried running this manually and it seemed to work, so I'll re-run it just in case there was some slowness in the CI.

stevenhorsman commented 1 month ago

So in my gh hosted runner trial, I added trace output and got the following error:

time="2024-10-17T12:43:04Z" level=info msg="Installing cert-manager"
time="2024-10-17T12:43:07Z" level=trace msg="/usr/bin/make -C ../webhook deploy-cert-manager, output: make[1]: Entering directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\ncurl -fsSL -o cmctl [https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64\nchmod](https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64/nchmod) +x cmctl\n# Deploy cert-manager\nkubectl apply -f [https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\nerror:](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/nerror:) error validating \"[https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/)": error validating data: failed to download openapi: Get \"http://localhost:8080/openapi/v2?timeout=32s\": dial tcp [::1]:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false\nmake[1]: *** [Makefile:130: deploy-cert-manager] Error 1\nmake[1]: Leaving directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\n"
F1017 12:43:07.161828   18745 env.go:369] Setup failure: exit status 2
FAIL    github.com/confidential-containers/cloud-api-adaptor/src/cloud-api-adaptor/test/e2e 419.964s

however I'm not sure if this is helpful, or whether it's the same issue we might be hitting here, as the gh-runner is tight on disk space
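One detail worth noting in that trace: when kubectl has no usable kubeconfig it falls back to http://localhost:8080, and the connection-refused error above is the classic symptom of that. A quick, illustrative check:

# Verify what kubectl is actually pointed at in the failing environment:
echo "KUBECONFIG=${KUBECONFIG:-<unset>}"
kubectl config current-context || echo "no current context configured"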

bpradipt commented 1 month ago

> So in my gh hosted runner trial, I added trace output and got the following error:
>
> time="2024-10-17T12:43:04Z" level=info msg="Installing cert-manager"
> time="2024-10-17T12:43:07Z" level=trace msg="/usr/bin/make -C ../webhook deploy-cert-manager, output: make[1]: Entering directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'
> curl -fsSL -o cmctl https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64
> chmod +x cmctl
> # Deploy cert-manager
> kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml
> error: error validating "https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml": error validating data: failed to download openapi: Get "http://localhost:8080/openapi/v2?timeout=32s": dial tcp [::1]:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false
> make[1]: *** [Makefile:130: deploy-cert-manager] Error 1
> make[1]: Leaving directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'"
> F1017 12:43:07.161828   18745 env.go:369] Setup failure: exit status 2
> FAIL    github.com/confidential-containers/cloud-api-adaptor/src/cloud-api-adaptor/test/e2e 419.964s
>
> however I'm not sure if this is helpful, or whether it's the same issue we might be hitting here, as the gh-runner is tight on disk space

Maybe something is different in the way the gh action runs things, which is why we see the error only in the gh action workflow. I have added a commit to install cert-manager via kcli_cluster.sh. Let's see.
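A minimal sketch of what installing cert-manager from a cluster-setup script such as kcli_cluster.sh might look like (the version pin matches the logs above; gating on the deployments is an assumption about how readiness would be checked):

# Install cert-manager right after the cluster comes up:
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml
# Block until the cert-manager deployments report Available before the
# provisioner moves on to deploying the webhook:
kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=300s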

bpradipt commented 1 month ago

With this different flow, at least the error seems clearer:

mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
error: timed out waiting for the condition on endpoints/cert-manager

Retrying with an increased timeout for the cert-manager deployment.

bpradipt commented 1 month ago

> With this different flow, at least the error seems clearer:
>
> mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
> validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
> error: timed out waiting for the condition on endpoints/cert-manager
>
> Retrying with an increased timeout for the cert-manager deployment.

The initial install of cert-manager succeeded. However, running make -C ../webhook deploy-cert-manager via provision.go failed. That mostly reapplies the same yamls, which is quick and should have succeeded. I've increased the kubectl wait timeout and am rechecking.
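A hedged sketch of that retry-with-longer-timeout idea (the CI waits on endpoints/cert-manager per the error above; this sketch waits on the deployments instead, and the timeout and retry counts are illustrative, not the PR's actual values):

for i in 1 2 3; do
    # Longer timeout than before; break out as soon as cert-manager is up.
    kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=600s && break
    echo "attempt $i: cert-manager not ready yet, retrying..."
    sleep 30
done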

bpradipt commented 1 month ago

@stevenhorsman Now I see this error: "Error from server: error when creating https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml: etcdserver: request timed out". Is installing cert-manager overwhelming the k8s cluster, so that the config needs to be changed?

stevenhorsman commented 1 month ago

> @stevenhorsman Now I see this error: "Error from server: error when creating https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml: etcdserver: request timed out". Is installing cert-manager overwhelming the k8s cluster, so that the config needs to be changed?

I think it's more likely that the runner is struggling - the default kcli set-up is 4 vCPUs and 6GB RAM for each of the nodes, but the az-ubuntu-2204 runner only has 4 vCPUs available, so maybe the extra load means kcli is starting to try to use more than is available?
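Some quick, illustrative checks that would help confirm or rule out resource starvation on the runner (kubectl top assumes metrics-server is installed, which may not be the case here):

nproc; free -m                 # vCPUs and RAM available on the runner itself
virsh list --all               # kcli-created node VMs competing for those resources
kubectl describe nodes | grep -A 5 "Allocated resources"
kubectl top nodes || true      # per-node usage, if metrics-server is present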

bpradipt commented 1 month ago

@stevenhorsman @mkulke I tried a few options and nothing works in CI with respect to setting up cert-manager with libvirt. It looks like it's something to do with resource availability (see the earlier errors on etcd server timeouts). There are no issues in local runs. What's the way forward?

stevenhorsman commented 1 month ago

> @stevenhorsman @mkulke I tried a few options and nothing works in CI with respect to setting up cert-manager with libvirt. It looks like it's something to do with resource availability (see the earlier errors on etcd server timeouts). There are no issues in local runs. What's the way forward?

Could we try using a Standard_D8_v4 VM instead to see if that helps, as it might let us rule the resource pressure theory in or out? I guess that would need to be updated in garm?

We could also try testing it locally on a 4 vCPU VM to see if we can reproduce the errors there?

bpradipt commented 1 month ago

> > @stevenhorsman @mkulke I tried a few options and nothing works in CI with respect to setting up cert-manager with libvirt. It looks like it's something to do with resource availability (see the earlier errors on etcd server timeouts). There are no issues in local runs. What's the way forward?
>
> Could we try using a Standard_D8_v4 VM instead to see if that helps, as it might let us rule the resource pressure theory in or out? I guess that would need to be updated in garm?
>
> We could also try testing it locally on a 4 vCPU VM to see if we can reproduce the errors there?

Do you have the CPU and memory spec of the runner?

stevenhorsman commented 1 month ago

> Do you have the CPU and memory spec of the runner?

The current runner is: https://cloudprice.net/vm/Standard_D4s_v4

mkulke commented 1 month ago

Yes, I guess we can do that, but I'm not sure we want the instance to be bumped to 2x the size; I think the same runner pool is also used by different jobs. We could create a discrete pool like az-ubuntu2204-large and switch the runner type on the libvirt workflow.

But that puts our plans to switch to github-hosted runners on hold, I suppose? I'm a bit surprised to see that we cannot make it work with 16GB of RAM.

stevenhorsman commented 1 month ago

> Yes, I guess we can do that, but I'm not sure we want the instance to be bumped to 2x the size; I think the same runner pool is also used by different jobs. We could create a discrete pool like az-ubuntu2204-large and switch the runner type on the libvirt workflow.
>
> But that puts our plans to switch to github-hosted runners on hold, I suppose? I'm a bit surprised to see that we cannot make it work with 16GB of RAM.

I don't know whether the CPU or RAM is the bottleneck, so I figured just trying 8 vCPUs and 32GB would let us know if it's resources at all, and then (if we can find matching profiles) trying 8 vCPUs and 16GB and/or 4 vCPUs and 32GB RAM would let us know which one is under pressure. I can try those combinations manually with VMs next week, though, to see if that gives more info.
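For the local side of that experiment, the candidate sizings could be spun up with kcli along these lines (VM names and image are illustrative; kcli's memory parameter is in MB):

kcli create vm -i ubuntu2204 -P numcpus=8 -P memory=32768 runner-8x32
kcli create vm -i ubuntu2204 -P numcpus=8 -P memory=16384 runner-8x16
kcli create vm -i ubuntu2204 -P numcpus=4 -P memory=32768 runner-4x32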

mkulke commented 1 month ago

> I don't know whether the CPU or RAM is the bottleneck, so I figured just trying 8 vCPUs and 32GB would let us know if it's resources at all, and then (if we can find matching profiles) trying 8 vCPUs and 16GB and/or 4 vCPUs and 32GB RAM would let us know which one is under pressure. I can try those combinations manually with VMs next week, though, to see if that gives more info.

I bumped the pools instance type to Standard_D8s_v4

mkulke commented 1 month ago

That didn't help when re-running the test. I would suggest copying the workflow, throwing everything coco-related out, and just trying to install cert-manager after kubeadm sets up the cluster.


stevenhorsman commented 1 month ago

I also checked it on a 4 CPU, 16GB VM locally and the cert-manager install worked fine there, so maybe the resource usage idea is a bust?

bpradipt commented 1 month ago

> I also checked it on a 4 CPU, 16GB VM locally and the cert-manager install worked fine there, so maybe the resource usage idea is a bust?

I have also been unable to recreate this locally. The reason for guessing it's due to resource availability is the etcd server timeout error: https://github.com/confidential-containers/cloud-api-adaptor/pull/2066#issuecomment-2422275138

@mkulke in one of the previous runs, I had installed cert-manager just after the kubeadm cluster install, but the CI provisioner still failed while checking for the service endpoints: https://github.com/confidential-containers/cloud-api-adaptor/pull/2066#issuecomment-2421636234

I added more logs and some hacks to retry. Let's see.

bpradipt commented 1 month ago

I logged the errors instead of using Trace, and I see this. It's trying to access localhost, which might indicate an issue with the kubeconfig.

time="2024-10-21T13:55:30Z" level=info msg="Error  in install cert-manager: exit status 2: make[1]: Entering directory '/home/runner/actions-runner/_work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\ncurl -fsSL -o cmctl [https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64\nchmod](https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64/nchmod) +x cmctl\n# Deploy cert-manager\nkubectl apply -f [https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\nerror:](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/nerror:) error validating \"[https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/)": error validating data: failed to download openapi: Get \"http://localhost:8080/openapi/v2?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false\nmake[1]: *** [Makefile:130: deploy-cert-manager] Error 1\nmake[1]: Leaving directory '/home/runner/actions-runner/_work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\n"
bpradipt commented 1 month ago

@mkulke @stevenhorsman the KUBECONFIG setting was the culprit. The previous run was successful. Redoing it again after tidying up the commits.
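For the record, the shape of the fix is making sure the kcli cluster's kubeconfig is visible to the make invocation; a minimal sketch (the path follows kcli's usual layout, and <cluster-name> is a placeholder, so this is illustrative rather than necessarily what the PR does):

# Point kubectl at the kcli cluster before deploying cert-manager, so it
# no longer falls back to localhost:8080:
export KUBECONFIG="$HOME/.kcli/clusters/<cluster-name>/auth/kubeconfig"
make -C ../webhook deploy-cert-manager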