Closed: bpradipt closed this 1 month ago
Do you think this is needed before 0.10.0 release, or can it wait until after?
@stevenhorsman this can wait.
@stevenhorsman @wainersm I added a commit to use consts for some test images and to pull the test images from quay to avoid the rate limiting. I can create a separate PR as well, but thought of including it here as this is already being reviewed. Let me know if you want me to create a separate PR for the above change.
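For illustration, a minimal sketch of what moving the test image references into constants pointing at quay.io could look like; the file name and image paths below are assumptions for illustration, not the actual commit:

```go
// Hypothetical consts in the e2e test package: test image references hosted
// on quay.io instead of docker.io to avoid Docker Hub rate limiting.
// The exact quay.io paths are placeholders, not the images from the commit.
package e2e

const (
	// BusyboxTestImage is a busybox mirror on quay.io (assumed path).
	BusyboxTestImage = "quay.io/prometheus/busybox:latest"
	// NginxTestImage is an nginx image pushed to a quay.io org (assumed path).
	NginxTestImage = "quay.io/confidential-containers/nginx:latest"
)
```

Keeping these in one place would also make a later move to a separate file a small, mechanical change for the tests that use them.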
I'm not against it, but I'm pretty disappointed: I made a similar change back in July and you didn't approve it, as you wanted the images moved out to a separate file, and then when I worked on that and hit problems I was ignored 🤷
Oh, sorry about that. I faced so many issues with the Docker rate limit that I had to manually create and push the image to quay. I'll remove the changes.
For what it's worth, I think the image updates should go in and that they're better than our current code, which is why I made a similar PR. I regret that it was blocked earlier, not that you've done it here.
I have removed the commit, and apologise again. I may have thought about moving the image names to a separate file but instead added them as constants to the same file.
AFAIU, there is no CI-enabled libvirt test that would tell us whether the changes work as expected? If we install the webhook as a default part of the provisioner, would it make sense to execute a simple test in the libvirt e2e suite?
Yeah, makes sense. I'll add one test for libvirt
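As a rough illustration of what such a libvirt e2e check could look like, here is a hedged sketch using client-go; the extended resource name kata.peerpods.io/vm, the kata-remote runtime class, and the image path are assumptions about the webhook's mutation, not code from this PR:

```go
// webhook_smoke_test.go - hypothetical smoke test: create a peer pod and verify
// that the mutating webhook injected the per-node extended resource.
package e2e

import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func TestWebhookInjectsPeerPodResource(t *testing.T) {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		t.Fatalf("load kubeconfig: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		t.Fatalf("create client: %v", err)
	}

	runtimeClass := "kata-remote" // assumed runtime class name
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "webhook-smoke", Namespace: "default"},
		Spec: corev1.PodSpec{
			RuntimeClassName: &runtimeClass,
			Containers: []corev1.Container{{
				Name:    "c",
				Image:   "quay.io/prometheus/busybox:latest", // placeholder image
				Command: []string{"sleep", "3600"},
			}},
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	created, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{})
	if err != nil {
		t.Fatalf("create pod: %v", err)
	}
	defer client.CoreV1().Pods("default").Delete(ctx, created.Name, metav1.DeleteOptions{})

	// If the webhook ran, the container limits should contain the peer-pod
	// extended resource, so scheduling is bounded by the per-node limit.
	limits := created.Spec.Containers[0].Resources.Limits
	if _, ok := limits["kata.peerpods.io/vm"]; !ok {
		t.Errorf("expected extended resource kata.peerpods.io/vm in limits, got %v", limits)
	}
}
```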
There seems to be a problem with the cert-manager installation; see the test run.
I tried running this manually and it seemed to work, so I'll re-run it just in case there was some slowness in the CI.
So in my GitHub-hosted runner trial, I added trace output and got the following error:
time="2024-10-17T12:43:04Z" level=info msg="Installing cert-manager"
time="2024-10-17T12:43:07Z" level=trace msg="/usr/bin/make -C ../webhook deploy-cert-manager, output: make[1]: Entering directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\ncurl -fsSL -o cmctl [https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64\nchmod](https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64/nchmod) +x cmctl\n# Deploy cert-manager\nkubectl apply -f [https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\nerror:](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/nerror:) error validating \"[https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/)": error validating data: failed to download openapi: Get \"http://localhost:8080/openapi/v2?timeout=32s\": dial tcp [::1]:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false\nmake[1]: *** [Makefile:130: deploy-cert-manager] Error 1\nmake[1]: Leaving directory '/home/runner/work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\n"
F1017 12:43:07.161828 18745 env.go:369] Setup failure: exit status 2
FAIL github.com/confidential-containers/cloud-api-adaptor/src/cloud-api-adaptor/test/e2e 419.964s
However, I'm not sure whether this is helpful or whether it's the same issue we might be hitting here, as the GitHub-hosted runner is tight on disk space.
Maybe something is different in the way the GitHub Actions job is running, which is why we see the error only in the GH Actions workflow. I have added a commit to install cert-manager via kcli_cluster.sh. Let's see.
With this different flow, at least the error seems clearer:
mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
error: timed out waiting for the condition on endpoints/cert-manager
Retrying with increased timeout for cert-manager deployment
The initial install of cert-manager succeeded. However, running make -C ../webhook deploy-cert-manager via provision.go failed. That mostly re-applies the yamls, which is pretty quick and should have succeeded. I have increased the kubectl wait timeout and am rechecking.
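For reference, a hedged sketch of what retrying with a longer kubectl wait timeout could look like on the provisioner side; the function name, namespace handling, and timeout values are illustrative assumptions, not the actual change:

```go
package provisioner

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

// deployCertManagerWithRetry is a hypothetical helper: apply the cert-manager
// manifest, then wait for its deployments to become Available, retrying once
// with a longer timeout to tolerate a slow CI runner.
func deployCertManagerWithRetry(kubeconfig string) error {
	manifest := "https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml"

	run := func(args ...string) error {
		cmd := exec.Command("kubectl", args...)
		// Pass the kubeconfig explicitly so kubectl doesn't fall back to localhost:8080.
		cmd.Env = append(os.Environ(), "KUBECONFIG="+kubeconfig)
		if out, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("kubectl %v failed: %w: %s", args, err, out)
		}
		return nil
	}

	if err := run("apply", "-f", manifest); err != nil {
		return err
	}

	// First attempt with a modest timeout, then one retry with a larger one.
	for _, timeout := range []string{"90s", "300s"} {
		err := run("wait", "--for=condition=Available", "deployment", "--all",
			"-n", "cert-manager", "--timeout="+timeout)
		if err == nil {
			return nil
		}
		log.Printf("cert-manager not ready after %s, retrying: %v", timeout, err)
	}
	return fmt.Errorf("cert-manager deployments did not become ready")
}
```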
@stevenhorsman Now I see this error: "Error from server: error when creating "https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml": etcdserver: request timed out". Is installing cert-manager overwhelming the k8s cluster, such that the config needs to be changed?
I think it's more likely that the runner is struggling - the default kcli set-up is 4 vCPUs and 6GB RAM for each of the nodes, but the az-ubuntu-2204 runner only has 4 vCPUs available, so maybe the extra stuff means kcli is starting to try and use more than is available?
@stevenhorsman @mkulke I tried a few options and nothing works in CI with respect to setting up cert-manager with libvirt. It looks like it's something to do with resource availability (see the earlier errors on etcd server timeout). There are no issues in local runs. What's the way forward?
Could we try using a Standard_D8_v4 VM instead to see if that helps, as it might let us rule the resource pressure theory in or out? I guess that would need to be updated in garm?
We could also try testing it locally on a 4 vCPU VM to see if we can reproduce the errors there?
Do you have the CPU and memory spec of the runner?
The current runner is: https://cloudprice.net/vm/Standard_D4s_v4
Yes, I guess we can do that, but I'm not sure we want the instance to be bumped to 2x the size; I think the same runner pool is also used by different jobs. We could create a discrete pool, az-ubuntu2204-large or something, and switch the runner type on the libvirt workflow.
But that puts our plans to switch to GitHub-hosted runners on hold, I suppose? I'm a bit surprised to see that we cannot make it work with 16GB of RAM.
I don't know whether CPU or RAM is the bottleneck, so I figured just trying 8 vCPUs and 32GB would tell us whether it's a resource issue at all, and then (if we can find matching profiles) trying 8 vCPUs and 16GB and/or 4 vCPUs and 32GB RAM would tell us which one is under pressure. I can try those combinations manually with VMs next week though, to see if that gives more info.
I bumped the pool's instance type to Standard_D8s_v4.
That didn't help when re-running the test. I would suggest copying the workflow, throwing everything CoCo-related out, and just trying to install cert-manager after kubeadm has set up the cluster.
I also checked it locally on a 4 vCPU, 16GB VM and the cert-manager install worked fine there, so maybe the resource usage idea is a bust?
I have also been unable to recreate this locally. The reason for guessing it's due to resource availability is the etcd server timeout error: https://github.com/confidential-containers/cloud-api-adaptor/pull/2066#issuecomment-2422275138
@mkulke in one of the previous runs, I had installed cert-manager just after the kubeadm cluster install, but the CI provisioner still failed while checking for the service endpoints: https://github.com/confidential-containers/cloud-api-adaptor/pull/2066#issuecomment-2421636234
I added more logs and some hacks to retry. Let's see.
I logged the errors instead of using Trace, and see this. It's trying to access localhost, which might indicate an issue with the kubeconfig.
time="2024-10-21T13:55:30Z" level=info msg="Error in install cert-manager: exit status 2: make[1]: Entering directory '/home/runner/actions-runner/_work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\ncurl -fsSL -o cmctl [https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64\nchmod](https://github.com/cert-manager/cmctl/releases/latest/download/cmctl_linux_amd64/nchmod) +x cmctl\n# Deploy cert-manager\nkubectl apply -f [https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\nerror:](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/nerror:) error validating \"[https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml\](https://github.com/jetstack/cert-manager/releases/download/v1.15.3/cert-manager.yaml/)": error validating data: failed to download openapi: Get \"http://localhost:8080/openapi/v2?timeout=32s\": dial tcp 127.0.0.1:8080: connect: connection refused; if you choose to ignore these errors, turn validation off with --validate=false\nmake[1]: *** [Makefile:130: deploy-cert-manager] Error 1\nmake[1]: Leaving directory '/home/runner/actions-runner/_work/cloud-api-adaptor/cloud-api-adaptor/src/webhook'\n"
@mkulke @stevenhorsman the KUBECONFIG setting was the culprit. The previous run was successful. Redoing it again by tidying up the commits.
@wainersm @stevenhorsman this is an attempt to deploy the webhook as part of the installation so that the number of peer pods that can be created is constrained by the per-node limit. A docker-provider-specific test is added as well. I can enable the test for other providers, but before doing that I wanted to get your feedback on the approach.