stevenhorsman opened 1 month ago
@wainersm, @mkulke do you know the size of the az-ubuntu-2204 runners (that I think are created by garm) at the moment, as a reference point?
https://cloudprice.net/vm/Standard_D4s_v4
Currently we run the tests on a 4 vCPU, 16 GB RAM machine.
That's very interesting, as if it's a 4x16 machine then it's the same size as the GitHub-hosted runners (and it might explain some of the libvirt CI flakiness, as we try to squeeze 10 vCPUs and 20 GB RAM out of a 4x16 box!). Maybe we can try out the libvirt e2e on a self-hosted runner now... (I'll be back with results)
It failed (https://github.com/stevenhorsman/cloud-api-adaptor/actions/runs/11331308718/job/31511037253) with:
time="2024-10-14T16:25:22Z" level=info msg="Installing peerpod-ctrl"
F1014 16:25:22.856817 22292 env.go:369] Setup failure: exit status 2
Error: No space left on device : '/home/runner/runners/2.320.0/_diag/pages/ca098555-ffe8-4649-a147-31a3909d57d3_9f5eba70-d33b-5377-96f2-d94c82946629_1.log'
The GH runners have 14 GB of storage, so maybe that isn't enough, and it might be another path to investigate.
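One way to surface this earlier (a sketch under assumptions, not an existing step in the workflow) is a pre-flight disk check in the job, so the run fails with a clear message before the install step rather than mid-way through. The helper name `check_free_space` and the 10 GB threshold are invented for illustration:

```shell
# Hypothetical pre-flight helper for a CI step: report free disk space on a
# mount point and return non-zero if it is below a minimum (in GB).
check_free_space() {
    min_gb="$1"
    mount="${2:-/}"
    free_kb=$(df -Pk "$mount" | awk 'NR==2 {print $4}')
    free_gb=$((free_kb / 1024 / 1024))
    echo "Free space on ${mount}: ${free_gb} GB"
    [ "$free_gb" -ge "$min_gb" ]
}

# Example: warn (or fail the job) if fewer than 10 GB are free before the
# peerpod-ctrl install. The 10 GB figure is an assumption, not a measured need.
check_free_space 10 / || echo "WARNING: low disk space, the e2e run may fail"
```

Running this at the top of the e2e job would turn the cryptic "No space left on device" into an actionable message.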
Currently we build the kbs-client with Rust, which can produce a surprisingly large target folder. We could either clean that up or download the kbs-client via oras?
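For the clean-up option, a hedged sketch: keep only the built binary and delete the Rust target tree afterwards. The helper name `slim_kbs_build` and the checkout path are invented; adjust to wherever checkout-kbs.sh actually clones and builds.

```shell
# Hypothetical post-build step: retain just the kbs-client binary and
# reclaim the (potentially multi-GB) Rust target directory.
slim_kbs_build() {
    kbs_dir="$1"    # assumed checkout location of the kbs repo
    du -sh "${kbs_dir}/target"                               # show how big the build tree got
    install -m 0755 "${kbs_dir}/target/release/kbs-client" ./kbs-client
    rm -rf "${kbs_dir}/target"                               # free the space
}

# Usage (assumed path): slim_kbs_build ./kbs
```

This keeps the build approach intact while removing most of its disk cost on the runner.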
download the kbs-client via oras
Yeah, I think that would be great. I'll try out the e2e tests without the KBS section and see if that helps and also re-run it once the caching PR is merged 😃
Ooh - cutting out the KBS deployment and test meant the gh-runner tests worked: https://github.com/stevenhorsman/cloud-api-adaptor/actions/runs/11331632234/job/31512084686 😃
That's great. If we change this line:
to
oras pull "ghcr.io/confidential-containers/staged-images/kbs-client:sample_only-x86_64-linux-gnu-${KBS_SHA}"
chmod +x ./kbs-client
we can also drop the Rust toolchain installation.
Cool - I'll give that a try in my fork
TestLibvirtKbsKeyRelease/KbsKeyReleasePod_test failed trying that approach: https://github.com/stevenhorsman/cloud-api-adaptor/actions/runs/11342464633/job/31542928046
When I get a chance I'll try to re-create and debug it locally.
Hmm, I've never tested it either; the whole kbs-client business is a bit of a black box to me, so the available binary might not work. An alternative would be to build the kbs-client in another job and pass it around as an artifact.
Upside: faster builds, because it can be built in parallel before the test, and it doesn't consume space on the test instance. Downside: checkout-kbs.sh needs untangling, since it performs both cloning of the kbs repo and building of the kbs-client.
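The artifact approach could look roughly like the fragment below. This is a hypothetical workflow sketch: the job and artifact names are invented, and the build steps are placeholders for what checkout-kbs.sh does today.

```yaml
# Hypothetical workflow fragment: build kbs-client once, hand it to the
# e2e job as an artifact so the test instance never holds the target tree.
jobs:
  build-kbs-client:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      # ... clone the kbs repo and build the client here ...
      - uses: actions/upload-artifact@v4
        with:
          name: kbs-client
          path: kbs/target/release/kbs-client

  libvirt-e2e:
    needs: build-kbs-client
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: kbs-client
      # ... run the e2e tests, with no Rust toolchain or target dir on disk ...
```

Because the two jobs run on separate runners, the build's disk and CPU cost never touches the test machine.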
Ah - the kbs-client is extracted to the wrong directory, as it's built to target/release. I'll try to fix that and see if that helps. I also want to understand why we aren't hitting errors in the e2e tests when they first try to use a non-existent file?
I also want to understand why we aren't hitting errors in the e2e tests first trying to use a non-existing file?
So we just ignore any errors thrown in the kbs client code. I thought I remembered fixing that, but https://github.com/confidential-containers/cloud-api-adaptor/pull/2055/files hasn't merged yet.
(T)he kbs-client is extracted to the wrong directory, as it's built to target/release.
I think we just move the expectation for the client to be in kbs directly, but I'm not sure if anyone would still want to build their own version of it.
At the moment in the libvirt testing we are using the default node size. This leads to the situation where each of the worker and control-plane nodes defaults to 4 vCPUs and 6 GB RAM:
In an ideal world we'd like to reduce our test footprint to fit inside the GitHub-hosted runner, which is a 4 vCPU, 16 GB machine.
Our peer pod VM currently uses 2 vCPUs and 8 GB of its own, which we are working on reducing, but the 8 vCPUs and 12 GB RAM that the kcli cluster uses is way too big. Reducing this shouldn't be too tricky, as I think it's just a matter of editing the default parameters we pass in kcli_cluster.sh, but the tricky bit is working out the minimum resources we can get away with without impacting the tests, so looking at the resource usage on an existing cluster might help there.
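For reference, a slimmed-down invocation might look like the sketch below. The parameter names (ctlplanes, workers, numcpus, memory) and the values are assumptions to illustrate the idea, not tested minimums; checking `kubectl top nodes` on an existing cluster first would give better numbers.

```shell
# Hypothetical reduced-footprint cluster: 1 control plane + 1 worker at
# 2 vCPUs / 4 GB each. Parameter names and values are assumptions, not
# verified against kcli or the tests' real requirements.
KCLI_ARGS="-P ctlplanes=1 -P workers=1 -P numcpus=2 -P memory=4096"

# Echoed here as a dry run; dropping the echo would actually create the cluster.
echo "kcli create kube generic ${KCLI_ARGS} peer-pods"
```

Together with the 2x8 peer pod VM, that would total roughly 6 vCPUs and 16 GB, much closer to fitting a 4x16 runner than today's 10 vCPUs and 20 GB.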