confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach
Apache License 2.0

test/e2e: libvirt: Try and reduce the resource usage of the kcli cluster #2117

Open stevenhorsman opened 1 week ago

stevenhorsman commented 1 week ago

At the moment, the libvirt testing uses the default node size. This means that the worker and control-plane nodes each default to 4 vCPU and 6GB RAM:

# kcli info vm peer-pods-worker-0
name: peer-pods-worker-0
id: a5ce3795-a67c-4daf-854d-62df6736081b
creationdate: 09-10-2024 10:35
status: up
autostart: False
image: ubuntu2204
user: ubuntu
plan: peer-pods
profile: kvirt
numcpus: 4
memory: 6144

In an ideal world we'd like to reduce our test footprint to fit inside the GitHub-hosted runner, which is a 4 vCPU, 16GB machine.

Our peer pod VM currently uses 2 vCPU and 8GB RAM of its own, which we are working on reducing, but the 8 vCPU and 12GB RAM that the kcli cluster uses is way too big. Actually reducing this shouldn't be too tricky, as I think it's just a matter of editing the default parameters we pass in kcli_cluster.sh. The tricky bit is working out the minimum resources we can get away with without impacting the tests, so looking at the resource usage on an existing cluster might help there.

stevenhorsman commented 1 week ago

@wainersm, @mkulke do you know the current size of the az-ubuntu-2204 runners (which I think are created by garm), as a reference point?

mkulke commented 1 week ago

https://cloudprice.net/vm/Standard_D4s_v4

Currently we run the tests on a 4 vCPU, 16GB RAM machine.

stevenhorsman commented 1 week ago

https://cloudprice.net/vm/Standard_D4s_v4

That's very interesting: if it's a 4x16 machine then it's the same size as the GitHub-hosted runners (and that might explain some of the libvirt CI flakiness, as we try and squeeze 10 vCPUs and 20GB RAM out of a 4x16 box!). Maybe we can try out libvirt e2e on a GitHub-hosted runner now... (I'll be back with results)
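
The arithmetic behind that estimate can be sketched in shell. The figures are the ones quoted in this thread (two kcli nodes at 4 vCPU and 6GB each, one peer pod VM at 2 vCPU and 8GB, and a 4 vCPU, 16GB runner); the variable names are purely illustrative and do not come from kcli_cluster.sh.

```shell
#!/usr/bin/env bash
# Sizes quoted in this thread; variable names are illustrative only.
node_count=2;   node_vcpus=4;   node_mem_gb=6    # kcli worker + control-plane
podvm_vcpus=2;  podvm_mem_gb=8                   # peer pod VM
runner_vcpus=4; runner_mem_gb=16                 # 4x16 runner

total_vcpus=$(( node_count * node_vcpus + podvm_vcpus ))
total_mem_gb=$(( node_count * node_mem_gb + podvm_mem_gb ))

echo "requested: ${total_vcpus} vCPU, ${total_mem_gb}GB RAM"
echo "available: ${runner_vcpus} vCPU, ${runner_mem_gb}GB RAM"
echo "over by:   $(( total_vcpus - runner_vcpus )) vCPU, $(( total_mem_gb - runner_mem_gb ))GB RAM"
```

Changing the node sizes at the top shows quickly how far a candidate configuration is from fitting on the runner, although vCPUs can be overcommitted in practice while memory mostly cannot.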

stevenhorsman commented 1 week ago

It failed (https://github.com/stevenhorsman/cloud-api-adaptor/actions/runs/11331308718/job/31511037253) with:

time="2024-10-14T16:25:22Z" level=info msg="Installing peerpod-ctrl"
F1014 16:25:22.856817   22292 env.go:369] Setup failure: exit status 2
Error: No space left on device : '/home/runner/runners/2.320.0/_diag/pages/ca098555-ffe8-4649-a147-31a3909d57d3_9f5eba70-d33b-5377-96f2-d94c82946629_1.log'

The GH runners have 14GB of storage, which may not be enough, so that might be another path to investigate.

mkulke commented 1 week ago

Currently we build the kbs client with Rust, which can produce a surprisingly large target folder. We could either clean that up or download the kbs-client via oras?
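
For the "clean that up" option, a minimal sketch of measuring and then removing a build directory is shown below. The function names are made up for illustration, and the directory passed in is an assumption; it would need to point at wherever cargo actually writes its output in the workflow.

```shell
#!/usr/bin/env bash
# Print the disk usage of a directory in KiB (0 if it does not exist).
dir_size_kb() {
    if [ -d "$1" ]; then
        du -sk "$1" | cut -f1
    else
        echo 0
    fi
}

# Remove a build directory and report roughly how much space was reclaimed.
# The path is supplied by the caller; nothing here is specific to the repo.
clean_build_dir() {
    before=$(dir_size_kb "$1")
    rm -rf -- "$1"
    echo "reclaimed ${before} KiB from $1"
}
```

Calling `dir_size_kb` on the build directory before deciding anything would also confirm whether the Rust build really is what fills the runner's disk.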

stevenhorsman commented 1 week ago

download the kbs-client via oras

Yeah, I think that would be great. I'll try out the e2e tests without the KBS section and see if that helps and also re-run it once the caching PR is merged 😃

stevenhorsman commented 6 days ago

Ooh - cutting out the KBS deployment and test meant the gh-runner tests worked: https://github.com/stevenhorsman/cloud-api-adaptor/actions/runs/11331632234/job/31512084686 😃

mkulke commented 6 days ago

that's great. if we change this line:

https://github.com/confidential-containers/cloud-api-adaptor/blob/7a784255db24d55578783e4d025f9aea114d5819/src/cloud-api-adaptor/test/utils/checkout_kbs.sh#L28

to

oras pull "ghcr.io/confidential-containers/staged-images/kbs-client:sample_only-x86_64-linux-gnu-${KBS_SHA}"
chmod +x ./kbs-client

we can also drop the rust toolchain installation
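
A slightly more defensive version of that replacement might look like the sketch below. The image reference is the one from the comment above; the function names and the checks around the pull are assumptions, and whether the published binary works with the tests is untested here.

```shell
#!/usr/bin/env bash
# Build the image reference from the comment above for a given KBS commit.
kbs_client_ref() {
    printf '%s\n' "ghcr.io/confidential-containers/staged-images/kbs-client:sample_only-x86_64-linux-gnu-$1"
}

# Pull the client into the current directory and fail loudly if anything is missing.
# $1: the KBS commit SHA (KBS_SHA in checkout_kbs.sh).
fetch_kbs_client() {
    if ! command -v oras >/dev/null 2>&1; then
        echo "oras is not installed" >&2
        return 1
    fi
    oras pull "$(kbs_client_ref "$1")" || return 1
    if [ ! -f ./kbs-client ]; then
        echo "pull succeeded but ./kbs-client is missing" >&2
        return 1
    fi
    chmod +x ./kbs-client
}
```

It would be called as `fetch_kbs_client "${KBS_SHA}"`; returning non-zero on any failure means a missing binary stops the setup step instead of surfacing later in the tests.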

stevenhorsman commented 6 days ago

Cool - I'll give that a try in my fork

stevenhorsman commented 6 days ago

TestLibvirtKbsKeyRelease/KbsKeyReleasePod_test failed trying that approach: https://github.com/stevenhorsman/cloud-api-adaptor/actions/runs/11342464633/job/31542928046

When I get a chance I'll try and re-create and debug locally

mkulke commented 6 days ago

hmm I've never tested it either, the whole kbs-client business is a bit of a black box to me, so the available binary might not work. an alternative would be to build the kbs-client in another job and pass it around as an artifact.

upside: faster builds, b/c it can be built in parallel before the test; doesn't consume space on the test instance
downside: checkout_kbs.sh needs untangling, since it performs both cloning of the kbs repo and building of the kbs client

stevenhorsman commented 6 days ago

Ah - the kbs client is extracted to the wrong directory, as it's normally built to target/release. I'll try and fix that and see if that helps. I also want to understand why the e2e tests aren't hitting errors earlier when they try to use a non-existent file.
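
One way to reconcile the two locations, sketched under assumptions, is to move the pulled binary to wherever the tests look for it. The function name and both paths in the usage line are hypothetical and not taken from the repository.

```shell
#!/usr/bin/env bash
# Move a downloaded kbs-client into the directory the tests expect.
# $1: path of the downloaded binary, $2: destination directory (both hypothetical).
place_kbs_client() {
    src=$1
    dest_dir=$2
    if [ ! -f "$src" ]; then
        echo "kbs-client not found at $src" >&2
        return 1
    fi
    mkdir -p "$dest_dir"
    mv -- "$src" "$dest_dir/kbs-client"
    chmod +x "$dest_dir/kbs-client"
}
```

For example, `place_kbs_client ./kbs-client ./target/release` would mimic the layout a local cargo build leaves behind, so the test code would not need to change.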

stevenhorsman commented 6 days ago

I also want to understand why the e2e tests aren't hitting errors earlier when they try to use a non-existent file.

So we just ignore any errors thrown in the kbs client code. I thought I remembered fixing that, but https://github.com/confidential-containers/cloud-api-adaptor/pull/2055/files hasn't merged yet.

(T)he kbs client is extracted to the wrong directory, as it's normally built to target/release.

I think we could just change the expected location of the client to be the kbs directory itself, but I'm not sure if anyone would still want to build their own version of it.