Open mccormickt opened 1 week ago
i'm going to see if i can create a cell service level test for this use case.
i have a test that should be testing this scenario, and it works. note though that it's not using nested cells, so i suspect this is where the issue is.
next step is to change it to use nested cells instead and see if it starts failing :)
well well well.
2024-11-13T11:02:09.173289Z ERROR start_in_cell: auraed::cells::cell_service::error: cgroup 'ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec' exists on host, but is not controlled by auraed cell_name=CellName("ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-f94a9213-518d-40b6-8b66-71a1a67d0f03", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }
11:02:09 [ERROR] failed to start in cell: status: FailedPrecondition, message: "cgroup 'ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec' exists on host, but is not controlled by auraed", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Wed, 13 Nov 2024 11:02:09 GMT", "content-length": "0"} }
error: test failed, to rerun pass `-p auraed --lib`
something very odd is going on with the cell cache. i confirmed that we're inserting into the cache on allocate, but when we try to get the cell back out of the cache it isn't there, even though the cgroup exists.
out of time for debugging for now but i'll keep hacking on this later.
confirmed the cell name is a key in the cache at the moment we call self.cache.get, but this call is returning None.
leaving this here as a note to myself:
allocated ae-test-start-stop-in-cell
getting ae-test-start-stop-in-cell from cache
get cell ae-test-start-stop-in-cell
cgroup ae-test-start-stop-in-cell exists
cache size: 1
CellName("ae-test-start-stop-in-cell")
MATCH
2024-11-13T13:22:37.869166Z ERROR start_in_cell: auraed::cells::cell_service::cells::cells: get cell ae-test-start-stop-in-cell: cell not in cache cell_name=CellName("ae-test-start-stop-in-cell") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-start-stop-in-cell", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }
2024-11-13T13:22:37.869241Z ERROR start_in_cell: auraed::cells::cell_service::error: cgroup 'ae-test-start-stop-in-cell' exists on host, but is not controlled by auraed cell_name=CellName("ae-test-start-stop-in-cell") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-start-stop-in-cell", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }
i think the issue is somewhere between how we "start in cell" and how we "proxy if needed". i'm debating stripping out a lot of the complexity here as i'm not sure it's necessary.
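a toy model of one way the "exists on host, but is not controlled by auraed" error can arise, assuming the lookup is answered by a different cache instance than the one that did the allocation (for example a parent auraed instead of the nested one). this is only a sketch of that hypothesis, not the auraed implementation: `CellCache`, `Lookup`, and the shared `host_cgroups` set are hypothetical names invented for illustration.

```rust
use std::collections::HashSet;

/// Outcome of looking up a cell by name (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum Lookup {
    Found,
    /// The cgroup exists on the host, but this cache never allocated it.
    ExistsButNotControlled,
    NotFound,
}

/// Toy stand-in for the cell cache held by a single auraed instance.
#[derive(Default)]
struct CellCache {
    cells: HashSet<String>,
}

impl CellCache {
    /// Record the cell in this cache and create its (simulated) cgroup on the host.
    fn allocate(&mut self, name: &str, host_cgroups: &mut HashSet<String>) {
        self.cells.insert(name.to_string());
        host_cgroups.insert(name.to_string());
    }

    /// Look the cell up in this cache, falling back to a host cgroup check.
    fn get(&self, name: &str, host_cgroups: &HashSet<String>) -> Lookup {
        if self.cells.contains(name) {
            Lookup::Found
        } else if host_cgroups.contains(name) {
            Lookup::ExistsButNotControlled
        } else {
            Lookup::NotFound
        }
    }
}

fn main() {
    // Simulated view of the cgroups that exist on the host.
    let mut host_cgroups = HashSet::new();
    let mut parent = CellCache::default();
    let mut nested = CellCache::default();

    parent.allocate("ae-test-parent", &mut host_cgroups);
    // The nested instance allocates the child cell in its own cache.
    nested.allocate("ae-test-parent/ae-test-child", &mut host_cgroups);

    // Asking the wrong instance finds the cgroup but not the cache entry.
    println!("{:?}", parent.get("ae-test-parent/ae-test-child", &host_cgroups));
    println!("{:?}", nested.get("ae-test-parent/ae-test-child", &host_cgroups));
}
```

if the real failure looks like this, the fix would be in deciding which instance handles the request rather than in the cache itself, but that is still unconfirmed.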
Attempting to stop a running executable seems to have the following behavior on my (x86_64 Ubuntu 24.04) system:
A sh -c <executable> process is created and has its PID tracked in the Executables cache.
The <executable> process is not tracked.
PID not found
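For reference, a minimal sketch of how a PID obtained from a `sh -c` spawn can be made to refer to the executable itself: prefixing the command with the shell builtin `exec` makes the shell replace itself instead of forking a child. This assumes a Linux host with `/proc` and a `sleep` binary on the PATH. The helper names `spawn_with_exec`, `process_name`, and `wait_for_name` are hypothetical, and whether this fits the way auraed builds its command is not shown in this thread.

```rust
use std::fs;
use std::io;
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::Duration;

/// Spawn `command` through `sh -c`, prefixed with `exec` so that the shell
/// replaces itself with the command and the child PID stays the same.
fn spawn_with_exec(command: &str) -> io::Result<Child> {
    Command::new("sh")
        .arg("-c")
        .arg(format!("exec {command}"))
        .spawn()
}

/// Read the short process name from /proc/<pid>/comm (Linux only).
fn process_name(pid: u32) -> io::Result<String> {
    Ok(fs::read_to_string(format!("/proc/{pid}/comm"))?
        .trim()
        .to_string())
}

/// Poll until the process name equals `expected`, giving up after about two seconds.
/// Polling is needed because the shell takes a moment to exec the command.
fn wait_for_name(pid: u32, expected: &str) -> bool {
    for _ in 0..100 {
        if process_name(pid).map(|name| name == expected).unwrap_or(false) {
            return true;
        }
        sleep(Duration::from_millis(20));
    }
    false
}

fn main() -> io::Result<()> {
    let mut child = spawn_with_exec("sleep 30")?;
    let pid = child.id();

    // With `exec`, the PID returned by spawn should end up naming `sleep`, not `sh`.
    let tracked_is_sleep = wait_for_name(pid, "sleep");
    println!("pid {pid} is the sleep process: {tracked_is_sleep}");

    // Stopping the tracked PID now stops the executable itself.
    child.kill()?;
    child.wait()?;
    Ok(())
}
```

An alternative that avoids relying on `exec` would be to place the child in its own process group and signal the whole group on stop. That is a different design and is not sketched here.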
Testing
Rust
Rust tests, configuring new remote client for nested auraed
Manually with aer and cloud-hypervisor
Install cloud-hypervisor and build guest image/kernel
Run cloud-hypervisor with the auraed pid1 image
Retrieve zone ID from tap0 (13 in my case):
Configure aurae client config in ~/.aurae/config:
Verify cells run: