Closed mccormickt closed 2 weeks ago
Currently, cells aren't able to be freed and executables aren't found when attempted to be stopped. Its hard to tell if it can't find the process, or the process has exited too soon and isn't cleaned up. :thinking:
ah i didn't know about that.. i think that should probably be something that doesn't end in case we end up with flaky issues in tests. it means every test needs to clean up but that's probably the right thing to do. do you think that change can go in here?
ah i didn't know about that.. i think that should probably be something that doesn't end in case we end up with flaky issues in tests. it means every test needs to clean up but that's probably the right thing to do. do you think that change can go in here?
Sure, I don't think that should be too much overhead. - To be clear, the request is to remove the default randomized executable name and add names & cell cleanup steps to relevant tests?
ah i didn't know about that.. i think that should probably be something that doesn't end in case we end up with flaky issues in tests. it means every test needs to clean up but that's probably the right thing to do. do you think that change can go in here?
Sure, I don't think that should be too much overhead. - To be clear, the request is to remove the default randomized executable name and add names & cell cleanup steps to relevant tests?
i think the randomized executable name is fine, but the executable that's run should probably not be sleep 400
. we should run something that continues to run until it's stopped i think. wdyt?
did it help fix the issues?
did it help fix the issues?
Sorry for the intermittent communication, but a small update:
Still having issues where the PID is being killed but its child processes not stopped, while still receiving a "PID not found" error. I believe I had isolated it to here where perhaps the process being waited on has already exited, but I haven't had some time to reproduce on a fresh system.
I haven't quite pinned down what is causing this, but it seems to be outside of this PR.
I have been able to reproduce the same test failing on my (x86_64 Ubuntu 24.04) system on multiple past commits as well with the same "process not found" errors:
#[test_helpers_macros::shared_runtime_test]
async fn cells_start_stop_delete() {
skip_if_not_root!("cells_start_stop_delete");
skip_if_seccomp!("cells_start_stop_delete");
let client = common::auraed_client().await;
// Allocate a cell
let cell_name = retry!(
client
.allocate(
common::cells::CellServiceAllocateRequestBuilder::new().build()
)
.await
)
.unwrap()
.into_inner()
.cell_name;
// Start the executable
let req = common::cells::CellServiceStartRequestBuilder::new()
.cell_name(cell_name.clone())
.executable_name("aurae-exe".to_string())
.build();
let _ = retry!(client.start(req.clone()).await).unwrap().into_inner();
// Stop the executable
let _ = retry!(
client
.stop(proto::cells::CellServiceStopRequest {
cell_name: Some(cell_name.clone()),
executable_name: "aurae-exe".to_string(),
})
.await
)
.unwrap();
// Delete the cell
let _ = retry!(
client
.free(proto::cells::CellServiceFreeRequest {
cell_name: cell_name.clone()
})
.await
)
.unwrap();
}
Output:
2024-11-07T01:30:08.046239Z INFO auraed::init::system_runtimes::cell_system_runtime: Running as a cell
2024-11-07T01:30:08.046621Z INFO auraed::init::system_runtimes: User Access Socket Created: /var/run/aurae/aurae-d11742bb-c96f-4
8b9-95b8-afbe6747006b.sock
2024-11-07T01:30:08.051181Z INFO allocate: auraed::cells::cell_service::cells::cell: Attach nested Auraed pid 1668285 to cgroup
ae-e2e-742727f8-938a-4cc1-8a33-d28e7e9465eb request=ValidatedCellServiceAllocateRequest { cell: ValidatedCell { name: CellName("a
e-e2e-742727f8-938a-4cc1-8a33-d28e7e9465eb"), cpu: None, cpuset: None, memory: None, isolate_process: false, isolate_network: fal
se } }
2024-11-07T01:30:08.068934Z INFO start: auraed::cells::cell_service::cell_service: CellService: start() executable=ValidatedExec
utable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" } request=ValidatedCellServiceStartR
equest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description:
"description" }, uid: None, gid: None }
2024-11-07T01:30:08.069353Z INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stdout request=Val
idatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "s
leep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.069445Z INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stderr request=Val
idatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "s
leep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.103119Z INFO stop: auraed::cells::cell_service::cell_service: CellService: stop() executable_name=Executable
Name("aurae-exe") request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
2024-11-07T01:30:08.103377Z ERROR stop: auraed::cells::cell_service::error: executable 'aurae-exe' failed to stop: No child proce
sses (os error 10) request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
thread 'cells_start_stop_delete' panicked at auraed/tests/cell_list_must_list_allocated_cells_recursively.rs:172:6:
called `Result::unwrap()` on an `Err` value: Status { code: Internal, message: "executable 'aurae-exe' failed to stop: No child p
rocesses (os error 10)", metadata: MetadataMap { headers: {"content-type": "application/grpc", "content-length": "0", "date": "Th
u, 07 Nov 2024 01:30:08 GMT"} }, source: None }
thanks for digging in to the issue.
This PR properly sets up the file mounts and directories, such as
cgroupfs
, for running Cells as the PID1 runtime.Testing
Rust tests, configuring new remote client for nested auraed
sudo -E cargo test -p auraed --test vms_start_must_start_vm_with_auraed -- --include-ignored
Manually with
aer
andcloud-hypervisor
auraed
pid1 imagetap0
(13 in my case):~/.aurae/config
: