aurae-runtime / aurae

Distributed systems runtime daemon written in Rust.
https://aurae.io
Apache License 2.0

[cells] PID not found errors when stopping running executables #534

Open mccormickt opened 1 week ago

mccormickt commented 1 week ago

Attempting to stop a running executable fails on my (x86_64 Ubuntu 24.04) system. The behavior is reproducible both from the Rust tests and manually with aer:

Testing

Rust tests, configuring a new remote client for a nested auraed:

#[test_helpers_macros::shared_runtime_test]
async fn cells_start_stop_delete() {
    skip_if_not_root!("cells_start_stop_delete");
    skip_if_seccomp!("cells_start_stop_delete");

    let client = common::auraed_client().await;

    // Allocate a cell
    let cell_name = retry!(
        client
            .allocate(
                common::cells::CellServiceAllocateRequestBuilder::new().build()
            )
            .await
    )
    .unwrap()
    .into_inner()
    .cell_name;

    // Start the executable
    let req = common::cells::CellServiceStartRequestBuilder::new()
        .cell_name(cell_name.clone())
        .executable_name("aurae-exe".to_string())
        .build();
    let _ = retry!(client.start(req.clone()).await).unwrap().into_inner();

    // Stop the executable
    let _ = retry!(
        client
            .stop(proto::cells::CellServiceStopRequest {
                cell_name: Some(cell_name.clone()),
                executable_name: "aurae-exe".to_string(),
            })
            .await
    )
    .unwrap();

    // Delete the cell
    let _ = retry!(
        client
            .free(proto::cells::CellServiceFreeRequest {
                cell_name: cell_name.clone()
            })
            .await
    )
    .unwrap();
}
sudo -E cargo test -p auraed --test vms_start_must_start_vm_with_auraed -- --include-ignored
[...snip...]
2024-11-07T01:30:08.068934Z  INFO start: auraed::cells::cell_service::cell_service: CellService: start() executable=ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" } request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.069353Z  INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stdout request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.069445Z  INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stderr request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.103119Z  INFO stop: auraed::cells::cell_service::cell_service: CellService: stop() executable_name=ExecutableName("aurae-exe") request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
2024-11-07T01:30:08.103377Z ERROR stop: auraed::cells::cell_service::error: executable 'aurae-exe' failed to stop: No child processes (os error 10) request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
thread 'cells_start_stop_delete' panicked at auraed/tests/cell_list_must_list_allocated_cells_recursively.rs:172:6:
called `Result::unwrap()` on an `Err` value: Status { code: Internal, message: "executable 'aurae-exe' failed to stop: No child processes (os error 10)", metadata: MetadataMap { headers: {"content-type": "application/grpc", "content-length": "0", "date": "Thu, 07 Nov 2024 01:30:08 GMT"} }, source: None }
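
For reference: os error 10 on Linux is ECHILD, which waitpid(2) returns when the caller has no matching child to reap. The sketch below is illustrative only (not aurae's actual stop path); it shows how a stop routine built on kill + waitpid surfaces exactly this error when the target process is not a direct child of the caller, or has already been reaped:

use nix::errno::Errno;
use nix::sys::signal::{kill, Signal};
use nix::sys::wait::waitpid;
use nix::unistd::Pid;

/// Illustrative stop routine: signal a process, then reap it.
fn stop_process(pid: Pid) -> Result<(), Errno> {
    // Signal delivery can succeed even if `pid` is not our child...
    kill(pid, Signal::SIGKILL)?;

    // ...but waitpid only reaps our *own* children. If the process was
    // spawned by someone else (e.g. a nested auraed) or was already
    // reaped, this fails with ECHILD ("No child processes", errno 10).
    match waitpid(pid, None) {
        Ok(_status) => Ok(()),
        Err(Errno::ECHILD) => Err(Errno::ECHILD),
        Err(other) => Err(other),
    }
}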

Manually with aer and cloud-hypervisor

Install cloud-hypervisor and build guest image/kernel

sudo make /opt/aurae/cloud-hypervisor/cloud-hypervisor
sudo make build-guest-kernel
sudo make prepare-image

Run cloud-hypervisor with the auraed pid1 image

sudo cloud-hypervisor --kernel /var/lib/aurae/vm/kernel/vmlinux.bin \
    --disk path=/var/lib/aurae/vm/image/disk.raw \
    --cmdline "console=hvc0 root=/dev/vda1 rw" \
    --cpus boot=4 \
    --memory size=4096M \
    --net "tap=tap0,mac=aa:ae:00:00:00:01,id=eth0"

Retrieve the IPv6 zone ID (the interface index) for tap0 (13 in my case):

ip link show tap0                                               
13: tap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 06:66:42:a8:3f:e1 brd ff:ff:ff:ff:ff:ff

Configure the aurae client in ~/.aurae/config:

[system]
socket = "[fe80::2%13]:8080"

Verify cells run:

aer cell allocate sleeper
aer cell start --executable-command "sleep 9000" sleeper sleep-forever
aer cell list
aer cell stop sleeper sleep-forever
aer cell free sleeper
dmah42 commented 4 days ago

i'm going to see if i can create a cell service level test for this use case.

dmah42 commented 4 days ago

i have a test that should be testing this scenario, and it works. note though that it's not using nested cells, so i suspect this is where the issue is.

#535 is the current draft PR.

next step is to change it to use nested cells instead and see if it starts failing :)
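
A rough shape of that nested variant, dropped into the repro test above, might look like the following. Note this is an assumption: whether the allocate builder takes a cell_name, and whether nesting is expressed as a "parent/child" name, is not confirmed anywhere in this thread.

// Hypothetical sketch of the nested-cell variant. The `.cell_name(...)`
// call on the allocate builder and the "parent/child" naming scheme are
// assumptions -- adjust to the real API.
let parent_name = retry!(
    client
        .allocate(
            common::cells::CellServiceAllocateRequestBuilder::new().build()
        )
        .await
)
.unwrap()
.into_inner()
.cell_name;

// Allocate a child cell nested under the parent (hypothetical API).
let child_name = retry!(
    client
        .allocate(
            common::cells::CellServiceAllocateRequestBuilder::new()
                .cell_name(format!("{parent_name}/nested"))
                .build()
        )
        .await
)
.unwrap()
.into_inner()
.cell_name;

// Start/stop the executable in the *child* cell, as in the repro above.
let req = common::cells::CellServiceStartRequestBuilder::new()
    .cell_name(child_name.clone())
    .executable_name("aurae-exe".to_string())
    .build();
let _ = retry!(client.start(req.clone()).await).unwrap().into_inner();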

dmah42 commented 2 days ago

well well well.

2024-11-13T11:02:09.173289Z ERROR start_in_cell: auraed::cells::cell_service::error: cgroup 'ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec' exists on host, but is not controlled by auraed cell_name=CellName("ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-f94a9213-518d-40b6-8b66-71a1a67d0f03", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }
11:02:09 [ERROR] failed to start in cell: status: FailedPrecondition, message: "cgroup 'ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec' exists on host, but is not controlled by auraed", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Wed, 13 Nov 2024 11:02:09 GMT", "content-length": "0"} }
error: test failed, to rerun pass `-p auraed --lib`
dmah42 commented 2 days ago

something very odd is going on with the cell cache. i confirmed that we're inserting into the cache on allocate, but when we try to get the cell back out of the cache it isn't there, even though the cgroup exists.

out of time for debugging for now but i'll keep hacking on this later.
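
Reading the two errors together, the lookup appears to be shaped roughly like the sketch below. This is inferred from the log messages, not from aurae's source: a cache miss combined with an existing host cgroup is what maps to the "exists on host, but is not controlled by auraed" precondition failure.

use std::collections::HashMap;

// Illustrative types; aurae's real CellName/Cell/error types differ.
type CellName = String;
struct Cell;

enum CellsError {
    CellNotFound,
    // Rendered as: "cgroup '<name>' exists on host, but is not
    // controlled by auraed" -> gRPC FailedPrecondition.
    CgroupIsNotACell,
}

fn cgroup_exists_on_host(name: &CellName) -> bool {
    std::path::Path::new("/sys/fs/cgroup").join(name).exists()
}

// Sketch of the get path implied by the logs: a cache miss is
// interpreted as "not controlled by auraed", even when the cgroup
// itself is right there on the host.
fn get<'a>(
    cache: &'a HashMap<CellName, Cell>,
    name: &CellName,
) -> Result<&'a Cell, CellsError> {
    match cache.get(name) {
        Some(cell) => Ok(cell),
        None if cgroup_exists_on_host(name) => Err(CellsError::CgroupIsNotACell),
        None => Err(CellsError::CellNotFound),
    }
}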

dmah42 commented 2 days ago

confirmed the cell name is a key in the cache at the moment we call `self.cache.get`, but the call returns `None`.

dmah42 commented 2 days ago

leaving this here as a note to myself:

allocated ae-test-start-stop-in-cell
getting ae-test-start-stop-in-cell from cache
get cell ae-test-start-stop-in-cell
cgroup ae-test-start-stop-in-cell exists
cache size: 1
  CellName("ae-test-start-stop-in-cell")
    MATCH
2024-11-13T13:22:37.869166Z ERROR start_in_cell: auraed::cells::cell_service::cells::cells: get cell ae-test-start-stop-in-cell: cell not in cache cell_name=CellName("ae-test-start-stop-in-cell") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-start-stop-in-cell", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }
2024-11-13T13:22:37.869241Z ERROR start_in_cell: auraed::cells::cell_service::error: cgroup 'ae-test-start-stop-in-cell' exists on host, but is not controlled by auraed cell_name=CellName("ae-test-start-stop-in-cell") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-start-stop-in-cell", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }
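
A note for whoever picks this up: "iteration finds an equal key, but get returns None" is the textbook symptom of a key type whose Hash and PartialEq implementations disagree (or of a key whose hash-relevant state mutated after insertion). Whether CellName actually does this is unverified; the self-contained demo below only shows the mechanism:

use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Debug, Eq)]
struct Key {
    name: String,
    nested: bool,
}

// Equality considers only `name`...
impl PartialEq for Key {
    fn eq(&self, other: &Self) -> bool {
        self.name == other.name
    }
}

// ...but the hash also mixes in `nested`, breaking the invariant
// that equal keys must produce equal hashes.
impl Hash for Key {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.name.hash(state);
        self.nested.hash(state);
    }
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert(Key { name: "ae-test".into(), nested: true }, "cell");

    let probe = Key { name: "ae-test".into(), nested: false };

    // A linear scan compares with `==` and reports a MATCH...
    println!("scan finds key: {}", cache.keys().any(|k| *k == probe));

    // ...but `get` hashes the probe first, and the differing hash
    // (almost always) lands in the wrong bucket: the lookup misses.
    println!("get finds key:  {}", cache.get(&probe).is_some());
}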
dmah42 commented 2 days ago

i think the issue is somewhere between how we "start in cell" and how we "proxy if needed". i'm debating stripping out a lot of the complexity here as i'm not sure it's necessary.
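
For anyone following along, the split being described seems to be roughly the following. This shape is inferred from the comment and from the `cell_name: None` entries in the logs above, not from the source: a request that names a cell is proxied into the nested auraed inside that cell with the outer name stripped, while a request with no cell name is handled locally.

// Illustrative-only sketch of the "start in cell" / "proxy if needed"
// split (assumed shape, not aurae's actual code). Stripping the outer
// cell name before forwarding would explain why the nested logs above
// show `cell_name: None`.
struct StartRequest {
    cell_name: Option<String>,
    executable: String,
}

fn handle_start(mut req: StartRequest) {
    match req.cell_name.take() {
        // Proxy path: re-issue the request (now with cell_name: None)
        // to the auraed running inside `cell`.
        Some(cell) => println!("proxy into {cell}: start {}", req.executable),
        // Local path: start the executable in this auraed's context.
        None => println!("start locally: {}", req.executable),
    }
}

fn main() {
    handle_start(StartRequest {
        cell_name: Some("sleeper".into()),
        executable: "sleep 9000".into(),
    });
}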