aurae-runtime / aurae

Distributed systems runtime daemon written in Rust.
https://aurae.io
Apache License 2.0
1.86k stars 90 forks source link

Fix: Setup cgroupfs to run cells when running as PID1 #533

Closed mccormickt closed 2 weeks ago

mccormickt commented 1 month ago

This PR properly sets up the file mounts and directories, such as cgroupfs, for running Cells as the PID1 runtime.

Testing

Rust tests, configuring new remote client for nested auraed

sudo -E cargo test -p auraed --test vms_start_must_start_vm_with_auraed -- --include-ignored

Manually with aer and cloud-hypervisor

  1. Install cloud-hypervisor and build guest image/kernel
    sudo make /opt/aurae/cloud-hypervisor/cloud-hypervisor
    sudo make build-guest-kernel
    sudo make prepare-image
  2. Run cloud-hypervisor with the auraed pid1 image
    sudo cloud-hypervisor --kernel /var/lib/aurae/vm/kernel/vmlinux.bin \                                 
    --disk path=/var/lib/aurae/vm/image/disk.raw \                                                                                   
    --cmdline "console=hvc0 root=/dev/vda1 rw" \                                                                                     
    --cpus boot=4 \                                                                                                                  
    --memory size=4096M \                                                                                                            
    --net "tap=tap0,mac=aa:ae:00:00:00:01,id=eth0"
  3. Retrieve zone ID from tap0 (13 in my case):
    ip link show tap0                                               
    13: tap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 06:66:42:a8:3f:e1 brd ff:ff:ff:ff:ff:ff
  4. Configure aurae client config in ~/.aurae/config:
    [system]
    socket = "[fe80::2%13]:8080"
  5. Verify cells run:
    aer cell allocate sleeper
    aer cell start --executable-command "sleep 9000" sleeper sleep-forever
    aer cell list
    aer cell stop sleeper sleep-forever
    aer cell free sleeper
mccormickt commented 1 month ago

Currently, cells aren't able to be freed and executables aren't found when attempted to be stopped. Its hard to tell if it can't find the process, or the process has exited too soon and isn't cleaned up. :thinking:

dmah42 commented 1 month ago

ah i didn't know about that.. i think that should probably be something that doesn't end in case we end up with flaky issues in tests. it means every test needs to clean up but that's probably the right thing to do. do you think that change can go in here?

mccormickt commented 1 month ago

ah i didn't know about that.. i think that should probably be something that doesn't end in case we end up with flaky issues in tests. it means every test needs to clean up but that's probably the right thing to do. do you think that change can go in here?

Sure, I don't think that should be too much overhead. - To be clear, the request is to remove the default randomized executable name and add names & cell cleanup steps to relevant tests?

dmah42 commented 1 month ago

ah i didn't know about that.. i think that should probably be something that doesn't end in case we end up with flaky issues in tests. it means every test needs to clean up but that's probably the right thing to do. do you think that change can go in here?

Sure, I don't think that should be too much overhead. - To be clear, the request is to remove the default randomized executable name and add names & cell cleanup steps to relevant tests?

i think the randomized executable name is fine, but the executable that's run should probably not be sleep 400. we should run something that continues to run until it's stopped i think. wdyt?

dmah42 commented 1 month ago

did it help fix the issues?

mccormickt commented 1 month ago

did it help fix the issues?

Sorry for the intermittent communication, but a small update:

Still having issues where the PID is being killed but its child processes not stopped, while still receiving a "PID not found" error. I believe I had isolated it to here where perhaps the process being waited on has already exited, but I haven't had some time to reproduce on a fresh system.

mccormickt commented 3 weeks ago

I haven't quite pinned down what is causing this, but it seems to be outside of this PR.

I have been able to reproduce the same test failing on my (x86_64 Ubuntu 24.04) system on multiple past commits as well with the same "process not found" errors:

#[test_helpers_macros::shared_runtime_test]
async fn cells_start_stop_delete() {
    skip_if_not_root!("cells_start_stop_delete");
    skip_if_seccomp!("cells_start_stop_delete");

    let client = common::auraed_client().await;

    // Allocate a cell
    let cell_name = retry!(
        client
            .allocate(
                common::cells::CellServiceAllocateRequestBuilder::new().build()
            )
            .await
    )
    .unwrap()
    .into_inner()
    .cell_name;

    // Start the executable
    let req = common::cells::CellServiceStartRequestBuilder::new()
        .cell_name(cell_name.clone())
        .executable_name("aurae-exe".to_string())
        .build();
    let _ = retry!(client.start(req.clone()).await).unwrap().into_inner();

    // Stop the executable
    let _ = retry!(
        client
            .stop(proto::cells::CellServiceStopRequest {
                cell_name: Some(cell_name.clone()),
                executable_name: "aurae-exe".to_string(),
            })
            .await
    )
    .unwrap();

    // Delete the cell
    let _ = retry!(
        client
            .free(proto::cells::CellServiceFreeRequest {
                cell_name: cell_name.clone()
            })
            .await
    )
    .unwrap();
}

Output:

2024-11-07T01:30:08.046239Z  INFO auraed::init::system_runtimes::cell_system_runtime: Running as a cell
2024-11-07T01:30:08.046621Z  INFO auraed::init::system_runtimes: User Access Socket Created: /var/run/aurae/aurae-d11742bb-c96f-4
8b9-95b8-afbe6747006b.sock                                                                                                       
2024-11-07T01:30:08.051181Z  INFO allocate: auraed::cells::cell_service::cells::cell: Attach nested Auraed pid 1668285 to cgroup 
ae-e2e-742727f8-938a-4cc1-8a33-d28e7e9465eb request=ValidatedCellServiceAllocateRequest { cell: ValidatedCell { name: CellName("a
e-e2e-742727f8-938a-4cc1-8a33-d28e7e9465eb"), cpu: None, cpuset: None, memory: None, isolate_process: false, isolate_network: fal
se } }                                                          
2024-11-07T01:30:08.068934Z  INFO start: auraed::cells::cell_service::cell_service: CellService: start() executable=ValidatedExec
utable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" } request=ValidatedCellServiceStartR
equest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description:
 "description" }, uid: None, gid: None }                                                                                         
2024-11-07T01:30:08.069353Z  INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stdout request=Val
idatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "s
leep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.069445Z  INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stderr request=Val
idatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "s
leep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.103119Z  INFO stop: auraed::cells::cell_service::cell_service: CellService: stop() executable_name=Executable
Name("aurae-exe") request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
2024-11-07T01:30:08.103377Z ERROR stop: auraed::cells::cell_service::error: executable 'aurae-exe' failed to stop: No child proce
sses (os error 10) request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
thread 'cells_start_stop_delete' panicked at auraed/tests/cell_list_must_list_allocated_cells_recursively.rs:172:6:             
called `Result::unwrap()` on an `Err` value: Status { code: Internal, message: "executable 'aurae-exe' failed to stop: No child p
rocesses (os error 10)", metadata: MetadataMap { headers: {"content-type": "application/grpc", "content-length": "0", "date": "Th
u, 07 Nov 2024 01:30:08 GMT"} }, source: None }
dmah42 commented 2 weeks ago

thanks for digging in to the issue.