eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Workload create operation does not block on podman cli #192

Closed by inf17101 5 months ago

inf17101 commented 6 months ago

If Ankaios creates a workload through the podman cli (podman run) and receives a delete for that workload shortly afterwards, the delete is executed before the workload has started. As a result, the Ankaios agent executes the delete operation too early.

In addition, it must be checked whether podman kube and the other runtimes show the same non-blocking behavior.

Current Behavior

The implementation of workload creation in PodmanCli does not block (https://github.com/eclipse-ankaios/ankaios/blob/main/agent/src/runtime_connectors/podman_cli.rs#L284). In detached mode, podman run immediately returns podman's internal workload id.

If an image download is still running in the background and Ankaios receives a delete right after the create, the delete is executed immediately, because the create operation has already finished inside the WorkloadControlLoop (podman run returns immediately in detached mode). The delete operation removes the control interface while the create is still running in podman. When podman has finished the image pull, it cannot start the workload because the control interface no longer exists.

In the end the workload is not started, but the user sees a confusing error message (no such file or directory) in the logs.
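The race described above can be modelled with a small, self-contained sketch. It is not Ankaios code: `StartResult` and `fast_delete_after_detached_create` are hypothetical names, an `AtomicBool` stands in for the control interface on disk, and a channel holds back the simulated image pull until the delete has run.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

/// Hypothetical outcome of the background start attempt.
#[derive(Debug, PartialEq)]
pub enum StartResult {
    Started,
    ControlInterfaceMissing,
}

/// Simulates a create that returns immediately, followed by a fast delete.
pub fn fast_delete_after_detached_create() -> StartResult {
    // Stand-in for the control interface directory (true = exists).
    let control_interface = Arc::new(AtomicBool::new(true));
    // Signals the end of the simulated image pull.
    let (pull_done_tx, pull_done_rx) = mpsc::channel::<()>();

    // "create": returns at once, the pull and start continue in the background.
    let ci = Arc::clone(&control_interface);
    let background_start = thread::spawn(move || {
        pull_done_rx
            .recv()
            .expect("sender dropped before the pull finished");
        // The start only succeeds if the control interface still exists.
        if ci.load(Ordering::SeqCst) {
            StartResult::Started
        } else {
            StartResult::ControlInterfaceMissing
        }
    });

    // "delete": runs right away because the create has already returned.
    control_interface.store(false, Ordering::SeqCst);

    // The image pull finishes only after the delete.
    pull_done_tx
        .send(())
        .expect("background thread exited early");
    background_start.join().expect("background thread panicked")
}
```

Calling `fast_delete_after_detached_create()` yields `StartResult::ControlInterfaceMissing`, the analogue of the "no such file or directory" error in the logs.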

Expected Behavior

If a delete is received, the workload shall be deleted and not started. The create operation shall block, so that subsequent delete commands are correctly enqueued in the WorkloadControlLoop and only executed after the create has completely finished.
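A minimal sketch of the expected ordering, assuming a single loop that drains a command queue one command at a time. The names `Command`, `Event` and `run_control_loop` are hypothetical and do not reflect the real WorkloadControlLoop API; the sleep stands in for a blocking create (image pull plus start).

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical commands sent to the control loop.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Command {
    Create,
    Delete,
}

/// Hypothetical trace of what the loop did, in order.
#[derive(Debug, PartialEq)]
pub enum Event {
    CreateStarted,
    CreateFinished,
    DeleteExecuted,
}

/// Processes commands strictly one after another until all senders are dropped.
pub fn run_control_loop(commands: mpsc::Receiver<Command>) -> Vec<Event> {
    let mut events = Vec::new();
    for command in commands {
        match command {
            Command::Create => {
                events.push(Event::CreateStarted);
                // Stand-in for a blocking create: the loop does not pick up
                // the next command until this has returned.
                thread::sleep(Duration::from_millis(50));
                events.push(Event::CreateFinished);
            }
            Command::Delete => events.push(Event::DeleteExecuted),
        }
    }
    events
}
```

Because the create blocks inside the loop, a delete sent while the create is in progress simply waits in the channel and is handled only after `CreateFinished` has been recorded.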

Steps to Reproduce

  1. Make sure the images mentioned in the startup state below are not available locally.
  2. Start the Ankaios server with the startupState.yaml mentioned below.
  3. Start the Ankaios agent agent_A.
  4. Immediately after the start, send a delete command for the backend workload: ./ank delete workload backend
  5. The workload is not started because the control interface has been deleted and cannot be mounted later when podman tries to start the workload.
  6. The user sees a confusing error message (no such file or directory for the control interface). See screenshot below.

Context (Environment)

Ankaios agent Podman CLI, podman 4.9.2

Logs

[Screenshot: no such file or directory error for the control interface]

Additional Information

startupState.yaml

workloads:
  frontend:
    runtime: podman
    agent: agent_A
    restart: true
    updateStrategy: AT_MOST_ONCE
    dependencies:
      backend: ADD_COND_RUNNING
    tags:
      - key: owner
        value: Ankaios team
    runtimeConfig: |
      image: docker.io/nginx:latest
      commandOptions: ["-p", "8083:80"]
  backend:
    runtime: podman
    agent: agent_A
    restart: true
    updateStrategy: AT_MOST_ONCE
    tags:
      - key: owner
        value: Ankaios team
    runtimeConfig: |
      image: docker.io/nginx:latest
      commandOptions: ["-p", "8082:80"]

Final result

No changes are currently required. The podman run command seems to block even in detached mode. Please see the last comments for more details.

inf17101 commented 5 months ago

The latest podman docs (v5.0.1), https://docs.podman.io/en/v5.0.1/markdown/podman-run.1.html, do not explicitly state that podman run --detach blocks until the container has actually started.

However, when testing Ankaios with a run command, an immediate delete, and the same run command again, the behavior and the end result appear to be correct.

Commands:

Ank server:

./ank-server

Ank agent:

./ank-agent --name agent_A

Ank cli:

./ank run workload nginx_1 --runtime podman --agent agent_A --config $'image: docker.io/nginx:latest\ncommandOptions: ["-p", "8081:80"]'; ./ank delete workload nginx_1; ./ank run workload nginx_1 --runtime podman --agent agent_A --config $'image: docker.io/nginx:latest\ncommandOptions: ["-p", "8081:80"]';

It seems like the command is blocking.
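One way to cross-check this observation outside of Ankaios is to time the command directly. The helper below is a sketch using only the Rust standard library; `time_command` is a hypothetical name and not part of Ankaios.

```rust
use std::io;
use std::process::{Command, Stdio};
use std::time::{Duration, Instant};

/// Runs a program, waits for it to exit, and returns how long that took
/// together with whether it exited successfully.
pub fn time_command(program: &str, args: &[&str]) -> io::Result<(Duration, bool)> {
    let started = Instant::now();
    let status = Command::new(program)
        .args(args)
        // Discard output; only duration and exit status matter here.
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()?;
    Ok((started.elapsed(), status.success()))
}
```

For example, with the image removed locally beforehand, `time_command("podman", &["run", "--detach", "-p", "8081:80", "docker.io/nginx:latest"])` should take roughly as long as the image pull if the command blocks, and return almost instantly if it does not.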