eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Re-use existing container bundles on device re-start #260

Open windsource opened 6 months ago

windsource commented 6 months ago

Description

Currently, as implemented in #5, existing containers in status "exited" are deleted and re-created on device restart. Measurements have shown that creating a container bundle takes about 50% of the whole startup time, so starting an existing container bundle would save about 50% of the time compared to deleting and re-creating the bundle.

As startup time is crucial for certain automotive applications, this issue should collect the pros and cons of re-using existing container bundles in order to speed up startup.

Currently, the procedure on agent start is roughly:

  1. Call podman ps -a to get all existing containers
  2. For each container matching the state
    1. If it is running keep it running
    2. If it is exited and shall be running call podman rm ... followed by podman run.

The proposed change is:

  1. Call podman ps -a to get all existing containers
  2. For each container matching the state
    1. If it is running keep it running
    2. If it is exited and shall be running call podman start ....
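
The difference between the two procedures can be sketched as a small decision function. This is only an illustration under assumptions: the names ContainerState, Action, plan_action and the reuse_bundle flag are hypothetical and do not exist in the Ankaios code base; the podman commands in the comments are the ones listed above.

```rust
/// State of an existing container as reported by `podman ps -a`
/// (only the two states discussed in this issue are modelled).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ContainerState {
    Running,
    Exited,
}

/// Action the agent takes for an existing container that shall be running.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    /// Leave the running container untouched.
    KeepRunning,
    /// Current behaviour: `podman rm ...` followed by `podman run`.
    RemoveAndRun,
    /// Proposed behaviour: `podman start ...` on the existing bundle.
    Start,
}

/// Hypothetical helper: `reuse_bundle == false` models the current
/// procedure, `reuse_bundle == true` the proposed one.
fn plan_action(state: ContainerState, reuse_bundle: bool) -> Action {
    match (state, reuse_bundle) {
        (ContainerState::Running, _) => Action::KeepRunning,
        (ContainerState::Exited, false) => Action::RemoveAndRun,
        (ContainerState::Exited, true) => Action::Start,
    }
}
```

Only the "exited" branch changes: plan_action(ContainerState::Exited, false) yields Action::RemoveAndRun, while plan_action(ContainerState::Exited, true) yields Action::Start. Running containers are kept in both variants.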

Pros:

Cons:

Goals

Reduce the startup time of containers when the container bundle already exists on disk.

Final result

Summary

To be filled when the final solution is sketched.

Tasks

inf17101 commented 5 months ago

With the fix introduced in #261, workload restarts after a full device restart might be delayed. The reason is that the restart is currently represented by an update operation (first delete, then create once the dependencies are met). The delete requested by the RuntimeManager runs in a separate task, and the create is executed while the delete is still in progress. The create fails because the old container with the same name still exists on the runtime, so the WorkloadControlLoop retries the create several times. In the end, the create succeeds once the old workload has been deleted on the runtime.

The delays can be avoided by switching the restart operation to a real bundle start instead of an update (delete + create).

windsource commented 1 month ago

The workload deletion and re-creation happens in https://github.com/eclipse-ankaios/ankaios/blob/9ed3c73c1eaf16d62f9e9a9444dba3834512db0d/agent/src/runtime_manager.rs#L271-L276. To change that, the following steps are required:

  1. As the deletion/re-creation might be OK for some runtimes and not for others, the get_reusable_workloads() function needs to provide that info.
  2. If a workload shall not be re-created but restarted from a stopped state, either a new method in addition to create_workload is required or create_workload needs to retrieve that info from a cache somehow.
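
One possible shape for these two steps is sketched below, using the "new method" option from step 2. It is a simplified, synchronous stand-in and not the real Ankaios runtime trait: ReuseStrategy, ReusableWorkload, start_workload, resume_workload, resume_all and RecordingRuntime are hypothetical names, and the signatures of get_reusable_workloads and create_workload are assumptions made for illustration.

```rust
/// Hypothetical: how a runtime wants an existing, stopped workload handled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReuseStrategy {
    /// Delete the old workload and create a new one (current behaviour).
    Recreate,
    /// Start the existing bundle again (proposed behaviour).
    Restart,
}

/// Hypothetical entry returned for each reusable workload (step 1).
#[derive(Debug, Clone, PartialEq, Eq)]
struct ReusableWorkload {
    name: String,
    strategy: ReuseStrategy,
}

/// Simplified stand-in for a runtime connector.
trait Runtime {
    fn get_reusable_workloads(&self) -> Vec<ReusableWorkload>;
    fn create_workload(&mut self, name: &str) -> Result<(), String>;
    fn delete_workload(&mut self, name: &str) -> Result<(), String>;
    /// Hypothetical new method next to create_workload (step 2).
    fn start_workload(&mut self, name: &str) -> Result<(), String>;
}

/// Restart or re-create one workload, depending on what the runtime reported.
fn resume_workload<R: Runtime>(runtime: &mut R, workload: &ReusableWorkload) -> Result<(), String> {
    match workload.strategy {
        ReuseStrategy::Restart => runtime.start_workload(&workload.name),
        ReuseStrategy::Recreate => {
            runtime.delete_workload(&workload.name)?;
            runtime.create_workload(&workload.name)
        }
    }
}

/// Apply resume_workload to everything the runtime reports as reusable.
fn resume_all<R: Runtime>(runtime: &mut R) -> Result<(), String> {
    for workload in runtime.get_reusable_workloads() {
        resume_workload(runtime, &workload)?;
    }
    Ok(())
}

/// Test double that records the calls it receives instead of talking to podman.
#[derive(Default)]
struct RecordingRuntime {
    reusable: Vec<ReusableWorkload>,
    calls: Vec<String>,
}

impl Runtime for RecordingRuntime {
    fn get_reusable_workloads(&self) -> Vec<ReusableWorkload> {
        self.reusable.clone()
    }
    fn create_workload(&mut self, name: &str) -> Result<(), String> {
        self.calls.push(format!("create {}", name));
        Ok(())
    }
    fn delete_workload(&mut self, name: &str) -> Result<(), String> {
        self.calls.push(format!("delete {}", name));
        Ok(())
    }
    fn start_workload(&mut self, name: &str) -> Result<(), String> {
        self.calls.push(format!("start {}", name));
        Ok(())
    }
}
```

In this sketch, the decision lives with the runtime: a runtime for which deletion/re-creation is fine reports ReuseStrategy::Recreate, while one that benefits from re-using the bundle reports ReuseStrategy::Restart. The cache-based alternative mentioned in step 2 would instead keep the create_workload signature unchanged and hide the decision inside the runtime.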