eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Support eventual consistency #298

Open windsource opened 2 months ago

windsource commented 2 months ago

Description

When a new manifest is applied, Ankaios currently retries a failed workload start a fixed number of times and then gives up. The workload remains in state Pending, subState StartingFailed. The start can fail for different reasons.

While some of the problems cannot be solved without changing the manifest (e.g. invalid options), others might disappear after some time (e.g. a registry that is not available or a folder that does not exist yet).

Some users expect that Ankaios keeps trying to reach the desired state and also that Ankaios provides the result of the latest try (e.g. the Podman error message).
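To make the expectation concrete, here is a minimal sketch of such a retry loop. It is not Ankaios code: `WorkloadStatus`, `reconcile`, the backoff parameters and the `Running`/`Ok` names are assumptions made for illustration, and only `Pending`/`StartingFailed` come from the description above. The loop retries with exponential backoff and always stores the error message of the latest try.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WorkloadStatus:
    # Hypothetical status record; field names are illustrative only.
    state: str = "Pending"
    sub_state: str = "Initial"
    last_error: Optional[str] = None  # message of the latest failed try
    attempts: int = 0


def reconcile(
    start: Callable[[], None],
    status: WorkloadStatus,
    sleep: Callable[[float], None],
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    max_attempts: Optional[int] = None,
) -> WorkloadStatus:
    """Call start() until it succeeds (or max_attempts is reached).

    max_attempts=None means "retry forever", which is the eventual
    consistency behaviour asked for here. The delay doubles after each
    failure and is capped at max_delay.
    """
    delay = initial_delay
    while max_attempts is None or status.attempts < max_attempts:
        status.attempts += 1
        try:
            start()
        except RuntimeError as err:
            # Keep the workload visible as failed and expose the latest error.
            status.state, status.sub_state = "Pending", "StartingFailed"
            status.last_error = str(err)
            sleep(delay)
            delay = min(delay * 2, max_delay)
        else:
            # "Running"/"Ok" are assumed names for the success state.
            status.state, status.sub_state = "Running", "Ok"
            status.last_error = None
            return status
    return status
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a real agent would more likely schedule the next try on a timer instead of blocking.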

Goals

Final result

Summary

To be filled when the final solution is sketched.

Tasks

windsource commented 2 months ago

Maybe we can also have an optional maximum time after which Ankaios stops trying to reach the desired state. The parameter could be part of a config file (see #302).
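A rough sketch of how such an optional limit could behave, under assumptions: `retry_until_deadline`, `max_retry_seconds` and `retry_interval` are hypothetical names and not existing Ankaios options or config keys. With `max_retry_seconds=None` the loop never gives up; otherwise it stops once the elapsed time reaches the limit and returns the latest error.

```python
import time
from typing import Callable, Optional, Tuple


def retry_until_deadline(
    start: Callable[[], None],
    max_retry_seconds: Optional[float] = None,  # hypothetical config value
    retry_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, Optional[str]]:
    """Retry start() until it succeeds or the time budget is used up.

    Returns (succeeded, last_error). last_error is the message of the
    latest failed try, or None on success.
    """
    began = clock()
    while True:
        try:
            start()
            return True, None
        except RuntimeError as err:
            last_error = str(err)
        # Give up only if a limit is configured and it has been reached.
        if max_retry_seconds is not None and clock() - began >= max_retry_seconds:
            return False, last_error
        sleep(retry_interval)
```

The clock and sleep functions are injected so the behaviour can be checked with a fake clock; a monotonic clock is used by default so that wall-clock adjustments do not shorten or extend the budget.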

inf17101 commented 2 months ago

Builds upon #67 (PR #137)