eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Restarting the Ankaios Server while Ankaios Agents are running results in restarting workloads by agents #16

Open windsource opened 1 year ago

windsource commented 1 year ago

Reported by @krucod3:

Current Behavior

If you start an Agent and a Server and then restart the Server, the Agent receives the initial workload list again. If the list contains workloads that the Agent recognizes as already running, the Agent tries to update them to account for a possibly changed start config. Currently, this update restarts the workload.

This restart can be avoided if the AgentManager or the WorkloadFacade can recognize that the new and the old WorkloadExecutionInstanceName are the same.
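A minimal sketch of that comparison, not the actual Ankaios code: `InstanceName` is a hypothetical stand-in for WorkloadExecutionInstanceName, and the assumption that it is built from the workload name, the agent name, and a hash of the start config is made only for this illustration. `Action` and `decide` are hypothetical names as well.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for WorkloadExecutionInstanceName.
/// Assumption: the name encodes the workload, the agent, and a config hash.
#[derive(Debug, Clone, PartialEq, Eq)]
struct InstanceName {
    workload_name: String,
    agent_name: String,
    config_hash: u64,
}

impl InstanceName {
    fn new(workload_name: &str, agent_name: &str, start_config: &str) -> Self {
        // DefaultHasher is good enough for a sketch; its output is not
        // guaranteed to be stable across Rust releases.
        let mut hasher = DefaultHasher::new();
        start_config.hash(&mut hasher);
        InstanceName {
            workload_name: workload_name.to_string(),
            agent_name: agent_name.to_string(),
            config_hash: hasher.finish(),
        }
    }
}

/// What the agent does with a workload from the received list.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Start,   // nothing is running under this name yet
    Reuse,   // same instance name: leave the running workload untouched
    Replace, // the start config changed: delete and recreate
}

fn decide(running: Option<&InstanceName>, desired: &InstanceName) -> Action {
    match running {
        None => Action::Start,
        Some(current) if current == desired => Action::Reuse,
        Some(_) => Action::Replace,
    }
}
```

With this check, a workload list that is re-sent after a Server restart yields `Action::Reuse` for every unchanged workload, while a workload whose start config really changed still yields `Action::Replace`.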

Expected Behavior

It has to be clarified whether this is the desired behavior. Restarting the Server is not a use case in the scope of Ankaios. If we don't restart the workloads on update, we cannot restart at all ...

Maybe we just need to document the behavior.

Steps to Reproduce

See above

Context (Environment)

Logs

Additional Information

inf17101 commented 6 months ago

@krucod3, @christoph-hamm: Currently, the inter-workload dependencies for deleted workloads cannot be considered when an Ankaios server or agent restarts. The information about the delete dependencies is not available inside the agents after an Ankaios component has restarted. Therefore, if a workload must be replaced because its config changed after the restart, the workload is immediately scheduled for deletion.

Maybe, for proper reuse behavior, the Ankaios agents could send the workloads they have found on the runtime via the AgentHello message. The server, as the single point of truth, shall then respond with an UpdateWorkload message containing all the information about what the Agent shall do with the new and existing workloads.
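A rough sketch of this proposal, under stated assumptions: the `found_instances` field on `AgentHello`, the `added` and `deleted` fields on `UpdateWorkload`, the `reconcile` function, and the instance name strings are all hypothetical and only illustrate the idea. They are not the existing Ankaios message definitions.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical AgentHello that also carries the instance names of the
/// workloads the agent found on its runtime after (re)connecting.
struct AgentHello {
    agent_name: String,
    found_instances: BTreeSet<String>,
}

/// Hypothetical, simplified UpdateWorkload answer from the server.
#[derive(Debug, PartialEq, Eq)]
struct UpdateWorkload {
    added: Vec<String>,   // instances the agent has to start
    deleted: Vec<String>, // instances the agent has to delete
}

/// Server-side reconciliation: compare what the agent reports with the
/// desired state (agent name -> instance names) held by the server.
/// Instances present on both sides are reused and appear in neither list.
fn reconcile(
    hello: &AgentHello,
    desired_state: &BTreeMap<String, BTreeSet<String>>,
) -> UpdateWorkload {
    let empty = BTreeSet::new();
    // An agent unknown to the server has no desired workloads.
    let desired = desired_state.get(&hello.agent_name).unwrap_or(&empty);
    UpdateWorkload {
        added: desired.difference(&hello.found_instances).cloned().collect(),
        deleted: hello.found_instances.difference(desired).cloned().collect(),
    }
}
```

Because the server computes both lists, it could also order or delay the deletions according to the delete dependencies that the agents no longer know about after a restart.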

If new behavior for Ankaios restarts is planned, we also need to consider the inter-workload dependencies in relation to the reuse behavior. So, whoever takes this issue shall also consider this during planning and implementation.