eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Agent sends workload states of old workload after workload was updated #290

Closed inf17101 closed 4 months ago

inf17101 commented 4 months ago

When the speed-provider is updated to use the auto mode, as described in the user tutorial, the ank apply command updates the workload successfully. However, the output of ank apply and ank get workloads still shows the old speed-provider workload as Stopping(Stopping) or Failed(ExecFailed), even after the old workload was removed by the update operation.


Current Behavior

Expected Behavior

Steps to Reproduce

  1. Execute the steps of the user tutorial and update the speed-provider as shown there
  2. Look at the execution states in the ank apply output, then execute ank get workloads

If the user tutorial is executed without running the Ankaios agent with root privileges and using podman as the root user, the execution state is Failed(ExecFailed) instead of Stopping(Stopping).

Context (Environment)

Linux amd64 Ank cli

Logs

Additional Information

Final result

The WorkloadControlLoop now checks the validity of the WorkloadStates held in its buffers, so that no old WorkloadStates are forwarded after its internal workload has already been updated. When executing the ank apply command, the old workload is now shown as Removed instead of Stopping, and the updated workload is shown with the workload state Running(Ok).
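
The validity check can be pictured with a minimal sketch. All type and field names below are hypothetical and simplified for illustration; they are not the actual Ankaios types. The sketch assumes each workload state carries an instance name that differs between the old and the updated workload.

```rust
// Hypothetical, simplified types for illustration only (not the Ankaios API).
#[derive(Debug, Clone, PartialEq)]
enum ExecutionState {
    Running,
    Stopping,
    Failed,
}

#[derive(Debug, Clone, PartialEq)]
struct WorkloadState {
    // Assumption: identifies one concrete version of a workload,
    // so the old and the updated workload have different values.
    instance_name: String,
    execution_state: ExecutionState,
}

struct ControlLoopState {
    // Instance name of the workload the loop currently manages.
    current_instance: String,
}

impl ControlLoopState {
    /// Returns the state only if it belongs to the currently managed workload.
    fn validate(&self, state: WorkloadState) -> Option<WorkloadState> {
        if state.instance_name == self.current_instance {
            Some(state)
        } else {
            None
        }
    }
}

/// Keeps only the buffered states that are still valid for the control loop.
fn filter_buffer(control_loop: &ControlLoopState, buffer: Vec<WorkloadState>) -> Vec<WorkloadState> {
    buffer
        .into_iter()
        .filter_map(|state| control_loop.validate(state))
        .collect()
}

fn state(instance: &str, execution_state: ExecutionState) -> WorkloadState {
    WorkloadState {
        instance_name: instance.to_string(),
        execution_state,
    }
}

fn main() {
    // The loop has already switched to the updated workload.
    let control_loop = ControlLoopState {
        current_instance: "speed-provider.new".to_string(),
    };
    let buffer = vec![
        state("speed-provider.old", ExecutionState::Stopping),
        state("speed-provider.old", ExecutionState::Failed),
        state("speed-provider.new", ExecutionState::Running),
    ];
    // Only the state of the updated workload is left to forward.
    println!("{:?}", filter_buffer(&control_loop, buffer));
}
```

In this sketch the late Stopping and Failed reports of the old instance are dropped, and only the Running state of the updated instance remains.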

inf17101 commented 4 months ago

The reason is that the state checker sent the Stopping(Stopping) or Failed(ExecFailed) execution state, because podman moves the container into one of those states before actually removing it. The tokio::select! inside the WorkloadControlLoop listens for those workload states and forwards them even if the workload was already updated inside the WorkloadControlLoop.

I will implement and test a fix that checks in the WorkloadControlLoop whether the workload state still matches the workload currently managed by the control loop.

inf17101 commented 4 months ago

@krucod3: I decided on a fix that prevents the old workload state from being sent at the workload state sender (WorkloadControlLoop), not at the workload state receiver (transition function). For now it is only the production code change, and it seems to work when re-running the user tutorial where the issue was noticed. What do you think?

If it's fine, I will continue with the tests and requirement adaptations.

inf17101 commented 4 months ago

PR was reviewed and merged.