eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Re-connected Ankaios agent not receiving deleted workload when no added workloads are sent #358

Closed inf17101 closed 1 month ago

inf17101 commented 2 months ago

When an Ankaios agent has exactly one workload assigned and that workload is deleted while the agent is disconnected, the agent does not delete the workload once it reconnects to the Ankaios server.

Current Behavior

If an agent has only one workload and that workload is deleted while the agent is disconnected, the workload keeps running after the agent reconnects, even though it is no longer part of the desired state. The agent is blind and does nothing with this workload on startup, because the Ankaios server sends only the added workloads and the workload states when an agent connects.

If the agent has more than one workload assigned and the other workloads are not deleted while the agent is disconnected, the agent receives added workloads from the server and its initial resume/replace logic runs, which correctly deletes the deleted workload. In this situation the agent correctly enforces the desired state.
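The difference between the two cases can be illustrated with a small model. This is only a sketch, not Ankaios code: the function names are hypothetical, workloads are reduced to plain names, and `None` stands for "the server sends no message at all". It assumes the agent's resume/replace step runs only when a message with added workloads arrives.

```rust
use std::collections::BTreeSet;

/// Hypothetical model of the server's first update to a connecting agent.
/// `None` means that no message is sent because nothing was added.
fn initial_update_before_fix(desired_for_agent: &[&str]) -> Option<Vec<String>> {
    if desired_for_agent.is_empty() {
        None
    } else {
        Some(desired_for_agent.iter().map(|name| name.to_string()).collect())
    }
}

/// Hypothetical resume/replace step: every workload found running on the
/// agent that is missing from the received list is scheduled for deletion.
fn workloads_to_delete(running: &[&str], added: &[String]) -> Vec<String> {
    let added: BTreeSet<&str> = added.iter().map(String::as_str).collect();
    running
        .iter()
        .filter(|name| !added.contains(**name))
        .map(|name| name.to_string())
        .collect()
}

/// The agent reconciles only if it actually receives an update.
fn deleted_on_reconnect(running: &[&str], desired_for_agent: &[&str]) -> Vec<String> {
    match initial_update_before_fix(desired_for_agent) {
        Some(added) => workloads_to_delete(running, &added),
        None => Vec::new(), // no message, so the stale workload is never noticed
    }
}
```

In this model, `deleted_on_reconnect(&["nginx"], &[])` yields an empty list, so nginx keeps running, while `deleted_on_reconnect(&["nginx", "hello1"], &["hello1"])` yields `["nginx"]` because the update for the remaining workload triggers the reconciliation.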

Expected Behavior

An Ankaios agent shall always enforce the desired state of the Ankaios server and shall not be out of sync.

Steps to Reproduce

  1. Start an Ankaios server with server/resources/startConfig.yaml
  2. Start one Ankaios agent with the name agent_A so that only the nginx workload runs (the rest of the workloads in the startConfig can be ignored for this scenario)
  3. Disconnect the agent after its workload nginx has the running state
  4. Execute ank -k set state desiredState emptyState.yaml, where emptyState.yaml contains an empty state, to delete all workloads, or use ank -k delete workload nginx
  5. Start agent_A again in the other terminal window.
  6. Check the logs of server and agent

The agent connects to the server and waits for new commands from it, but since the desired state no longer contains any workloads for that agent, the server sends nothing to this agent.

Context (Environment)

Logs

Additional Information

Final result

To be filled by the one closing the issue.

GabyUnalaq commented 2 months ago

Currently working on this.

GabyUnalaq commented 1 month ago

After discussions with @krucod3 and @inf17101, the preferred approach will be as follows.

A new message called ServerHello (the name can be changed) is created as the response to AgentHello. It carries the added workloads of the agent that connected. This way, the existing functionality remains the same and this corner case is also covered.

This separates the server's first response from the regular updates, which improves the overall logic of the code and makes future improvements to agent configuration possible.
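The proposed handshake could look roughly like the following sketch. The struct fields, function names and the `(workload, agent)` pair representation of the desired state are assumptions made for illustration; the types merged in Ankaios may differ. The key point is that the server always answers AgentHello, even with an empty list, so the agent can always reconcile.

```rust
/// Hypothetical shape of the agent's greeting.
#[derive(Debug, PartialEq)]
struct AgentHello {
    agent_name: String,
}

/// Hypothetical shape of the server's reply; the list may be empty.
#[derive(Debug, PartialEq)]
struct ServerHello {
    added_workloads: Vec<String>,
}

/// Server side: always reply, listing the workloads assigned to this agent.
/// `desired_state` is modelled as (workload name, agent name) pairs.
fn reply_to_agent_hello(hello: &AgentHello, desired_state: &[(&str, &str)]) -> ServerHello {
    let added_workloads = desired_state
        .iter()
        .filter(|(_, agent)| *agent == hello.agent_name)
        .map(|(workload, _)| workload.to_string())
        .collect();
    ServerHello { added_workloads }
}

/// Agent side: anything still running that the reply does not list is stale.
fn workloads_to_delete_on_hello(running: &[&str], hello: &ServerHello) -> Vec<String> {
    running
        .iter()
        .filter(|name| !hello.added_workloads.iter().any(|w| w == *name))
        .map(|name| name.to_string())
        .collect()
}
```

With an empty reply for agent_A and nginx still running locally, `workloads_to_delete_on_hello` returns `["nginx"]`, which is the case the original behavior missed.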

krucod3 commented 1 month ago

@GabyUnalaq, PR merged.