eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Ank CLI set state wait mode gets stuck when a new state deletes workloads that were never initially started #320

Open inf17101 opened 2 months ago

inf17101 commented 2 months ago

When using ank set state in wait mode, if the new complete state removes workloads that are in the current desired state but were never initially started because their Ankaios agent is not running, the wait mode gets stuck and the command hangs.

Current Behavior

Ank CLI set state hangs and never returns: [image]

Expected Behavior

Set state shall recognize that the removed workloads were never initially started. It shall not get stuck.

Steps to Reproduce

To reproduce the issue, you can use the example startConfig.yaml provided in the repository. The config contains one workload for agent_A and three other workloads for agent_B.

  1. ./ank-server -c /path/to/ankaios_repository/server/resources/startConfig.yaml
  2. ./ank-agent --name agent_A
  3. Do not start the agent_B
  4. The workload on agent_A shall be operational. The other three workloads of agent_B shall be pending.
  5. ./ank get state > new_state.yml
  6. Edit new_state.yml and remove all three workloads of agent_B
  7. ./ank set state desiredState.workloads -f new_state.yml
  8. The command does not return (it gets stuck)

Context (Environment)

Ank CLI, ank set state command, all supported platforms

Logs

Additional Information

Final result

To be filled by the one closing the issue.

inf17101 commented 1 month ago

Analysis of the issue shows that the server does not send the WorkloadState with ExecutionState::Removed for workloads that are in Pending::Initial and were removed by the previously executed update state. To handle this situation correctly and to introduce a proper fix, the server needs to know about the connected agents. The list agents feature #155 is scheduled for the next release, so the bug fix, which relies on this feature, will be moved to the next release too.
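
To illustrate the analysis above, here is a minimal sketch of the decision the server would have to make. It is not the Ankaios implementation: the types `ExecutionState` and `WorkloadState` are simplified stand-ins, and the function `removed_states_from_server` and the `connected_agents` set are hypothetical names chosen for this example. The assumption is that an agent which is not connected can never report the removal, so the server has to publish the removed state itself.

```rust
use std::collections::HashSet;

// Simplified, hypothetical stand-ins for the Ankaios types (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum ExecutionState {
    PendingInitial, // scheduled, but never started by an agent
    Running,
    Removed,
}

#[derive(Debug, Clone, PartialEq)]
struct WorkloadState {
    workload: String,
    agent: String,
    state: ExecutionState,
}

/// For the workloads deleted by an update, return the `Removed` states the
/// server would have to publish itself: those of workloads that are still in
/// the initial pending state and whose agent is not connected. For all other
/// workloads, the owning agent is expected to report the removal.
fn removed_states_from_server(
    deleted: &[WorkloadState],
    connected_agents: &HashSet<String>,
) -> Vec<WorkloadState> {
    deleted
        .iter()
        .filter(|w| {
            w.state == ExecutionState::PendingInitial
                && !connected_agents.contains(&w.agent)
        })
        .map(|w| WorkloadState {
            state: ExecutionState::Removed,
            ..w.clone()
        })
        .collect()
}
```

In the reproduction scenario, only agent_A would be in `connected_agents`, so the three deleted workloads of agent_B would each get a `Removed` state from the server and a waiting CLI could finish.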

inf17101 commented 1 week ago

Currently working on this. The topic is a little more complex to solve than initially thought: there were more cases that get stuck, such as unscheduled workloads, and the way the wait list is currently initialized on the main branch with data from the requested complete state is not correct for some corner cases.

I have committed a first solution for most of the stuck cases to the branch https://github.com/eclipse-ankaios/ankaios/tree/320_fix_deleted_pending_workloads. However, it currently contains only a basic implementation for testing that the stuck wait list and the missing information in the table of deleted pending workloads are fixed. Nevertheless, I need to refactor the code: I want to get rid of the BTreeMap, which I initially thought I needed when starting to implement the bug fix, and I will eliminate the second get_complete_state request, since the newly added workloads can be taken from the newly constructed complete state.
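
The idea of taking the added workloads from the newly constructed complete state can be sketched as a simple set difference. This is only an illustration under assumptions, not the code on the branch: `WorkloadDiff` and `diff_workloads` are hypothetical names, and workloads are reduced to their names.

```rust
use std::collections::HashSet;

/// Hypothetical result type: workload names added and deleted by an update.
#[derive(Debug, PartialEq)]
struct WorkloadDiff {
    added: Vec<String>,
    deleted: Vec<String>,
}

/// Compare the workload names of the current desired state with those of the
/// newly constructed one. Both sets are already available when the update is
/// built, so no additional state request is needed to find the added ones.
fn diff_workloads(current: &HashSet<String>, new: &HashSet<String>) -> WorkloadDiff {
    // Names present only in the new state were added by the update.
    let mut added: Vec<String> = new.difference(current).cloned().collect();
    // Names present only in the current state were deleted by the update.
    let mut deleted: Vec<String> = current.difference(new).cloned().collect();
    // Sort for a deterministic order, since HashSet iteration order is arbitrary.
    added.sort();
    deleted.sort();
    WorkloadDiff { added, deleted }
}
```

A wait list could then be initialized from `added` and `deleted`, with the deleted pending workloads handled separately as described in the analysis.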

I will continue on this after my vacation, since tests and requirements must be adapted as well.