eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Ankaios CLI "ank get workloads" does not list workload state "Stopping" and differs from "ank get state" #134

Closed inf17101 closed 9 months ago

inf17101 commented 10 months ago

When executing "ank get workloads" while a container is in the podman state "Stopping", the Ankaios CLI does not output the workload with its state "Stopping".

If "ank get state" is executed instead, it lists ExecutionState "ExecStopping" instead during podman reports the state as "Stopping".

Related PR for improvement of execution states: #127

Current Behavior

"ank get workloads" does not output execution state "Stopping" (only "Running" is output). This is different to "ank get state".

Expected Behavior

"ank get workloads" shall output all execution states besides "Removed".

Steps to Reproduce

Use the current main branch at 8964d65

Ankaios Server startConfig.yaml:

workloads:
  hello1:
    runtime: podman
    agent: agent_B
    restart: true
    updateStrategy: AT_MOST_ONCE
    accessRights:
      allow: []
      deny: []
    tags:
      - key: owner
        value: Ankaios team
    runtimeConfig: |
      image: alpine:latest
      commandOptions: [ "--entrypoint", "/bin/sleep"]
      commandArgs: [ "2000"]
  1. Start a server with the initial startup state mentioned above. ank-server -c /tmp/startConfig.yaml
  2. Run ank delete workload hello1 to delete the workload hello1.
  3. Run ank get workloads and note that the workload "hello1" is missing although it should be listed with execution state "Stopping" (the delete takes a few seconds, so this state should become visible).
  4. Repeat the steps with ank get state instead of ank get workloads to see that this time the execution state is output correctly while podman deletes the workload.

Context (Environment)

Podman 4.6.2 Linux

Logs


Additional Information

Final result

The CLI has been fixed to show the stopping workload. See #154

krucod3 commented 10 months ago

The delete command currently forces Ankaios to kick the workload out of the current state. The ank CLI depends on the workloads section in the current state of the get state command and ignores the state-only section if the workloads don't have a corresponding entry in the current state. This is actually caused by overly simple logic in the Ankaios server. The workload should not be deleted from the current state directly, but moved to stopping and deleted only after a state message for the successful delete is received.
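The listing problem described above can be sketched in Rust. This is a minimal, hypothetical model, not the actual CLI code: the type and function names are illustrative assumptions.

```rust
use std::collections::HashMap;

// Hypothetical, simplified model of the CLI's listing logic; the type and
// function names are illustrative, not the real Ankaios internals.
#[derive(Debug, Clone, PartialEq)]
enum ExecutionState {
    Running,
    Stopping,
    Removed,
}

// Buggy behavior: only workloads still present in the currentState workloads
// section are listed, so a workload that the server already kicked out of
// currentState never shows up, even though its state is "Stopping".
fn list_workloads_buggy(
    current_state: &[&str],
    states: &HashMap<String, ExecutionState>,
) -> Vec<String> {
    current_state
        .iter()
        .filter(|name| states.contains_key(**name))
        .map(|name| name.to_string())
        .collect()
}

// Fixed behavior: list every workload with a reported execution state except
// "Removed", regardless of whether currentState still contains it.
fn list_workloads_fixed(states: &HashMap<String, ExecutionState>) -> Vec<String> {
    let mut names: Vec<String> = states
        .iter()
        .filter(|(_, s)| **s != ExecutionState::Removed)
        .map(|(name, _)| name.clone())
        .collect();
    names.sort();
    names
}
```

With a "hello1" that is already gone from currentState but still reported as Stopping, the buggy variant lists only "nginx" while the fixed variant lists both.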

maturar commented 10 months ago

I have analyzed how the server handles deleting workloads. I can see three scenarios.

  1. Delete the workload as described in the ticket.
  2. Do the same, but with ank set state instead of ank delete workload
  3. Use ank set state, but with the object mask pointing to the workload being deleted.

Case no. 1 in more detail.

The server:

Case no. 2 in more detail. This scenario is the same for the server; it differs only in how the user triggers it in the CLI. In this case the user sends ank set state with the object mask currentState. For the server it is the same kind of request over the interface. This case can be used when the user wants to make more changes in the config, e.g. change or delete several workloads at once.

Case no. 3 in more detail.

requestId: ank-cli
startupState:
  workloads: {}
  configs: {}
  cronJobs: {}
currentState:
  workloads:
    hello1:
      agent: agent_B
      name: hello1
      tags:
      - key: owner
        value: Ankaios team
      dependencies: {}
      updateStrategy: AT_MOST_ONCE
      restart: true
      accessRights:
        allow: []
        deny: []
      runtime: podman
      runtimeConfig: |
        image: alpine:latest
        commandOptions: [ "--rm"]
        commandArgs: [ "echo", "Hello Ankaios"]
    hello2:
      agent: agent_B
      name: hello2
      tags:
      - key: owner
        value: Ankaios team
      dependencies: {}
      updateStrategy: AT_MOST_ONCE
      restart: true
      accessRights:
        allow: []
        deny: []
      runtime: podman
      runtimeConfig: |
        image: alpine:latest
        commandArgs: [ "echo", "Hello Ankaios"]
    nginx:
      agent: agent_A
      name: nginx
      tags:
      - key: owner
        value: Ankaios team
      dependencies: {}
      updateStrategy: AT_MOST_ONCE
      restart: true
      accessRights:
        allow: []
        deny: []
      runtime: podman
      runtimeConfig: |
        image: docker.io/nginx:latest
        commandOptions: ["-p", "8081:80"]
  configs: {}
  cronJobs: {}
workloadStates:
- workloadName: nginx
  agentName: agent_A
  executionState: ExecRunning
- workloadName: hello1
  agentName: agent_B
  executionState: ExecRemoved
- workloadName: hello2
  agentName: agent_B
  executionState: ExecSucceeded
- workloadName: hello-pod
  agentName: agent_B
  executionState: ExecRunning

The difference (compared to the start config) is that the workload hello-pod is deleted.

./ank set state --file updateState.yaml currentState.workloads.hello-pod

The important point is that the object mask refers to the workload hello-pod, which has been removed in the update config. This way we want to delete the workload hello-pod and nothing else. Now the server must do something slightly different compared to the previous two cases.

maturar commented 10 months ago

The key logic described in the previous comment lives in the server in update_state.rs, in the function update_state.

I agree with Kaloyan that the server shall not remove the workload directly, but set the state to stopping instead. The tricky part is that the function update_state can remove a workload in two ways: explicitly and implicitly.

The function deletes the workload explicitly with the code:

    } else if new_state.remove(&field).is_err() {
        return Err(UpdateStateError::FieldNotFound(field.into()));
    }

This code is used in scenario no. 3, when the object mask points to a workload that has been deleted in the update.

The implicit (or rather "silent") way of deleting a workload is done with this code:

    if new_state.set(&field, field_from_update.to_owned()).is_err() {
        return Err(UpdateStateError::FieldNotFound(field.into()));
    }

This code is used in scenarios no. 1 and 2. The object mask is set to currentState (i.e. to the root) and the deleted workload is simply absent from the received complete state (the update).

I still have to think about how to fix the bug reported here while supporting all scenarios described above. It probably means reimplementing both functions in update_state.rs: update_state and prepare_update_workload.
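The two removal paths, together with the proposed "move to stopping" behavior, could be sketched like this. This is a hypothetical miniature assuming a flat map of workload phases instead of the real masked complete state; all names are illustrative.

```rust
use std::collections::HashMap;

// Hypothetical miniature of the two deletion paths in update_state; the real
// code operates on field masks of the complete state, not on a flat map.
#[derive(Debug, Clone, PartialEq)]
enum Phase {
    Running,
    Stopping,
}

type Workloads = HashMap<String, Phase>;

// Explicit path (scenario no. 3): the object mask names one workload. If the
// update no longer contains it, transition it to Stopping instead of removing
// it from the current state right away.
fn apply_with_mask(state: &mut Workloads, update: &Workloads, mask: &str) {
    match update.get(mask) {
        Some(phase) => {
            state.insert(mask.to_string(), phase.clone());
        }
        None => {
            if let Some(phase) = state.get_mut(mask) {
                *phase = Phase::Stopping;
            }
        }
    }
}

// Implicit path (scenarios no. 1 and 2): the mask is the root, so the whole
// workloads section is replaced; anything missing from the update would
// silently disappear, so it is transitioned to Stopping first instead.
fn apply_at_root(state: &mut Workloads, update: &Workloads) {
    let existing: Vec<String> = state.keys().cloned().collect();
    for name in existing {
        if !update.contains_key(&name) {
            state.insert(name, Phase::Stopping);
        }
    }
    for (name, phase) in update {
        state.insert(name.clone(), phase.clone());
    }
}
```

In both paths the workload stays in the state as Stopping; the actual removal would happen later, once the agent confirms the delete.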

maturar commented 9 months ago

Status update: I have discussed the current implementation with Kaloyan. The change here depends on two other issues:

#149 and #156. We have to change the implementation here to be consistent with the changes in the other two pull requests.

In this ticket we have to:

maturar commented 9 months ago

We had another discussion with Kaloyan and Christoph. We have agreed that we shall not change the behavior of the StateChangeCommand::UpdateWorkloadState (as described in the previous comment). The workload shall not be removed from the current state when the container has the autoremove flag. In other words, we shall make the change only in ank get workloads.

krucod3 commented 9 months ago

When a workload disappears from the runtime, it shall be handled with an extra state and not the same way as if Ankaios had deleted the workload. These changes will be made outside of the current PR.
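A minimal sketch of what such an extra state could look like. The variant name Lost and the helper function are purely illustrative assumptions, not the real Ankaios API:

```rust
// Hypothetical execution states; the variant Lost is an illustrative
// assumption for a workload that vanished on the runtime side.
#[derive(Debug, Clone, PartialEq)]
enum ExecState {
    Running,
    Stopping, // Ankaios is deleting the workload
    Removed,  // the delete was confirmed
    Lost,     // the runtime lost the workload without Ankaios deleting it
}

// Map a runtime observation to the state reported to the user.
fn on_runtime_report(found_on_runtime: bool, previous: ExecState) -> ExecState {
    match (found_on_runtime, previous) {
        // The workload vanished although Ankaios never requested a delete,
        // e.g. a container with the autoremove flag exited on its own.
        (false, ExecState::Running) => ExecState::Lost,
        // A vanishing workload during a delete is the expected outcome.
        (false, ExecState::Stopping) => ExecState::Removed,
        (_, state) => state,
    }
}
```

This keeps the "Ankaios deleted it" and "the runtime lost it" cases distinguishable for the user.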

maturar commented 9 months ago

The PR has been merged into main -> closing the ticket.