eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Ankaios CLI "ank get workloads" does not list workload state "Stopping" and differs from "ank get state" #134

Closed inf17101 closed 9 months ago

inf17101 commented 10 months ago

When executing "ank get workloads" while a container is in the podman state "Stopping", the Ankaios CLI does not output the workload with its state "Stopping".

If "ank get state" is executed instead, it lists ExecutionState "ExecStopping" instead during podman reports the state as "Stopping".

Related PR for improvement of execution states: #127

Current Behavior

"ank get workloads" does not output execution state "Stopping" (only "Running" is output). This is different to "ank get state".

Expected Behavior

"ank get workloads" shall output all execution states besides "Removed".

Steps to Reproduce

Use the current main branch at 8964d65

Ankaios Server startConfig.yaml:

workloads:
  hello1:
    runtime: podman
    agent: agent_B
    restart: true
    updateStrategy: AT_MOST_ONCE
    accessRights:
      allow: []
      deny: []
    tags:
      - key: owner
        value: Ankaios team
    runtimeConfig: |
      image: alpine:latest
      commandOptions: [ "--entrypoint", "/bin/sleep"]
      commandArgs: [ "2000"]
  1. Start a server with the initial startup state mentioned above. ank-server -c /tmp/startConfig.yaml
  2. Run ank delete workload hello1 to delete the workload hello1.
  3. Run ank get workloads and note that the workload "hello1" is missing although it should be listed with execution state "Stopping" (the delete takes a few seconds, so this state should become visible).
  4. Repeat the steps with ank get state instead of ank get workloads to see that this time the execution state is output correctly while podman deletes the workload.

Context (Environment)

Podman 4.6.2 Linux

Logs


Additional Information

Final result

The CLI has been fixed to show the stopping workload. See #154

krucod3 commented 10 months ago

The delete command currently forces Ankaios to kick the workload out of the current state. The ank CLI depends on the workloads section in the current state of the get state command and ignores the state-only section if the workloads don't have a corresponding entry in the current state. This is actually caused by overly simple logic in the Ankaios server. The workload should not be deleted from the current state directly, but moved to stopping and deleted only after a state message for the successful delete is received.
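The listing problem described above can be sketched in Rust. This is a minimal, hypothetical model, not the actual CLI code: the type and function names are illustrative assumptions.

```rust
use std::collections::HashMap;

// Hypothetical, simplified model of the CLI's listing logic; the type and
// function names are illustrative, not the real Ankaios internals.
#[derive(Debug, Clone, PartialEq)]
enum ExecutionState {
    Running,
    Stopping,
    Removed,
}

// Buggy behavior: only workloads still present in the currentState workloads
// section are listed, so a workload that the server already kicked out of
// currentState never shows up, even though its state is "Stopping".
fn list_workloads_buggy(
    current_state: &[&str],
    states: &HashMap<String, ExecutionState>,
) -> Vec<String> {
    current_state
        .iter()
        .filter(|name| states.contains_key(**name))
        .map(|name| name.to_string())
        .collect()
}

// Fixed behavior: list every workload with a reported execution state except
// "Removed", regardless of whether currentState still contains it.
fn list_workloads_fixed(states: &HashMap<String, ExecutionState>) -> Vec<String> {
    let mut names: Vec<String> = states
        .iter()
        .filter(|(_, s)| **s != ExecutionState::Removed)
        .map(|(name, _)| name.clone())
        .collect();
    names.sort();
    names
}
```

With a "hello1" that is already gone from currentState but still reported as Stopping, the buggy variant lists only "nginx" while the fixed variant lists both.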

maturar commented 10 months ago

I have analyzed how the server handles deleting workloads. I can see three scenarios.

  1. Delete the workload as described in the ticket.
  2. Do the same, but with ank set state instead of ank delete workload
  3. Use ank set state, but with the object mask pointing to the workload being deleted.

Case no. 1 in more detail.

The server:

Case no. 2 in more detail. This scenario is the same for the server; it differs only in how the user triggers it in the CLI. In this case the user sends ank set state with the object mask currentState. For the server it is the same kind of request over the interface. This case can be used when the user wants to make more changes in the config, e.g. change or delete several workloads at once.

Case no. 3 in more detail.

requestId: ank-cli
startupState:
  workloads: {}
  configs: {}
  cronJobs: {}
currentState:
  workloads:
    hello1:
      agent: agent_B
      name: hello1
      tags:
      - key: owner
        value: Ankaios team
      dependencies: {}
      updateStrategy: AT_MOST_ONCE
      restart: true
      accessRights:
        allow: []
        deny: []
      runtime: podman
      runtimeConfig: |
        image: alpine:latest
        commandOptions: [ "--rm"]
        commandArgs: [ "echo", "Hello Ankaios"]
    hello2:
      agent: agent_B
      name: hello2
      tags:
      - key: owner
        value: Ankaios team
      dependencies: {}
      updateStrategy: AT_MOST_ONCE
      restart: true
      accessRights:
        allow: []
        deny: []
      runtime: podman
      runtimeConfig: |
        image: alpine:latest
        commandArgs: [ "echo", "Hello Ankaios"]
    nginx:
      agent: agent_A
      name: nginx
      tags:
      - key: owner
        value: Ankaios team
      dependencies: {}
      updateStrategy: AT_MOST_ONCE
      restart: true
      accessRights:
        allow: []
        deny: []
      runtime: podman
      runtimeConfig: |
        image: docker.io/nginx:latest
        commandOptions: ["-p", "8081:80"]
  configs: {}
  cronJobs: {}
workloadStates:
- workloadName: nginx
  agentName: agent_A
  executionState: ExecRunning
- workloadName: hello1
  agentName: agent_B
  executionState: ExecRemoved
- workloadName: hello2
  agentName: agent_B
  executionState: ExecSucceeded
- workloadName: hello-pod
  agentName: agent_B
  executionState: ExecRunning

The difference (compared to the start config) is that the workload hello-pod is deleted.

./ank set state --file updateState.yaml currentState.workloads.hello-pod

The important point is that the object mask refers to the workload hello-pod, which has been removed in the update config. This way we want to delete the workload hello-pod and nothing else. Now the server must do something slightly different compared to the previous two cases.

maturar commented 10 months ago

The key logic described in the previous comment lives in the server in update_state.rs, in the function update_state.

I agree with Kaloyan that the server shall not remove the workload directly, but set the state to stopping instead. The tricky part is that the function update_state can remove a workload in two ways: explicitly and implicitly.

The function deletes the workload explicitly with the code:

    } else if new_state.remove(&field).is_err() {
        return Err(UpdateStateError::FieldNotFound(field.into()));
    }

This code is used in scenario no. 3, when the object mask points to a workload that has been deleted in the update.

The implicit (or rather "silent") way of deleting a workload is done with this code:

    if new_state.set(&field, field_from_update.to_owned()).is_err() {
        return Err(UpdateStateError::FieldNotFound(field.into()));
    }

This code is used in scenarios no. 1 and 2. The object mask is set to currentState (i.e. to the root) and the deleted workload is simply absent from the received complete state (the update).

I still have to think about how to fix the bug reported here while supporting all scenarios described above. It probably means reimplementing both functions in update_state.rs: update_state and prepare_update_workload.
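The two removal paths, together with the proposed "move to stopping" behavior, could be sketched like this. This is a hypothetical miniature assuming a flat map of workload phases instead of the real masked complete state; all names are illustrative.

```rust
use std::collections::HashMap;

// Hypothetical miniature of the two deletion paths in update_state; the real
// code operates on field masks of the complete state, not on a flat map.
#[derive(Debug, Clone, PartialEq)]
enum Phase {
    Running,
    Stopping,
}

type Workloads = HashMap<String, Phase>;

// Explicit path (scenario no. 3): the object mask names one workload. If the
// update no longer contains it, transition it to Stopping instead of removing
// it from the current state right away.
fn apply_with_mask(state: &mut Workloads, update: &Workloads, mask: &str) {
    match update.get(mask) {
        Some(phase) => {
            state.insert(mask.to_string(), phase.clone());
        }
        None => {
            if let Some(phase) = state.get_mut(mask) {
                *phase = Phase::Stopping;
            }
        }
    }
}

// Implicit path (scenarios no. 1 and 2): the mask is the root, so the whole
// workloads section is replaced; anything missing from the update would
// silently disappear, so it is transitioned to Stopping first instead.
fn apply_at_root(state: &mut Workloads, update: &Workloads) {
    let existing: Vec<String> = state.keys().cloned().collect();
    for name in existing {
        if !update.contains_key(&name) {
            state.insert(name, Phase::Stopping);
        }
    }
    for (name, phase) in update {
        state.insert(name.clone(), phase.clone());
    }
}
```

In both paths the workload stays in the state as Stopping; the actual removal would happen later, once the agent confirms the delete.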

maturar commented 9 months ago

Status update: I have discussed the current implementation with Kaloyan. The change here depends on two other issues:

#149 and #156. We have to change the implementation here to be consistent with the changes in the other two pull requests.

In this ticket we have to:

maturar commented 9 months ago

We had another discussion with Kaloyan and Christoph. We have agreed that we shall not change the behavior of the StateChangeCommand::UpdateWorkloadState (as described in the previous comment). The workload shall not be removed from the current state when the container has the autoremove flag. In other words, we shall make the change only in ank get workloads.

krucod3 commented 9 months ago

When a workload disappears from the runtime, it shall be handled with an extra state and not the same way as if Ankaios had deleted the workload. These changes will be made outside of the current PR.
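A minimal sketch of what such an extra state could look like. The variant name Lost and the helper function are purely illustrative assumptions, not the real Ankaios API:

```rust
// Hypothetical execution states; the variant Lost is an illustrative
// assumption for a workload that vanished on the runtime side.
#[derive(Debug, Clone, PartialEq)]
enum ExecState {
    Running,
    Stopping, // Ankaios is deleting the workload
    Removed,  // the delete was confirmed
    Lost,     // the runtime lost the workload without Ankaios deleting it
}

// Map a runtime observation to the state reported to the user.
fn on_runtime_report(found_on_runtime: bool, previous: ExecState) -> ExecState {
    match (found_on_runtime, previous) {
        // The workload vanished although Ankaios never requested a delete,
        // e.g. a container with the autoremove flag exited on its own.
        (false, ExecState::Running) => ExecState::Lost,
        // A vanishing workload during a delete is the expected outcome.
        (false, ExecState::Stopping) => ExecState::Removed,
        (_, state) => state,
    }
}
```

This keeps the "Ankaios deleted it" and "the runtime lost it" cases distinguishable for the user.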

maturar commented 9 months ago

The PR has been merged into main -> closing the ticket.