eclipse-ankaios / ankaios

Eclipse Ankaios provides workload and container orchestration for automotive High Performance Computing (HPC) software.
https://eclipse-ankaios.github.io/ankaios/
Apache License 2.0

Inter-workload dependencies #48

Closed windsource closed 6 months ago

windsource commented 1 year ago

Description

Some workloads must only be started when other workloads are already running. The Ankaios state definition already contains a field to define dependencies for a workload but it has not been implemented yet.

It also needs to be defined what it means when a workload is up and running. Is it sufficient that a container is running or should we also support something like lifecycle hooks in Kubernetes?

Goals

Ankaios shall support dependencies of workloads such that workloads are started after all dependencies have been started.

Final result

Summary

Ankaios enables users to configure dependencies between different workloads. Since the dependencies rely on the workload states, the ExecutionStates were reworked: there are now major states and substates to properly handle the creation and deletion of workloads with inter-workload dependencies.

New execution states (major states):

message ExecutionState {
  string additionalInfo = 1; /// The additional info contains more detailed information from the runtime regarding the execution state.
  oneof ExecutionStateEnum {
    AgentDisconnected agentDisconnected = 2; /// The exact state of the workload cannot be determined, e.g., because of a broken connection to the responsible agent.
    Pending pending = 3; /// The workload is going to be started eventually.
    Running running = 4; /// The workload is operational.
    Stopping stopping = 5; /// The workload is scheduled for stopping.
    Succeeded succeeded = 6; /// The workload has successfully finished its operation.
    Failed failed = 7; /// The workload has failed or is in a degraded state.
    NotScheduled notScheduled = 8; /// The workload is not scheduled to run at any agent. This is signaled by an empty agent in the workload specification.
    Removed removed = 9; /// The workload was removed from Ankaios. This state is used only internally in Ankaios; to the outside world, removed states are simply not reported.
  }
}

And there are new sub states within the proto file as well:

/**
* The workload is going to be started eventually.
*/
enum Pending {
  PENDING_INITIAL = 0; /// The workload specification has not yet been scheduled.
  PENDING_WAITING_TO_START = 1; /// The start of the workload will be triggered once all its dependencies are met.
  PENDING_STARTING = 2; /// Starting the workload was scheduled at the corresponding runtime.
  PENDING_STARTING_FAILED = 8; /// The starting of the workload by the runtime failed.
}

/**
* The workload is operational.
*/
enum Running {
  RUNNING_OK = 0; /// The workload is operational.
}

/**
* The workload is scheduled for stopping.
*/
enum Stopping {
  STOPPING = 0; /// The workload is being stopped.
  STOPPING_WAITING_TO_STOP = 1; /// The deletion of the workload will be triggered once no 'pending' or 'running' workload depending on it exists.
  STOPPING_REQUESTED_AT_RUNTIME = 2; /// This is an Ankaios-generated state returned when the stopping was explicitly triggered by the user and the request was sent to the runtime.
  STOPPING_DELETE_FAILED = 8; /// The deletion of the workload by the runtime failed.
}

/**
* The workload has successfully finished operation.
*/
enum Succeeded {
  SUCCEEDED_OK = 0; /// The workload has successfully finished operation.
}

/**
* The workload has failed or is in a degraded state.
*/
enum Failed {
  FAILED_EXEC_FAILED = 0; /// The workload has failed during operation.
  FAILED_UNKNOWN = 1; /// The workload is in a runtime state not supported by Ankaios. The workload was possibly altered outside of Ankaios.
  FAILED_LOST = 2; /// The workload cannot be found anymore. The workload was possibly altered outside of Ankaios or was auto-removed by the runtime.
}

The running workload state is a bit special: the fact that Podman reports a container as running does not mean that the app deployed with it is also running and ready. What running means exactly must be specified and will be handled separately with #109. For now, Running means that the container was started and the runtime says it is running. Later this will be extended to other health checks.
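
As an illustration of this interpretation, here is a minimal Rust sketch (purely hypothetical type and state names, not the actual Ankaios code) of how a runtime-reported container state could be mapped to the new major states:

#[derive(Debug)]
enum ExecutionState {
    Pending(Pending),
    Running(Running),
    Succeeded(Succeeded),
    Failed(Failed),
}

#[derive(Debug)]
enum Pending { Starting }
#[derive(Debug)]
enum Running { Ok }
#[derive(Debug)]
enum Succeeded { Ok }
#[derive(Debug)]
enum Failed { ExecFailed, Lost }

/// Illustrative container states as a runtime like Podman might report them.
enum ContainerState {
    Created,
    Running,
    Exited { exit_code: i32 },
    Unknown,
}

fn map_container_state(state: ContainerState) -> ExecutionState {
    match state {
        ContainerState::Created => ExecutionState::Pending(Pending::Starting),
        // "Running" currently only means the runtime reports the container as
        // running; application-level readiness is out of scope for now (#109).
        ContainerState::Running => ExecutionState::Running(Running::Ok),
        ContainerState::Exited { exit_code: 0 } => ExecutionState::Succeeded(Succeeded::Ok),
        ContainerState::Exited { .. } => ExecutionState::Failed(Failed::ExecFailed),
        ContainerState::Unknown => ExecutionState::Failed(Failed::Lost),
    }
}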

These dependencies can be of two types: explicit and implicit.

Explicit dependencies are configured by the user within a workload's configuration. Ankaios considers these dependencies when starting workloads, ensuring they only start when all dependencies are met. Users can define dependency types such as running, succeeded, or failed.

The so-called AddConditions were added to the Ankaios.proto:

/**
* An enum type describing the expected workload state. Used for dependency management.
*/
enum AddCondition {
  ADD_COND_RUNNING = 0; /// The workload is operational.
  ADD_COND_SUCCEEDED = 1; /// The workload has successfully exited.
  ADD_COND_FAILED = 2; /// The workload has exited with an error or could not be started.
}
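
To illustrate how these conditions are meant to be evaluated on the add path, here is a rough Rust sketch (hypothetical types, not the actual Ankaios implementation): a workload may only be started once every configured dependency has reached the state required by its AddCondition; unknown dependencies count as not fulfilled.

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum WorkloadState { Pending, Running, Succeeded, Failed }

#[derive(Clone, Copy)]
enum AddCondition { Running, Succeeded, Failed }

fn condition_fulfilled(condition: AddCondition, state: WorkloadState) -> bool {
    match condition {
        AddCondition::Running => state == WorkloadState::Running,
        AddCondition::Succeeded => state == WorkloadState::Succeeded,
        AddCondition::Failed => state == WorkloadState::Failed,
    }
}

/// Returns true only if every (dependency, AddCondition) pair is fulfilled.
fn ready_to_start(
    dependencies: &HashMap<String, AddCondition>,
    current_states: &HashMap<String, WorkloadState>,
) -> bool {
    dependencies.iter().all(|(name, condition)| {
        current_states
            .get(name)
            .map_or(false, |state| condition_fulfilled(*condition, *state))
    })
}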

Implicit dependencies are defined internally by Ankaios to prevent workloads from failing or entering undesired states when a dependency is deleted. These dependencies are automatically set and cannot be configured by the user. Ankaios does not stop the workloads that depend on a deleted dependency; it delays the delete until the dependent workloads have reached the workload state matching the delete condition.

The proto file also contains an internal enum for specifying DeleteConditions:

/**
* An enum type describing the conditions for deleting a workload. Used for dependency management, and update strategies.
*/
enum DeleteCondition {
  DEL_COND_RUNNING = 0; /// The workload is operational.
  DEL_COND_NOT_PENDING_NOR_RUNNING = 1; /// The workload is not scheduled or running.
}

Ankaios ensures that manifests and workload configurations don't have cyclic dependencies, forming a directed acyclic graph. It also handles cases where workloads have dependencies that currently don't exist in the Ankaios state. In this case the workload creation is delayed.
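
A minimal sketch of such an acyclicity check, assuming the dependency graph is given as a map from workload name to the names of its dependencies (illustrative code, not the actual Ankaios implementation):

use std::collections::{HashMap, HashSet};

fn has_cycle(dependencies: &HashMap<String, Vec<String>>) -> bool {
    fn visit(
        node: &str,
        dependencies: &HashMap<String, Vec<String>>,
        visiting: &mut HashSet<String>,
        finished: &mut HashSet<String>,
    ) -> bool {
        if finished.contains(node) {
            return false;
        }
        if !visiting.insert(node.to_owned()) {
            return true; // node is already on the current DFS path -> cycle
        }
        for dep in dependencies.get(node).into_iter().flatten() {
            if visit(dep, dependencies, visiting, finished) {
                return true;
            }
        }
        visiting.remove(node);
        finished.insert(node.to_owned());
        false
    }

    let mut visiting = HashSet::new();
    let mut finished = HashSet::new();
    dependencies
        .keys()
        .any(|node| visit(node, dependencies, &mut visiting, &mut finished))
}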

Tasks

windsource commented 1 year ago

Interesting article about dependency management in systemd: https://unix.stackexchange.com/questions/331693/how-can-a-systemd-service-flag-that-it-is-ready-so-that-other-services-can-wait

windsource commented 1 year ago

Here is an article on how to Configure Liveness, Readiness and Startup Probes in Kubernetes.

krucod3 commented 9 months ago

General idea

There are two sub-use-cases when handling the dependencies:

In order to directly start with the dependency management, the server shall also first send the workload states before sending the list of desired workloads (i.e., reverse the current order).

We also should think about circular dependencies. If we have one, we should reject the configuration.

Standard case (fresh agent)

The standard case is relatively straightforward. We can build queues (HashMap) for workloads that don't have their dependencies met and look through the queues every time an UpdateWorkloadState message arrives.
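
A rough sketch of that simple approach (illustrative Rust, hypothetical type names; the concrete AddCondition check per dependency is omitted for brevity):

use std::collections::{HashMap, HashSet};

struct WaitingWorkload {
    unmet_dependencies: HashSet<String>, // names of dependencies not yet in the expected state
}

struct DependencyScheduler {
    waiting: HashMap<String, WaitingWorkload>, // workload name -> waiting entry
}

impl DependencyScheduler {
    /// Called for every incoming UpdateWorkloadState: remove the reported
    /// workload from all unmet-dependency sets and return every workload
    /// that now has no unmet dependencies left so the caller can start it.
    fn on_workload_state_update(&mut self, reported_workload: &str) -> Vec<String> {
        for entry in self.waiting.values_mut() {
            entry.unmet_dependencies.remove(reported_workload);
        }
        let ready: Vec<String> = self
            .waiting
            .iter()
            .filter(|(_, entry)| entry.unmet_dependencies.is_empty())
            .map(|(name, _)| name.clone())
            .collect();
        for name in &ready {
            self.waiting.remove(name);
        }
        ready
    }
}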

Reusing workloads

This case is more complicated. If there are workloads running, we have to take care of their dependencies too. For each found workload that is supposed to run, see if the dependencies are met:

Summary

All in all, the simple approach is probably good enough, taking into account that no more than 100 workloads will be executed on a node.

lingnoi commented 9 months ago

I think an open point is also "how to define/configure the inter dependencies of the workloads concretely"?

krucod3 commented 9 months ago

@lingnoi: the current definition is, per workload, a list of dependencies where each dependency has a workload name and an execution state. Do you have some other ideas here?

krucod3 commented 9 months ago

To not make things over-complicated we need only the cycle detection for the server and could add something like a reverse dependency list: a hash table which stores per workload name a list of workloads that have dependencies on that workload.
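
A small sketch of building such a reverse dependency list (illustrative only): for every workload name we store the workloads that depend on it, so a state update only touches the affected entries instead of requiring a scan over all waiting workloads.

use std::collections::HashMap;

fn build_reverse_dependencies(
    dependencies: &HashMap<String, Vec<String>>, // workload -> its dependencies
) -> HashMap<String, Vec<String>> {
    let mut dependents: HashMap<String, Vec<String>> = HashMap::new();
    for (workload, deps) in dependencies {
        for dep in deps {
            dependents
                .entry(dep.clone())
                .or_default()
                .push(workload.clone());
        }
    }
    dependents // dependency name -> workloads that depend on it
}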

inf17101 commented 9 months ago

I would go with the simple approach first and implement the cycle detection to check for invalid states, but without the whole graph implementations.

We first test the config for cycles. When we receive new workload states from the state checker, we can go through this queue again and check whether all of the workloads specified in the dependency lists are now running (talking about the current default behavior); if so, we can start the workload.

I am just thinking about whether an UpdateWorkload would imply some action as well when dependency management comes into play. If we have a workload A running with dependencies B and C and someone updates the dependency B via the Ankaios CLI, shall we do an action, too? Maybe the update breaks workload A. Or do we want to say it is a user failure if the update crashes workload A in this case? (Just an additional thought to consider)

inf17101 commented 9 months ago

> To not make things over-complicated we need only the cycle detection for the server and could add something like a reverse dependency list: a hash table which stores per workload name a list of workloads that have dependencies on that workload.

I was checking the WorkloadSpec and all the information we need is already inside. I was thinking about using an Rc or Arc, maybe put inside a VecDeque, to reuse the existing specs because they are already pushed into a data structure per runtime. Maybe we can use just a reference, no need to store the same information again. A VecDeque is more suitable for queue-like data structures compared to HashMaps with the key/hash calculations.
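
A tiny illustration of that idea, with WorkloadSpec as a stand-in type and a hypothetical workload name; the per-runtime structure and the waiting structure share the same spec via Arc instead of storing copies:

use std::collections::VecDeque;
use std::sync::Arc;

struct WorkloadSpec {
    name: String,
    dependencies: Vec<String>,
}

fn main() {
    let spec = Arc::new(WorkloadSpec {
        name: "example_workload".into(), // hypothetical name
        dependencies: vec!["example_dependency".into()],
    });

    // Both structures reference the same allocation, no duplicated spec data.
    let per_runtime: Vec<Arc<WorkloadSpec>> = vec![Arc::clone(&spec)];
    let mut waiting: VecDeque<Arc<WorkloadSpec>> = VecDeque::new();
    waiting.push_back(Arc::clone(&spec));

    assert_eq!(per_runtime[0].name, waiting[0].name);
    println!("{} waits for {} dependencies", spec.name, spec.dependencies.len());
}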

krucod3 commented 9 months ago

> I would go with the simple approach first and implement the cycle detection to check for invalid states, but without the whole graph implementations.

> We first test the config for cycles. When we receive new workload states from the state checker, we can go through this queue again and check whether all of the workloads specified in the dependency lists are now running (talking about the current default behavior); if so, we can start the workload.

This can be optimized if we can search for the workload that now has a new state, e.g., in a hash map.

> I am just thinking about whether an UpdateWorkload would imply some action as well when dependency management comes into play. If we have a workload A running with dependencies B and C and someone updates the dependency B via the Ankaios CLI, shall we do an action, too? Maybe the update breaks workload A. Or do we want to say it is a user failure if the update crashes workload A in this case? (Just an additional thought to consider)

You are completely right here; we also need to properly handle updates and shutdowns in reverse order. This actually means that the server would also need to do some extra work of calculating delete operations for other workloads that have dependencies on the updated/deleted one.

krucod3 commented 9 months ago

> To not make things over-complicated we need only the cycle detection for the server and could add something like a reverse dependency list: a hash table which stores per workload name a list of workloads that have dependencies on that workload.

> I was checking the WorkloadSpec and all the information we need is already inside. I was thinking about using an Rc or Arc, maybe put inside a VecDeque, to reuse the existing specs because they are already pushed into a data structure per runtime. Maybe we can use just a reference, no need to store the same information again. A VecDeque is more suitable for queue-like data structures compared to HashMaps with the key/hash calculations.

Queues are great for preserving order or if no search operations are needed. The idea to have "a hash table which stores per workload name a list of workloads that have dependencies on that workload" is that I have as a key a workload name and as a value all its dependents. When we get an update workload state, we can do a lookup of the workloads we should take care of instead of making an exhaustive search.

krucod3 commented 9 months ago

As it seems, we don't have any technical bottlenecks that would play a major role in the design of the feature. Even if we need to build a spanning forest of the dependency graph, the implementation wouldn't be that complicated and is not as such blocking in any way. I'll start collecting use-cases to completely clarify the problem space now. After that we can do it the test-driven way and directly write some system (robot) tests covering the use-cases. We can then write the major design points down as requirements and after that start thinking about exact technologies for the implementation.

inf17101 commented 9 months ago

> To not make things over-complicated we need only the cycle detection for the server and could add something like a reverse dependency list: a hash table which stores per workload name a list of workloads that have dependencies on that workload.

> I was checking the WorkloadSpec and all the information we need is already inside. I was thinking about using an Rc or Arc, maybe put inside a VecDeque, to reuse the existing specs because they are already pushed into a data structure per runtime. Maybe we can use just a reference, no need to store the same information again. A VecDeque is more suitable for queue-like data structures compared to HashMaps with the key/hash calculations.

> Queues are great for preserving order or if no search operations are needed. The idea to have "a hash table which stores per workload name a list of workloads that have dependencies on that workload" is that I have as a key a workload name and as a value all its dependents. When we get an update workload state, we can do a lookup of the workloads we should take care of instead of making an exhaustive search.

Ok, I thought order is needed because we wanted to have a "queue" like described above, and we could then do a simple push/pop without having the hash stuff on top. But it's fine for me... I think we can discuss the implementation details later.

krucod3 commented 9 months ago

I understand the confusion now, queue as a place where elements are waiting, not as a data structure.

krucod3 commented 9 months ago

Initial list of use-cases we should consider during the development

For all use-cases we should consider both the case where the workloads are on the same node and the case where they are on different nodes.

lingnoi commented 9 months ago

> @lingnoi: the current definition is, per workload, a list of dependencies where each dependency has a workload name and an execution state. Do you have some other ideas here?

No, that's fine, thanks!

inf17101 commented 9 months ago

I have done some research about how systemd handles dependency management. In summary, it offers fine-granular dependency management with a lot of options. There are two primary dependency management topics: requirement dependencies and ordering dependencies.

Requirement dependencies:

Ordering dependencies: Before=/After= specify the order in which units should start or stop in relation to each other.

When only specifying requirement dependencies without an explicit ordering dependency, systemd assumes the best and the services are booted in parallel (and the action of the keyword is applied). If we set an ordering dependency in addition, then of course startup follows that order.

But there are more fine-granular features which can be used when choosing a combination of those dependency management tools. As an example, when using Requires= and a unit on the right-hand side is explicitly stopped (for example through systemctl stop), then the defining unit is also stopped. This means systemd indeed has a more complex and fine-granular startup/shutdown behavior compared to just "basic" dependency management.

Here is a blog with examples: https://seb.jambor.dev/posts/systemd-by-example-part-2-dependencies/

I think we need to choose the scenarios that fit automotive use cases. There is no need to support all possible combinations. Hard dependencies are primarily needed, and then we need to discuss what we need on top.

krucod3 commented 9 months ago

For hard dependencies A -> B we would need to stop A before stopping B. To make things a bit simpler for the server, we could internally write to B that it is needed by A. This way the agent can decide alone that it cannot stop B before A is gone. Edit: In the current interface this is handled by the server by sending a list of delete dependencies with the delete message. This workflow will work too, but if we implement an ordered shutdown, the server would need to send delete messages for all workloads incl. their dependencies instead of one shutdown with all_workloads=true.
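
A sketch of the check needed for such a delayed delete, assuming the reverse dependency list discussed above is available (illustrative only, not the actual interface): the delete of a workload is only forwarded to the runtime once no dependent workload is pending or running anymore.

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum WorkloadState { Pending, Running, Succeeded, Failed }

fn delete_allowed(
    workload: &str,
    dependents: &HashMap<String, Vec<String>>, // reverse dependency list
    current_states: &HashMap<String, WorkloadState>,
) -> bool {
    dependents
        .get(workload)
        .into_iter()
        .flatten()
        .all(|dependent| {
            !matches!(
                current_states.get(dependent),
                Some(WorkloadState::Pending) | Some(WorkloadState::Running)
            )
        })
}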

krucod3 commented 9 months ago

I now also reviewed the basic configuration of crinit and really like the ideas there.

Considering the behavior of systemd and crinit, we can do the following to support both hard and soft dependencies:

First, we can extend the ExpectedState in the proto API with the following 2 values:

This will allow us to support the following use-cases:

If we need it, we can also add the following dependency types later:

To be able to handle all dependencies properly, we would also need a transitional stopping state for the workload execution states. There is currently also a bug (#123) about not handling Podman states correctly. It would be best to add the new execution state as a feature in a PR for this issue and after that fix the Podman states by mapping to the new state.

inf17101 commented 9 months ago

I like the idea to use hard and soft dependencies. I would not do more initially, because the other, more complex combinations offered by the services we have checked (systemd, crinit) do not fit the automotive use case very well at the moment. Here we can learn in the future if something is needed in addition. But as a start I would go with the hard/soft dependencies with priority on hard (like you said). The dependency states starting and running are good; for failed I think we do not need it.

inf17101 commented 7 months ago

I am working on the Agent implementation of inter-workload dependencies now, meaning the waiting lists for added and deleted workloads. I have checked out a new branch and started with the system tests for the dependencies first.

inf17101 commented 7 months ago

Shall we go with the following on agent restarts?

If agents are restarted and the reuse workload procedure must run, the dependencies shall not be considered when the runtime config of the workloads has not changed. In this case we resume the workloads.

If agents are restarted and the runtime config of a workload has changed, we must do a replace. In this case we consider the dependencies of that workload. If all execution states are fulfilled for that workload, we can immediately replace the workload. If not, we put it on the internal waiting queue and wait until all dependencies are fulfilled. Considering the dependencies for changed workloads implies the following: if the workload has a dependency on an existing (reusable) workload with expected execution state succeeded, for example, but this workload is in the wrong state, for example failed, then there are two possibilities:

christoph-hamm commented 7 months ago

I think we should not add extra logic to restart the dependency. But I am also wondering if this problem exists at all. If the dependency also exists (is a reusable workload), the agent should restart this dependency anyhow.

inf17101 commented 7 months ago

> I think we should not add extra logic to restart the dependency. But I am also wondering if this problem exists at all. If the dependency also exists (is a reusable workload), the agent should restart this dependency anyhow.

If this dependency exists but is in another, not expected state, we resume it (reusing procedure). This means just the state checker is started and it would report the failed state mentioned in the example above. In the reusing feature we do not restart workloads in general: we replace a workload if the runtime config has changed, otherwise we resume it. Meaning the behavior you mentioned can only happen if the dependency's runtime config has changed, too.

But if the runtime config has not changed and the dependency is just resumed, then the workload depending on it would stay in the queue forever, unless the user takes action and changes something in the dependency so that it runs again and goes into the succeeded execution state.

I would agree with your suggestion that the agent shall not care about this dependency in this case, because this would add more complexity to the reusing feature in combination with the inter-workload dependencies feature. In this case the user shall handle it.

inf17101 commented 7 months ago

Ok, I have implemented the dependency handling in the reusing feature as follows:

If an agent is restarted, the startup state could contain: workloads with no runtime config change (unchanged workloads), workloads with runtime config changes (changed workloads), and new workloads, and it can contain fewer workloads than before the restart.

The unchanged workloads are resumed and the dependencies are not considered. We just start the state checker but do nothing else with those existing workloads => Resume!

For the changed workloads the dependencies are considered. Here are two sub cases:

  1. The workload has dependencies on workloads managed by outside agents:
    • if all dependencies are fulfilled the workload is replaced immediately
    • if not it is put on the waiting queue
  2. The workload has at least one dependency managed by the same agent:
    • the workload is put on the waiting queue

The dependencies are considered for new workloads just like without the reusing feature. If a workload has dependencies, it is put on the waiting queue and only started once all its dependencies are fulfilled. Workloads without dependencies are started immediately.
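
As a compact illustration of the decision flow described above (hypothetical enum and function names, not the actual agent code):

enum StartupWorkload {
    Unchanged, // runtime config identical to before the restart
    Changed {
        all_deps_on_other_agents_fulfilled: bool,
        has_local_dependency: bool,
    },
    New {
        has_unfulfilled_dependencies: bool,
    },
}

enum Action {
    Resume,            // only start the state checker, keep the workload
    ReplaceNow,        // delete and recreate immediately
    PutOnWaitingQueue, // wait until all dependencies are fulfilled
    StartNow,
}

fn decide(workload: StartupWorkload) -> Action {
    match workload {
        StartupWorkload::Unchanged => Action::Resume,
        // Workload states of dependencies on the same agent are not available
        // at this point (see the interface limitation below), so we wait.
        StartupWorkload::Changed { has_local_dependency: true, .. } => Action::PutOnWaitingQueue,
        StartupWorkload::Changed { all_deps_on_other_agents_fulfilled: true, .. } => Action::ReplaceNow,
        StartupWorkload::Changed { .. } => Action::PutOnWaitingQueue,
        StartupWorkload::New { has_unfulfilled_dependencies: true } => Action::PutOnWaitingQueue,
        StartupWorkload::New { .. } => Action::StartNow,
    }
}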

For sure, we ask ourselves why we cannot immediately replace a workload having a dependency managed by the same agent (point 2 above). The reason is that the existing interface giving the RuntimeManager the workloads existing on the runtime does not deliver the current workload states; it returns only the instance names. That was fine without the dependency feature. To change that, the interface must be changed internally (relatively low effort), the concrete implementation of that interface in the podman runtime connector must be changed (relatively low effort) and the concrete implementation of that interface in the podman kube runtime connector must be changed (relatively high effort => we need to consider the volumes used for remembering config and specs).

So there are three possibilities:

  1. We keep the current implementation. That we cannot replace a workload directly when it has dependencies on the same agent is not a problem in general: once the execution state of its dependency turns to the expected one, the workload is started.
  2. We decide to simplify the dependency handling in combination with the reusing feature
  3. We go with the change of the interface which has higher efforts.

I am also open to another way.

inf17101 commented 6 months ago

Since the summary no longer fit 100% to the actual implementation and the past discussions, I have updated it and tried to include everything that is still important and suitable for the context.