thehunmonkgroup commented 1 year ago

PITCH: Resource Manager for ACE Framework

Problem

The Autonomous Cognitive Entity (ACE) framework is a collection of resources designed to function together. However, the framework specification provides no specific implementation for how these resources are managed. This includes starting the ACE, monitoring the components for failure, recovering from failures, and stopping the ACE.

Appetite

The team is prepared to invest approximately two weeks and 20-30 total man-hours to address this problem.

Solution

The proposed solution involves creating a 'Resource Manager', a long-running Python process. This process will treat ACE components as 'resources'. Each resource will be managed by a Python agent that operates in a manner similar to the Open Cluster Framework (OCF), specifically implementing its core 'start', 'stop', and 'monitor' methods. The operations are as follows:

Step 1: Initiate the Resource Manager. It calls the 'start' operation on all resource agents. The order is determined by a resource dependency graph, stored in a static configuration such as a YAML file.
Step 2: Once all resources are started, the Resource Manager enters a monitoring loop. It calls the 'monitor' operation on one resource agent at a time, moving from the least dependent to the most dependent resource as per the dependency graph.
Step 3: If a monitor returns a failure, the Resource Manager takes corrective action. It calls the 'stop' operation, then the 'start' operation on the agent of the failed resource and its dependent resources.
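
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a proposed implementation: the resource names, the `ResourceAgent` and `ResourceManager` classes, and the in-memory `DEPENDENCIES` dict (standing in for the YAML file) are all hypothetical, and the agents only flip a flag instead of managing real components. It assumes Python 3.9+ for the standard library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each resource maps to the resources it depends on.
# The pitch suggests loading this from a static config such as a YAML file.
DEPENDENCIES = {
    "bus": [],
    "layer_1": ["bus"],
    "layer_2": ["layer_1"],
    "logger": [],
}


class ResourceAgent:
    """Illustrative agent with OCF-style start/stop/monitor operations."""

    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def monitor(self):
        # True means healthy; a real agent would probe the actual component.
        return self.running


class ResourceManager:
    def __init__(self, dependencies):
        self.dependencies = dependencies
        # static_order() lists every resource after the resources it depends on.
        self.order = list(TopologicalSorter(dependencies).static_order())
        self.agents = {name: ResourceAgent(name) for name in self.order}

    def start_all(self):
        # Step 1: start resources in dependency order.
        for name in self.order:
            self.agents[name].start()

    def affected_by(self, failed):
        # The failed resource plus everything that depends on it, directly or
        # transitively. One pass suffices because self.order is topological.
        affected = {failed}
        for name in self.order:
            if affected.intersection(self.dependencies.get(name, ())):
                affected.add(name)
        return [name for name in self.order if name in affected]

    def recover(self, failed):
        # Step 3: stop the affected resources (dependents first), then start
        # them again in dependency order.
        affected = self.affected_by(failed)
        for name in reversed(affected):
            self.agents[name].stop()
        for name in affected:
            self.agents[name].start()
        return affected

    def monitor_pass(self):
        # Step 2: check one resource at a time, least dependent first.
        recovered = []
        for name in self.order:
            if not self.agents[name].monitor():
                recovered.extend(self.recover(name))
        return recovered


manager = ResourceManager(DEPENDENCIES)
manager.start_all()
manager.agents["layer_1"].running = False  # simulate a failure
print(manager.monitor_pass())
```

A long-running manager would call `monitor_pass()` repeatedly with a sleep between passes. Stopping dependents before the failed resource and restarting them afterwards is one reasonable reading of Step 3; the pitch itself does not fix the order within the affected set.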

Rabbit Holes

Given that the initial ACE is an MVP, the team should not be overly concerned with issues of scaling and performance, beyond those necessary for an individual user of the MVP to have a good experience.

No-gos

In order to maintain focus on the core problem and keep the solution manageable, the following aspects will not be included in this initial implementation:

Distributed resource management: The resource manager will not attempt to manage resources across a distributed system or network. It will focus solely on resources within a single ACE.
Automatic scaling or load balancing: While these are common features in robust resource managers, they won't be part of this initial solution.
Advanced failure recovery strategies: The resource manager will have a simple strategy of stopping and starting a failed resource and the resources that depend on it. More advanced strategies, like resource migration or failover, will not be considered at this stage.

rburgmann commented 1 year ago

Just looked at a few open source solutions, such as Apache Helix and Mesos, which may be of use. I think leveraging an existing solution rather than rolling your own might be the better approach. Docker or Kubernetes may also be useful if you abstract the layers to individual virtual nodes.

daveshap commented 1 year ago

I like this, and I think it should be integrated into the OOB "security overlay", e.g. "System Integrity", responsible for security, operational up status, and so on. That's my opinion; there are plenty of ways to implement it. I'll create an updated diagram to represent what I mean.

daveshap commented 1 year ago

@thehunmonkgroup thanks for all your thoughts on this. I have revamped the "security" layer into the "system integrity" layer and added plenty of insights: https://github.com/daveshap/ACE_Framework/blob/main/ACE_Framework.md#system-integrity

Great discussions.

thehunmonkgroup commented 1 year ago

@rburgmann thanks for the suggestions!

You may want to have a look at https://github.com/daveshap/ACE_Framework/blob/main/agile.md -- knowing that will help guide your thinking in these earlier, MVP stages of the project. Right now, anything that's not Python, and/or not dead easy to install and set up is probably going to be avoided.

Long term, or for someone taking a run at a serious production ACE now, I definitely think a robust resource manager like the ones you mentioned is essential. My personal favorite is Corosync/Pacemaker.

thehunmonkgroup commented 9 months ago

We're not using the PITCH structure anymore, closing.
