OpenCHAMI / roadmap

Public Roadmap Project for Ochami

[RFD] Federation/Orchestration of Failure Domains for Cell-Based deployment #41

Open alexlovelltroy opened 1 month ago

alexlovelltroy commented 1 month ago

OpenCHAMI Multi-Instance Management and Failure Domain Concept

This proposal introduces an enhancement to the OpenCHAMI deployment model, adding a federated deployment option alongside the existing standalone deployment. The federated model allows for managing multiple OpenCHAMI instances through a central orchestrator, centralizing certificate management, authentication, and authorization in a loosely coupled way that ensures continuity of service even when connectivity is compromised. It also proposes a new API focused on higher-level node management, abstracting away from lower-level component details.

Deployment Models

Standalone Deployment

The current deployment model involves deploying all OpenCHAMI components—microservices and supporting systems—within a single instance. This approach remains viable for simpler or smaller environments, providing a straightforward and integrated solution.

Federated Deployment

The new federated deployment model enables the management of multiple OpenCHAMI instances via a remote orchestrator. This model centralizes key functions such as certificate management, authentication, and authorization. It also introduces a higher-level API for managing nodes rather than individual components, offering enhanced scalability and flexibility for larger environments.
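One way to picture a node-level request reaching several instances is to group the target nodes by the instance that owns them. The following is a minimal sketch, not a proposed API: the node names, the `cell-*` instance names, and the `partition_by_domain` helper are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def partition_by_domain(node_to_domain: Dict[str, str],
                        targets: Iterable[str]) -> Dict[str, List[str]]:
    """Group target nodes by the OpenCHAMI instance that owns them.

    `node_to_domain` is a hypothetical inventory mapping node name to
    instance name. Unknown nodes are rejected up front so that no instance
    receives a partially valid request.
    """
    targets = list(targets)
    unknown = [node for node in targets if node not in node_to_domain]
    if unknown:
        raise KeyError(f"nodes not registered with any instance: {unknown}")

    batches: Dict[str, List[str]] = defaultdict(list)
    for node in targets:
        batches[node_to_domain[node]].append(node)
    return dict(batches)


# Hypothetical inventory: two nodes in cell-a, one in cell-b.
inventory = {"node-001": "cell-a", "node-002": "cell-a", "node-003": "cell-b"}
batches = partition_by_domain(inventory, ["node-001", "node-003"])
```

Each resulting batch can then be sent to exactly one instance, which keeps the request scoped to the instances that actually own the targeted nodes.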

OpenCHAMI Failure Domains

In the federated deployment model, each OpenCHAMI instance is considered a "failure domain"—the scope within which changes are applied and failures are contained. The orchestrator coordinates with each failure domain to manage changes and collect results, ensuring that the impact of any single change is limited to its specific domain.
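The coordination described above can be sketched as a fan-out in which every domain is contacted independently and a failure in one domain is recorded rather than propagated. This is an illustration under stated assumptions: the callables stand in for each instance's API, the domain names are hypothetical, and a standard-library thread pool is used here in place of any real workflow engine.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DomainResult:
    """Outcome of applying one change inside one failure domain."""
    domain: str
    ok: bool
    detail: str


def apply_everywhere(domains: Dict[str, Callable[[str], str]],
                     change: str) -> Dict[str, DomainResult]:
    """Apply `change` to every domain in parallel and collect the results."""

    def run(name: str) -> DomainResult:
        try:
            return DomainResult(name, True, domains[name](change))
        except Exception as exc:
            # Contain the failure: record it for this domain only.
            return DomainResult(name, False, str(exc))

    with ThreadPoolExecutor(max_workers=max(1, len(domains))) as pool:
        return {result.domain: result for result in pool.map(run, domains)}


def healthy(change: str) -> str:
    """Hypothetical stand-in for an instance that accepts the change."""
    return f"applied {change}"


def broken(change: str) -> str:
    """Hypothetical stand-in for an instance that fails."""
    raise RuntimeError("instance unreachable")


results = apply_everywhere(
    {"cell-a": healthy, "cell-b": broken, "cell-c": healthy},
    "boot-image-v2",
)
```

In this sketch the failure of `cell-b` appears only in its own `DomainResult`, and the other two domains complete normally.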

Cell-Based Architecture and Failure Containment

Cell-Based Architecture is a design paradigm aimed at containing failures within a distributed system. Unlike other architectures that strive for a globally consistent view of resources, cell-based architecture focuses on isolating data and functionality within independent cells. This design confines failures to the affected cell, allowing for localized management and minimizing the impact on the overall system.

Principles of Cell-Based Architecture

Cell-Based Architecture operates on several key principles:

Benefits of Cell-Based Architecture

The cell-based approach offers several benefits:

Examples of Cell-Based Architecture

Cell-based architecture can be applied in various scenarios:

Application to OpenCHAMI’s Failure Domains

For OpenCHAMI, the failure domain concept aligns with cell-based architecture principles. Each OpenCHAMI instance manages a self-contained failure domain, operating as a discrete cell within the larger system. This design provides:

By adopting cell-based architecture, OpenCHAMI enhances its ability to manage large, complex environments, ensuring robust failure containment and improved system resilience.

OpenCHAMI Orchestration and Federation Services

The OpenCHAMI Federation Services will include centralized third-party services to support federation and manage failure domains under orchestration. Each failure domain must be capable of operating independently if federation services become unavailable, with appropriate caching mechanisms implemented at the domain level.
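A domain-level cache of this kind could serve the last known good value when the central service cannot be reached. The sketch below assumes a single cached value (for example a policy document or signing-key set), a hypothetical `fetch` callable that raises `ConnectionError` on outage, and an injectable clock for testing. None of these names come from OpenCHAMI.

```python
import time
from typing import Callable, Optional, Tuple


class FederationCache:
    """Domain-local cache that falls back to stale data during an outage."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> Tuple[str, bool]:
        """Return (value, is_stale)."""
        now = self._clock()
        if self._value is not None and now - self._fetched_at < self._ttl:
            return self._value, False  # still fresh, no network call
        try:
            self._value = self._fetch()
            self._fetched_at = now
            return self._value, False
        except ConnectionError:
            if self._value is None:
                raise  # nothing cached yet, so there is nothing to serve
            return self._value, True  # central service down: serve stale
```

The `is_stale` flag lets the caller decide how long stale data is acceptable. A real deployment would likely bound that window, especially for authorization data.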

Key services include:

Key Areas of Experimentation

To ensure the federated model’s effectiveness, exploration is needed in four key areas:

  1. Registration and Secure Communication

    The registration process must be defined to establish connections between OpenCHAMI instances and the orchestration service. Considerations include the need for additional services within each domain to facilitate registration and communication, the potential use of lightweight VPN solutions like WireGuard for added security alongside TLS, and the extension of OIDC services to enable secure, two-way communication with TPMs to ensure system integrity.

  2. Coordinating Actions Across Domains

    Coordinating actions triggered by the orchestrator and executed in parallel across multiple domains requires evaluating off-the-shelf workflow management systems, such as Temporal. These systems should be assessed for their ability to handle the scale and complexity of the required tasks.

  3. Orchestration API Design

    The Orchestration API must support managing nodes, handling certificate operations, and coordinating actions across failure domains. Its design should follow user requirements and allow efficient, secure management of the federated deployment.

  4. Severability

    Each failure domain must be able to function in isolation, and administrators must be able to take corrective action within it when the orchestration manager is unreachable. As in other distributed systems, troubleshooting and remediation are most needed precisely when the overall system is in an unknown state. See Bronson, Charapko, Aghayev, and Zhu.
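One possible shape for severability is a domain-local journal: corrective actions taken while the orchestrator is unreachable are applied locally right away, recorded, and reported once connectivity returns. This is an assumption for illustration, not part of the proposal. `LocalJournal` and the `report` callable are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LocalJournal:
    """Records admin actions taken inside one domain while disconnected."""
    pending: List[str] = field(default_factory=list)

    def record(self, action: str) -> None:
        """Note an action that has already been applied locally."""
        self.pending.append(action)

    def sync(self, report: Callable[[str], None]) -> int:
        """Report pending actions in order and return how many were sent.

        Stops at the first ConnectionError so unsent entries stay queued
        and ordering is preserved for the next attempt.
        """
        sent = 0
        while self.pending:
            try:
                report(self.pending[0])
            except ConnectionError:
                break
            self.pending.pop(0)
            sent += 1
        return sent
```

The local action never waits on the orchestrator here. Only the reporting is deferred, so remediation stays possible during an outage.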

Conclusion

The proposed federated deployment model for OpenCHAMI, together with continued support for standalone deployments, provides enhanced scalability, security, and management flexibility. Addressing the key areas of registration and secure communication, parallel action coordination, API design, and severability will help OpenCHAMI deliver a robust solution for managing complex, large-scale environments.

stradling commented 2 weeks ago

Per the meeting comments: please note severability as a point of experimentation. Failure domains must be independently operable when the central manager is in a bad way.