Open alexlovelltroy opened 3 months ago
Per the meeting comments: please note severability as a point of experimentation. Failure domains must be independently operable when the central manager is in a bad way.
(All views expressed are my own. If anything, they originate from my role as an OpenCUBE developer.) Again, I think this RFD should possibly be split:
In general, there is some confusion for me around Federation vs. Orchestration vs. Multi-Instance Management vs. Multi-Tenancy. It would be great if we could pin down the immediate use cases and distill the required common functionality from them, to avoid re-engineering later.
OpenCHAMI Multi-Instance Management and Failure Domain Concept
This proposal introduces an enhancement to the OpenCHAMI deployment model, adding a federated deployment option alongside the existing standalone deployment. The federated model enables management of multiple OpenCHAMI instances through a central orchestrator, centralizing certificate management, authentication, and authorization in a loosely coupled way that preserves continuity of service even when connectivity to the orchestrator is compromised. It also proposes a new API focused on higher-level node management, abstracting away from lower-level component details.
Deployment Models
Standalone Deployment
The current deployment model involves deploying all OpenCHAMI components—microservices and supporting systems—within a single instance. This approach remains viable for simpler or smaller environments, providing a straightforward and integrated solution.
Federated Deployment
The new federated deployment model enables the management of multiple OpenCHAMI instances via a remote orchestrator. This model centralizes key functions such as certificate management, authentication, and authorization. It also introduces a higher-level API for managing nodes rather than individual components, offering enhanced scalability and flexibility for larger environments.
OpenCHAMI Failure Domains
In the federated deployment model, each OpenCHAMI instance is considered a "failure domain"—the scope within which changes are applied and failures are contained. The orchestrator coordinates with each failure domain to manage changes and collect results, ensuring that the impact of any single change is limited to its specific domain.
Cell-Based Architecture and Failure Containment
Cell-Based Architecture is a design paradigm aimed at containing failures within a distributed system. Unlike other architectures that strive for a globally consistent view of resources, cell-based architecture focuses on isolating data and functionality within independent cells. This design confines failures to the affected cell, allowing for localized management and minimizing the impact on the overall system.
Principles of Cell-Based Architecture
Cell-Based Architecture operates on several key principles:
Modularity: The system is divided into distinct, manageable cells, each responsible for specific roles or services. This modularity facilitates isolated updates, scaling, and maintenance.
Isolation: Each cell functions independently, ensuring that issues or failures in one cell do not impact others. This isolation is crucial for maintaining overall system stability and limiting the scope of failures.
Redundancy and Fault Tolerance: Cells are designed with redundancy and fault tolerance in mind. By replicating critical components within each cell, the system can continue operating even if some components or cells fail.
Scalability: Cells can be scaled independently to meet demand. This approach ensures efficient resource allocation and the ability to adapt to changing requirements.
Benefits of Cell-Based Architecture
The cell-based approach offers several benefits:
Localized Impact: Failures are confined to the affected cell, reducing the risk of widespread outages. For instance, a failure in a cooling group-level cell does not affect other cells in different cooling groups or datacenters.
Simplified Troubleshooting: Issues can be addressed more easily within individual cells, allowing administrators to focus on specific areas without examining the entire system.
Reduced Downtime: Maintenance or upgrades can be performed on individual cells without disrupting the entire system. This helps to minimize downtime and maintain continuous operation.
Enhanced Security: Isolating cells can improve security by containing potential breaches within a single cell, preventing unauthorized access from affecting other parts of the system.
Examples of Cell-Based Architecture
Cell-based architecture can be applied in various scenarios:
Cabinet-Level Cells: In a datacenter, each cabinet functions as a cell. If a cabinet encounters a hardware failure or requires maintenance, other cabinets remain operational, ensuring the datacenter’s overall functionality.
Cooling Group-Level Cells: Cooling systems in a datacenter are often organized into groups. Each group can act as a cell, managing specific cabinets or racks. A failure in one cooling group does not affect others, preventing overheating and maintaining system stability.
Regional or Availability Zone-Based Cells: In large-scale environments, it is useful to organize resources into regions or availability zones, each acting as a cell. Cloud providers often group resources this way, allowing for comprehensive management and redundancy. Failures in one region or zone do not impact others, ensuring service continuity.
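To make these cell boundaries concrete, the sketch below shows one hypothetical way to describe cells as data; the Go types, field names, and endpoints are illustrative assumptions, not part of any existing OpenCHAMI schema.

```go
package main

import "fmt"

// CellScope is the physical or logical boundary a cell maps to.
// These values are illustrative; OpenCHAMI does not currently define them.
type CellScope string

const (
	ScopeCabinet      CellScope = "cabinet"
	ScopeCoolingGroup CellScope = "cooling-group"
	ScopeZone         CellScope = "availability-zone"
)

// Cell describes one independently operable failure domain.
type Cell struct {
	Name     string    // e.g. "cab-07" or "zone-a"
	Scope    CellScope // boundary at which failures are contained
	Endpoint string    // base URL of the OpenCHAMI instance managing this cell
	Nodes    []string  // node identifiers owned by this cell
}

func main() {
	cells := []Cell{
		{Name: "cab-07", Scope: ScopeCabinet, Endpoint: "https://cab-07.example.internal", Nodes: []string{"x1000c0s0b0n0"}},
		{Name: "cooling-g2", Scope: ScopeCoolingGroup, Endpoint: "https://cg-2.example.internal"},
	}
	for _, c := range cells {
		fmt.Printf("cell %s (%s) managed at %s, %d nodes\n", c.Name, c.Scope, c.Endpoint, len(c.Nodes))
	}
}
```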
Application to OpenCHAMI’s Failure Domains
For OpenCHAMI, the failure domain concept aligns with cell-based architecture principles. Each OpenCHAMI instance manages a self-contained failure domain, operating as a discrete cell within the larger system. This design provides:
Granular Control: Administrators can manage and monitor individual failure domains independently, allowing for precise control over changes and maintenance without affecting other domains.
Fault Isolation: Failures or changes in one domain are contained, minimizing potential disruptions and ensuring that impacts are limited to the affected domain.
Scalable Management: OpenCHAMI can scale across multiple failure domains, each operating independently but managed cohesively through the federated model.
By adopting cell-based architecture, OpenCHAMI enhances its ability to manage large, complex environments, ensuring robust failure containment and improved system resilience.
OpenCHAMI Orchestration and Federation Services
The OpenCHAMI Federation Services will include centralized third-party services to support federation and manage failure domains under orchestration. Each failure domain must be capable of operating independently if federation services become unavailable, with appropriate caching mechanisms implemented at the domain level.
Key services include:
Certificate Authority Services: In standalone deployments, StepCA provides automated certificate generation and renewal through the ACME protocol, supporting multiple certificate chains as needed (a client-side sketch follows this list).
OpenID Connect (OIDC) Services: [Details needed]
JSON Web Token Issuance Services: [Details needed]
API Federation Services: A new microservice that offers an API interface for users, managing downstream interactions with each failure domain.
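For the Certificate Authority item above, the following is a minimal sketch of how a service inside a failure domain might obtain and renew its certificate from a step-ca ACME endpoint using Go's golang.org/x/crypto/acme/autocert; the directory URL, hostnames, and cache path are assumptions, and this is one possible client pattern rather than the proposal's prescribed mechanism.

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/crypto/acme"
	"golang.org/x/crypto/acme/autocert"
)

func main() {
	// Hypothetical step-ca ACME directory inside one failure domain.
	// Assumes the step-ca root certificate is already in the system trust store.
	const acmeDirectory = "https://stepca.domain-a.example.internal/acme/acme/directory"

	m := &autocert.Manager{
		Prompt:     autocert.AcceptTOS,
		Cache:      autocert.DirCache("/var/lib/openchami/certs"), // cache issued certs on local disk
		HostPolicy: autocert.HostWhitelist("api.domain-a.example.internal"),
		Client:     &acme.Client{DirectoryURL: acmeDirectory},
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	srv := &http.Server{
		Addr:      ":8443",
		Handler:   mux,
		TLSConfig: m.TLSConfig(), // obtains and renews certificates via ACME on demand
	}
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```

Caching issued certificates on local disk also supports the severability requirement discussed later: the domain can keep serving with its last-issued certificate if the CA is temporarily unreachable.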
Key Areas of Experimentation
To ensure the federated model’s effectiveness, exploration is needed in the following key areas:
Registration and Secure Communication
The registration process must be defined to establish connections between OpenCHAMI instances and the orchestration service. Considerations include the need for additional services within each domain to facilitate registration and communication, the potential use of lightweight VPN solutions like WireGuard for added security alongside TLS, and the extension of OIDC services to enable secure, two-way communication with TPMs to ensure system integrity.
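A minimal sketch, assuming a hypothetical /v1/domains/register endpoint and illustrative file paths, of what registration over mutually authenticated TLS could look like from the failure-domain side; the payload fields (including the TPM attestation placeholder) are assumptions meant only to show the shape of the exchange.

```go
package main

import (
	"bytes"
	"crypto/tls"
	"crypto/x509"
	"encoding/json"
	"log"
	"net/http"
	"os"
)

// RegistrationRequest is a hypothetical payload a failure domain sends to the orchestrator.
type RegistrationRequest struct {
	DomainName   string `json:"domain_name"`    // e.g. "cab-07"
	Endpoint     string `json:"endpoint"`       // base URL of this OpenCHAMI instance
	TPMQuote     string `json:"tpm_quote"`      // placeholder for a TPM-backed attestation blob
	PublicKeyPEM string `json:"public_key_pem"` // key the orchestrator should associate with this domain
}

func main() {
	// Client certificate issued within the domain; the orchestrator trusts that chain.
	cert, err := tls.LoadX509KeyPair("/etc/openchami/domain.crt", "/etc/openchami/domain.key")
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := os.ReadFile("/etc/openchami/orchestrator-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
	}}}

	body, _ := json.Marshal(RegistrationRequest{
		DomainName: "cab-07",
		Endpoint:   "https://cab-07.example.internal",
	})
	resp, err := client.Post("https://orchestrator.example.internal/v1/domains/register",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("registration status:", resp.Status)
}
```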
Coordinating Actions Across Domains
Coordinating actions triggered by the orchestrator and executed in parallel across multiple domains requires evaluating off-the-shelf workflow management systems, such as Temporal. These systems should be assessed for their ability to handle the scale and complexity of the required tasks.
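If Temporal were selected, a fan-out workflow could look roughly like the sketch below, using the Temporal Go SDK to execute one activity per failure domain in parallel and collect per-domain results; the workflow, activity, and timeout values are illustrative.

```go
package orchestration

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// ApplyChangeToDomain is a hypothetical activity that pushes one change
// (e.g. a node configuration update) to a single failure domain's API.
func ApplyChangeToDomain(ctx context.Context, domainEndpoint string) (string, error) {
	// A real implementation would call the failure domain's OpenCHAMI API here.
	return "applied to " + domainEndpoint, nil
}

// FanOutChangeWorkflow applies the same change to many failure domains in parallel,
// collecting per-domain results so a failure in one domain does not block the others.
func FanOutChangeWorkflow(ctx workflow.Context, domainEndpoints []string) ([]string, error) {
	ao := workflow.ActivityOptions{StartToCloseTimeout: 5 * time.Minute}
	ctx = workflow.WithActivityOptions(ctx, ao)

	futures := make([]workflow.Future, 0, len(domainEndpoints))
	for _, ep := range domainEndpoints {
		futures = append(futures, workflow.ExecuteActivity(ctx, ApplyChangeToDomain, ep))
	}

	results := make([]string, 0, len(domainEndpoints))
	for i, f := range futures {
		var res string
		if err := f.Get(ctx, &res); err != nil {
			// Record the failure but keep collecting results from other domains.
			results = append(results, "failed: "+domainEndpoints[i])
			continue
		}
		results = append(results, res)
	}
	return results, nil
}
```

Because each activity targets exactly one failure domain, a failure or timeout in one domain surfaces as a single per-domain result rather than failing the entire rollout.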
Orchestration API Design
The Orchestration API must be designed to support necessary functionality for managing nodes, handling certificate operations, and coordinating actions across failure domains. The API design should align with user requirements and support efficient, secure management of the federated deployment.
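As a starting point for discussion, the sketch below lays out one hypothetical route structure for the Orchestration API covering node management, certificate operations, and cross-domain actions; the paths and payloads are assumptions, not an agreed-upon interface.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// These routes are a hypothetical shape for the Orchestration API,
// not an agreed-upon OpenCHAMI interface.
func main() {
	mux := http.NewServeMux()

	// Higher-level node management across failure domains.
	mux.HandleFunc("/v1/nodes", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]any{"nodes": []string{}})
	})

	// Certificate operations delegated to the central CA services.
	mux.HandleFunc("/v1/certificates/renew", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted)
	})

	// Coordinated actions fanned out to registered failure domains.
	mux.HandleFunc("/v1/domains/actions", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted)
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```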
Severability
Each failure domain must not only be capable of functioning in isolation; administrators must also be able to take corrective action when the orchestration manager is unreachable. As with other distributed systems, troubleshooting and remediation are most necessary precisely when the overall system is in an unknown state. See Bronson, Charapko, Aghayev, and Zhu.
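A minimal sketch of the kind of domain-local fallback this implies: probe the orchestrator and, if it is unreachable, switch into a "severed" mode that relies on cached certificates, cached keys, and locally provisioned admin credentials; the health endpoint, timeout, and mode handling are assumptions.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// orchestratorReachable probes a hypothetical orchestrator health endpoint.
func orchestratorReachable(ctx context.Context, url string) bool {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	const healthURL = "https://orchestrator.example.internal/v1/healthz" // hypothetical endpoint

	if orchestratorReachable(context.Background(), healthURL) {
		log.Println("federated mode: deferring to central auth and certificate services")
		return
	}
	// Severed mode: keep serving with cached certificates, cached keys,
	// and locally provisioned admin credentials so corrective action stays possible.
	log.Println("severed mode: using domain-local caches and break-glass admin access")
}
```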
Conclusion
The proposed federated deployment model for OpenCHAMI, together with continued support for standalone deployments, provides enhanced scalability, security, and management flexibility. Addressing key areas such as registration and secure communication, parallel action coordination, API design, and severability will ensure that OpenCHAMI delivers a robust solution for managing complex and large-scale environments.