
Initial Tenant and Partition Management System #11

Open alexlovelltroy opened 3 years ago

alexlovelltroy commented 3 years ago

Abstract

Our existing APIs for managing groups of nodes are based on adjusting group membership in the hardware state manager. To address the multitenancy use cases our customers want from CSM, we need to add a higher-level API that can manipulate groups in HSM as well as orchestrate any corresponding changes to networking and other services. This enhancement proposal covers the proposed design for the TAPMS service and suggests progressive delivery of the functionality.

Problem Statement

System administrators have only crude tools to manage groups with HSM, and there's no convenient way to create a partition based on rules that matter to users. For example, creating a group/partition that covers all nodes in a cooling group involves several API calls, and nothing prevents illogical groupings. As we add multitenancy use cases and network isolation logic, the safety of these operations becomes more important. In part, this enhancement prevents future customer problems by establishing that a higher-level API is more appropriate than a low-level API for these kinds of operations.

Internal References

External References

Use this section to point to other software that has either dealt with or avoided similar problems. Even though this is the external section, links to other CSM software are appropriate.

Proposed Solution(s)

I propose the TAPMS service primarily as CRUD operations on several resources, with lifecycle events tied to state changes in each resource. The partition and tenant resources are the primary targets of the API, but rather than working on them directly, most options should be managed on template resources that can be adjusted and validated before they are applied to the system, which in turn updates the relevant partition/tenant resources. The canonical representation of every resource in this service is a UUID, from which all URLs are built. Convenience names are metadata applied on top of the resource, and convenience URLs may be supported for referencing resources by name.
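To make the resource model concrete, here is a minimal sketch in Go with JSON tags. All field names are illustrative assumptions, not a published schema; the only ideas taken from the proposal are the UUID as the canonical identifier and the convenience name as metadata layered on top.

```go
// A hedged sketch of the common resource envelope, assuming hypothetical
// field names.
package tapms

import "time"

// ResourceMeta is the metadata envelope shared by all TAPMS resources.
type ResourceMeta struct {
	ID        string    `json:"id"`             // canonical UUID; all URLs are built from this
	Name      string    `json:"name,omitempty"` // convenience name; metadata only, not the canonical reference
	CreatedAt time.Time `json:"createdAt"`      // lifecycle timestamps
	UpdatedAt time.Time `json:"updatedAt"`
}
```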

Partition Resource

The partition resource roughly maps to the HMS group representation, but provides a unique endpoint that can also store metadata about the resource such as ownership, timestamps for lifecycle events, and status. With subsequent updates of the partition resource, we should add fields to represent things like the management network, Slingshot VLANs, IP subnets, DNS domains, SPIFFE trust domains, and other partition-specific metadata that may be optionally created and managed as part of the lifecycle of the partition.
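A hedged sketch of what such a partition resource might look like, building on the envelope above; the HSM group link, status values, and network fields are illustrative assumptions only.

```go
// Partition is an illustrative sketch, not a committed schema.
type Partition struct {
	ResourceMeta
	Owner          string   `json:"owner,omitempty"`
	Status         string   `json:"status"`                   // e.g. "pending", "active", "deleting"
	HSMGroup       string   `json:"hsmGroup,omitempty"`       // backing HSM group label
	SlingshotVLANs []int    `json:"slingshotVlans,omitempty"` // candidates for subsequent
	IPSubnets      []string `json:"ipSubnets,omitempty"`      // revisions: Slingshot VLANs,
	DNSDomain      string   `json:"dnsDomain,omitempty"`      // IP subnets, DNS domains,
	TrustDomain    string   `json:"trustDomain,omitempty"`    // and SPIFFE trust domains
}
```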

Partition Template Resource

Rather than interacting directly with the partition resources for most create operations, the template resource is the primary interface. It can be managed without actually changing the state of the system and can be validated against the current state of the cluster to ensure it is possible to apply. When administrators are satisfied that a template is valid and ready, they call the partition API with a create operation that either references the URL/ID of the template or embeds the template in the create operation payload. The API should also support the creation of new templates by copying existing templates.
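A sketch of the validate-then-create flow this implies; the base URL and endpoint paths are hypothetical, and only the shape of the flow (validate a template, then create a partition that references it) comes from the proposal.

```go
// Hypothetical client-side flow; no endpoint names here are committed.
package tapmsclient

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

const base = "https://api-gw/apis/tapms/v1" // hypothetical base URL

// ValidateTemplate asks the service to check a partition template against
// the current cluster state without changing anything.
func ValidateTemplate(id string) error {
	resp, err := http.Post(base+"/partition-templates/"+id+"/validate", "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("template %s failed validation: %s", id, resp.Status)
	}
	return nil
}

// CreatePartition creates a partition by referencing a validated template ID
// (the proposal also allows embedding the template in the payload instead).
func CreatePartition(templateID string) error {
	body, _ := json.Marshal(map[string]string{"templateId": templateID})
	resp, err := http.Post(base+"/partitions", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return fmt.Errorf("partition create failed: %s", resp.Status)
	}
	return nil
}
```

The separation keeps iteration on a template side-effect free: validation failures surface before any HSM group or network state is touched.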

Tenant Resource

The tenant resource roughly maps to a group of users that are able to access and/or manage one or more partitions. For initial multitenancy use cases it is unlikely to be necessary, since authentication/authorization is unlikely to be split and users are unlikely to interact with the Keycloak-managed authentication and authorization domains. Like the partition resource, it stores metadata and timestamps of lifecycle events.
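A hedged sketch of the tenant resource, reusing the envelope sketched earlier; the Keycloak group reference and partition UUID list are illustrative assumptions.

```go
// Tenant is an illustrative sketch, not a committed schema.
type Tenant struct {
	ResourceMeta
	UserGroup  string   `json:"userGroup,omitempty"`  // e.g. a Keycloak-managed group, if auth is split later
	Partitions []string `json:"partitions,omitempty"` // UUIDs of partitions this tenant may access/manage
}
```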

Tenant Template Resource

The tenant template resource is analogous to the partition template resource. It allows system administrators to experiment with and validate tenants without actually changing the system. They can iterate safely until they are satisfied that the template is valid and appropriate for use, before calling the tenant API to create the tenant resource.

Impact of Action/Inaction

Without action on this proposed enhancement, administrators will need to develop their own tooling to interact with the existing APIs and add the appropriate safety rules. It is unlikely that multiple customers will handle this the same way, which will cause fragmentation within the CSM community.

Action on this proposal creates a de facto standard for API-driven partitioning of an HPC system, and we should consider releasing it as a standard at a future date so other system management tooling can benefit.

rkleinman-hpe commented 3 years ago

Requesting feedback from @zcrisler.

Comment period for this EP will close on 28 May 2021.

mpkelly-hpe commented 3 years ago

Can/should this doc include some hypothetical examples/use cases? For example, the template concept is a little unclear -- an example would help.

Should these EP docs going forward include use cases and/or examples? I think it would be helpful.

alexlovelltroy commented 3 years ago

Are you looking for the contents of the external references to be added to the EP or is there more detail missing?

zcrisler commented 3 years ago

Milestone 1 for TAPMS says the rules engine should constrain partitions:

Lowest level partition is a cooling group, 4 mountain cabinets, down to the CMU per chassis for VLAN configuration

Regarding WLM multitenancy: if the integration point for calling this API is from the SLURM prolog/epilog, do we need to ensure partitions are precisely aligned with the granularity of SLURM's scheduler?

alexlovelltroy commented 3 years ago

I'd rather not include SLURM in the design here, and instead use the plugin/prolog/epilog system as a bridge between SLURM's granularity and our practical limits. Basically, let's force the prolog to fail if it requests a configuration of nodes that can't work.
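Roughly, the prolog could call a small helper like this sketch; the endpoint path is hypothetical, and SLURM_JOB_NODELIST is assumed to be pre-expanded into a comma-separated list (in practice it is a compressed hostlist expression that would need expanding first).

```go
// Hypothetical prolog helper: exits nonzero (failing the prolog, so SLURM
// does not start the job) when the requested node set cannot form a valid
// partition.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
)

func main() {
	// Assumes the node list has already been expanded to "nid001,nid002,...".
	nodes := strings.Split(os.Getenv("SLURM_JOB_NODELIST"), ",")
	body, _ := json.Marshal(map[string][]string{"nodes": nodes})
	resp, err := http.Post("https://api-gw/apis/tapms/v1/partition-templates/validate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Fprintln(os.Stderr, "validation request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintln(os.Stderr, "node set cannot form a valid partition:", resp.Status)
		os.Exit(1)
	}
}
```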

zcrisler commented 3 years ago

So then we're pushing a requirement onto WLMs that they schedule jobs onto nodes based on TAPM's partition constraints? Do we need someone from WLM to review this proposal then?

rkleinman-hpe commented 3 years ago

@zcrisler -- looks like it is up to you to decide if WLM should review this proposal.

johren-hpe commented 3 years ago

I agree with @mpkelly-hpe. I see there are use cases defined in the internal references. I think it would be helpful to provide an example of how one or two of those use cases could be applied to the solution described above. IOW, how might I use the Partition Template Resource and/or Tenant Template Resource to implement the Multi-Cluster Management use case?

jeremy-duckworth commented 1 year ago

Software-defined multi-tenancy, modeled after the initial broad TAPMS concept, began implementation in CSM 1.3 and is being guided in large part by interaction with the multi-tenancy SIG. Marking this EP as done, but additional EPs to guide direction are welcome, whether as part of or external to the multi-tenancy SIG. See https://github.com/Cray-HPE/docs-csm/blob/main/operations/multi-tenancy/Overview.md for the most recent documentation covering the multi-tenancy feature set.