OpenCHAMI / roadmap

Public Roadmap Project for Ochami

[RFD] Supercomputer Institute User Stories #25

Closed: synackd closed this issue 3 months ago

synackd commented 9 months ago

Context

To better focus efforts on getting a usable system working for the Supercomputer Institute (SI), it will help to know what the needs of the SI students will be. This RFD therefore serves as a reference for the user stories that Ochami should fulfill for SI students, administrators (instructors), and mentors.

Assumptions

Structure

It is assumed that there are several separate 10-node clusters, each assigned to a group of students (users). Each cluster is distinct from the others, and each is relatively small. All of the SI clusters, as well as the configuration management server and the distribution server, sit behind a single managed switch.

There exists a configuration management server where students provision their cluster (in the past, this meant running Ansible against their head node to configure Warewulf).
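
For illustration, that step looked roughly like running a playbook from the configuration management server against one cluster's head node; the inventory and playbook names below are hypothetical, not from the actual SI materials:

```shell
# Configure just the head node of one cluster from the management server
# (file names are placeholders):
ansible-playbook -i inventory/cluster01.ini site.yml --limit head
```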

There exists a distribution server that is meant to provide services to the clusters, e.g. a package repository, a Git server, head node power/console access (admin-only), etc.

Users

The users being considered are:

- Administrators (instructors and mentors)
- Cluster users (students)

User Stories

Administrator

Cluster User

synackd commented 9 months ago

We can add/remove/modify as needed.

stradling commented 9 months ago

- As an administrator, I would like to be able to quickly create new boot images with human-readable naming schemes using a relatively simple command.
- As an administrator, I would like to be able to download boot images and examine them for issues.
- As an administrator, I would like to be able to watch the image build configuration logs with a relatively simple command.
- As an administrator, I would like to be able to iterate rapidly on image builds.
- As an administrator, I would like to be able to create a staging image to test changes without displacing the present image.
- As an administrator, I would like to be able to build images from arbitrary versions of a given configuration set.
- As an administrator, I would like to be able to separate node config code from node config settings.
- As an administrator, I would like to be able to manage images (deletion, listing) with simple commands.
- As an administrator, I would like to be able to test modifications to node configuration code and settings on a running compute node prior to committing/tagging, using a simple workflow.
- As an administrator, I would like to be able to deploy configless Slurm so that compute nodes pick up their slurm.conf automatically (see the sketch after this list).
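
A minimal sketch of what the configless Slurm story implies, assuming a head node named `head` running slurmctld; the hostname and port are placeholders, not anything Ochami prescribes:

```shell
# On the controller, enable configless mode in slurm.conf:
#   SlurmctldParameters=enable_configless
#
# On each compute node, start slurmd pointed at the controller instead of
# baking slurm.conf into the image (6817 is slurmctld's default port):
slurmd --conf-server head:6817
```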

As a cluster user, I would like to be able to ssh to nodes that Slurm is using for my jobs.
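
One common way to get this behavior (not necessarily what the SI will use) is Slurm's pam_slurm_adopt module; a minimal sketch, assuming a typical Linux PAM layout:

```shell
# slurm.conf on the controller: track job steps in their own cgroups so
# that ssh sessions can be adopted into them:
#   PrologFlags=contain
#
# /etc/pam.d/sshd on each compute node: reject ssh logins unless the user
# has a job running there, and adopt the session into that job:
#   account    required    pam_slurm_adopt.so
```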

stradling commented 9 months ago

- As an administrator, I would like to be able to prevent cluster users from downloading boot images or configs.
- As an administrator, I would like to be able to prevent cluster users from reaching the management nodes.
- As an administrator, I would like to be able to check node health automatically and take malfunctioning nodes out of service (see the sketch below).
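
For the health-check story, Slurm can periodically run a health-check program on every node and drain nodes that fail; a sketch assuming the LBNL Node Health Check (NHC) package at its usual install path:

```shell
# slurm.conf: run NHC on each node every five minutes; NHC drains the
# node if a check fails:
#   HealthCheckProgram=/usr/sbin/nhc
#   HealthCheckInterval=300
```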

alexlovelltroy commented 9 months ago

@stradling These are super helpful. We can integrate most of them as further detail on the upcoming task descriptions.

Can you help clarify why each of them matters for our Supercomputer Institute? Some of them feel more advanced than we need, but I may only be thinking about the action you've described without the outcome you're shooting for.

stradling commented 9 months ago

The second set is more advanced. The first set is mostly written with my SI lead experience in mind -- those are the complexity points I feel will stymie students. Heck, they are the points that stymie colleagues in Platforms.

njones-lanl commented 9 months ago

As an administrator, I would like the ability to give each node a specific piece of configuration.

Why it matters for SI: I think there are a lot of configs that students may want to A/B test, one-off test, or customize for a specific functionality on a node. In Warewulf terms, I think this is something like `wwsh file import yourfilehere` for a specific node. I think this is covered on pages 12-13 of chapter 6 of last year's SI guide.
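
Spelled out in Warewulf 3 syntax, that per-node file provisioning looks roughly like the following (file and node names are examples):

```shell
# Import a file into Warewulf's datastore under a given name, then attach
# it to a single node so only that node receives it at provision time:
wwsh file import /root/special.conf --name=special.conf
wwsh provision set node01 --fileadd=special.conf
```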

alexlovelltroy commented 9 months ago

Some items to call out here.

alexlovelltroy commented 3 months ago

Closed because SI is complete for this year.

synackd commented 2 months ago

I went back and checked the boxes of user stories we achieved during the SI for posterity.

As for the S3 storage in the USRC: this was instead run on each cluster's head node.

For the user stories that were not implemented:

> As an administrator, I would like to be able to create/delete/modify a user once and have the user be present on all or a partition of the nodes in order to quickly grant/revoke non-superuser access as needed without requiring a reboot.

The way users were configured in the SI was a bit hacky: the user database files were provisioned via cloud-init. Had we had more time, perhaps an LDAP-based (or similar) solution could have been implemented.
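
For context, the cloud-init approach amounts to declaring users directly in user-data, roughly like this minimal sketch (username, group, and key are placeholders):

```yaml
#cloud-config
# Users are baked into the node's user-data; changing anything means
# re-rendering this config and rebooting, which is what made it hacky.
users:
  - name: student1
    groups: [wheel]
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA...example student1
```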

> As an administrator, I would like to be able to automatically discover compute nodes and add them to the node database.

Magellan was not quite tested and ready by the time node discovery was done in the SI boot camp; however, it became ready afterward and some students used it after their initial cluster setup. It is more mature now, but I left it unchecked because we did not use it in the SI boot camp.

> As an administrator, I would like to be able to group nodes. That way, I can assign images to different groups.

Group functionality was not ready during the SI boot camp and so it was not used.
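
For reference, SMD exposes an HSM-compatible groups API, so once group functionality is ready the story maps to calls roughly like this (the URL and component ID below are placeholders):

```shell
# Create a group and seed it with one node; images can then be assigned
# per group rather than per node:
curl -X POST https://smd.example/hsm/v2/groups \
  -H 'Content-Type: application/json' \
  -d '{"label": "compute", "members": {"ids": ["x1000c0s0b0n0"]}}'
```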