[RFD] Supercomputer Institute User Stories

synackd commented 9 months ago

Context

To better focus efforts to get a usable system working for the Supercomputer Institute, it will help to know what the needs of the SI students will be. Therefore, this RFD serves as a reference of user stories that Ochami should fulfill for SI students, administrators (instructors), and mentors.

Assumptions

Structure

It is assumed that there are several, separate 10-node clusters, each assigned to a group of students (users). Each cluster is distinct from each other and each is relatively small. All of the SI clusters, as well as the configuration management server and distribution server, are behind a single managed switch.

There exists a configuration management server where students provision their cluster (in the past, this meant running Ansible against their head node to configure Warewulf).

There exists a distribution server that is meant to provide services to the clusters, e.g. a package repository, a Git server, head node power/console access (admin-only), etc.

Users

The users being considered are:

Administrator: The person that stands up and administers the cluster. They will be the ones installing and standing up the clusters. From a production lens, they are the system administrators and users of the clusters.
- This role will be assumed by students, as well as mentors who continue student work after the SI concludes.
Cluster User: Users of services on the cluster, i.e. users that run jobs or conduct work (like benchmarking).
- This role will also be assumed students and possibly mentors.

User Stories

Administrator

[x] As an administrator, I would like to be able to SSH into the head node in order to interact with the administrative VM. - This means that any administrative user on the head node will need the ability to authenticate to the head node and sudo permissions sufficient to manage the administrative VM.
[x] As an administrator, I would like to be able to SSH as myself into the compute nodes from the head node without a password in order to troubleshoot the cluster.
[x] As an administrator, I would like to be able to access nodes via hostnames rather than having to remember ip addresses.
[x] As an administrator, I would like to be able to view head and compute node logs in a central location in order to troubleshoot any errors.
[x] As an administrator, I would like nodes to report their health to a central location in order to quickly catch and troubleshoot problems.
[x] As an administrator, I would like access to a hosted gitlab environment that can rebuild compute node images as needed and store them in a known URL-accessible location in order to make new images available for booting compute nodes. - Likely means installation of a new GitLab instance within the USRC, likely our distribution server.
[x] As an administrator, I would like access to an S3-compatible object store in USRC for storing artifacts like boot images that persist.
[x] As an administrator, I would like to be able to easily assign compute node images to compute nodes in order to boot them.
[x] As an administrator, I would like my GitLab pipelines and my S3 buckets to not be interfered with by other SI participants in order to prevent surprises.
[ ] As an administrator, I would like to be able to create/delete/modify a user once and have the user be present on all or a partition of the nodes in order to quickly grant/revoke non-superuser access as needed without requiring a reboot.
[x] As an administrator, I would like to be able to reinstall the head node in an automated way (e.g. Kickstart).
[x] As an administrator, I would like to be able to control the power of the compute nodes from the head node.
[x] As an administrator, I would like to be able to access the serial console of the compute nodes from the head node.
[ ] As an administrator, I would like to be able to automatically discover compute nodes and add them to the node database.
[ ] As an administrator, I would like to be able to group nodes. That way, I can assign images to different groups.
[x] As an administrator, I would like to be able to run commands on all nodes (e.g. ClusterShell).
[x] As an administrator, I would like to be able to install drivers (e.g. HSN drivers like MOFED) within compute node images.

Cluster User

[x] As a cluster user, I would like to be able to access my home directory across all nodes.
[x] As a cluster user, I would like to be able to use Slurm to run jobs.
[x] As a cluster user, I would like to be able to ssh into a node, (head node or a vm) in order to interact with Slurm. This should not require sudo permissions.
[x] As a cluster user, I would like to be able to configure a programming environment for a job in order to compile and then run jobs on the compute nodes. (e.g. using modules like lmod). - Charliecloud/Singularity/etc.. is preferred, but we should also have an option for configuring the compute nodes directly.

synackd commented 9 months ago

We can add/remove/modify as needed.

stradling commented 9 months ago

As an administrator, I would like to be able to quickly create new boot images with human-readable naming schemes using a relatively simple command As an administrator, I would like to be able to download boot images and examine them for issues As an administrator, I would like to be able to watch the image build configuration logs with a relatively simple command As an administrator, I would like to be able to iterate rapidly on image builds As an administrator, I would like to be able to create a staging image to test changes without displacing the present image As an administrator, I would like to be able to build images from arbitrary versions of a given configuration set As an administrator, I would like to be able to be able to separate node config code from node config settings As an administrator, I would like to be able to manage images (deletion, listing) with simple commands As an administrator, I would like to be able to test modifications to node configuration code and settings on a running compute node prior to committing/tagging using a simple workflow As an administrator, I would like to be able to deploy configless Slurm so that compute nodes pick up their slurm.conf automatically

As a cluster user, I would like to be able to ssh to nodes that Slurm is using for my jobs

stradling commented 9 months ago

As an administrator, I would like to be able to prevent cluster users from downloading boot images or configs As an administrator, I would like to be able to prevent cluster users from reaching the management nodes As an administrator, I would like to be able to check node health automatically and take malfunctioning nodes out of service

alexlovelltroy commented 9 months ago

@stradling These are super helpful. We can integrate most of them as further detail on the upcoming task descriptions.

Can you help clarify why each of them matters for our Supercomputer Institute? Some of them feel more advanced than we need, but I may only be thinking about the action you've described without the outcome you're shooting for.

stradling commented 9 months ago

The second set are more advanced. The first set are mostly with my SI lead experience in mind -- those are the complexity points I feel like will stymie students. Heck, they are the points that stymie colleagues in Platforms.

njones-lanl commented 9 months ago

As an administrator, I would like the ability to give each node a specific piece of configuration.

Why it matters for SI: I think there are a lot of configs that students may want to A/B test, one off test, or customize for a specific functionality on a node. In warewulf terms I think this is something like wwsh file import yourfilehere for a specific node. I think this is covered in page 12/13 of chapter 6 of last years SI guide.

alexlovelltroy commented 9 months ago

Some items to call out here.

We will require some resources in the USRC datacenter that are not part of any individual cluster for handling S3/OCI and likely GitLab
@stradling highlighted the need for admins to be able to troubleshoot image builds which makes it clear to me that they need local as well as remote build capabilities.
We will encourage users to run containerized jobs

alexlovelltroy commented 3 months ago

Closed because SI is complete for this year.

synackd commented 2 months ago

I went back and checked the boxes of user stories we achieved during the SI for posterity.

For the S3 storage in the USRC, this was instead run on each cluster's head node.

For the user stories that were not implemented:

As an administrator, I would like to be able to create/delete/modify a user once and have the user be present on all or a partition of the nodes in order to quickly grant/revoke non-superuser access as needed without requiring a reboot.

The way users were configured in the SI was a bit hacky, provisioning the user database files in cloud-init. Had we had more time, perhaps an LDAP-based (or similar) solution could have been implemented.

As an administrator, I would like to be able to automatically discover compute nodes and add them to the node database.

Magellan was not quite tested and ready by the time node discovery was done in the SI boot camp; however, it became ready afterward and some students used it after their initial cluster setup. It is more mature now, but I left it unchecked because we did not use it in the SI boot camp.

As an administrator, I would like to be able to group nodes. That way, I can assign images to different groups.

Group functionality was not ready during the SI boot camp and so it was not used.

OpenCHAMI / roadmap