flux-framework / flux-coral2

Plugins and services for Flux on CORAL2 systems
GNU Lesser General Public License v3.0

MPI: Integrate with HPE's CXI library for allocating VNIs #24

Open jameshcorbett opened 1 year ago

jameshcorbett commented 1 year ago

Per the latest batch of emails with Cray, it looks like the Shasta APIs can be used a la carte by the WLM.

APIs that we would likely use:

- HMS's Hardware Inventory API (combined with the ClusterStor Inventory API)
- The ATOM node health check API

Based on the latest emails and discussions, we will not be using either of these interfaces. We will run node health checks only through the default TOSS4 utility (nodediag?).

CXI library for allocating/setting up VNIs - requires root

jameshcorbett commented 1 year ago

Slingshot requires the use of VNIs (think of VLANs). If you use the same VNI for everything, eventually you exhaust endpoints on the switches. Slurm will be using one VNI per job step.

For Flux:

Day 0: pre-allocate X VNIs per sub-instance, then launches within that sub-instance round-robin across VNIs
Day 1: allow users to request extra VNIs per sub-instance
Day 2: a bespoke setuid binary that can do anything when run at the top level, but when run as a user is limited to whatever the top level has allowed
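
To make the Day 0 idea concrete, here is a minimal sketch of round-robin VNI selection inside one sub-instance. Everything in it is an assumption for illustration: the `SubInstanceVniPool` class, its method name, and the VNI numbers are hypothetical, and nothing here talks to the CXI library or to Flux.

```python
from itertools import cycle


class SubInstanceVniPool:
    """Hypothetical pool of VNIs pre-allocated to one sub-instance."""

    def __init__(self, vnis):
        if not vnis:
            raise ValueError("a sub-instance needs at least one VNI")
        self.vnis = list(vnis)
        # cycle() yields the VNIs in order and wraps around forever.
        self._next = cycle(self.vnis)

    def vni_for_launch(self):
        """Return the VNI for the next launch, cycling through the pool."""
        return next(self._next)


# Example: a sub-instance that was handed three VNIs (values are made up).
pool = SubInstanceVniPool([1024, 1025, 1026])
assigned = [pool.vni_for_launch() for _ in range(5)]
print(assigned)  # [1024, 1025, 1026, 1024, 1025]
```

With more launches than VNIs, launches end up sharing a VNI, which is the trade-off that the Day 1 option (requesting extra VNIs) would relax.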

garlick commented 1 year ago

See also: https://github.com/SchedMD/slurm/tree/master/src/plugins/switch/hpe_slingshot

garlick commented 1 year ago

VNI tagging was brought up again recently in a (not public) TOSS issue: https://lc.llnl.gov/jira/browse/TOSS-5932

This statement from the issue seemed like a good description of the problem:

As part of the changes that implement VNI tagging on the HPE Slingshot NIC, as of Slingshot 2.0.1 the default CXI Service has been disabled. This means that deployments must implement additional host-side configuration (using the job scheduler plug-ins for example) implement VNI tagging, or explicitly re-enable the default service to have applications operate as in previous releases. (This also means that CXI diagnostics need to pass the VNI information on the command line). HPE recommends fully implementing VNI tagging for isolating RDMA traffic to protect against memory writes from nodes not known to be part of the job. Refer to Section 8.3. of the HPE Slingshot Operations Guide - Customer for more information.

trws commented 1 month ago

I don't see the context elsewhere, or another issue, so I'll add it here. We need to implement VNI assignment, at least local to each node. We don't need to deal with the switch reconfiguration, which is the part that has performance and interface concerns, but it's also possible to exhaust resources on the NIC if we don't do the node-local assignment. From what I understand, there are two parts to this.

  1. Actually set up a range of VNIs on the NIC. I think Slurm does this per-job-step, but for us it makes much more sense to set up a range per system-level job that over-allocates.
  2. Use the appropriate environment variable to set the offset into the VNI range the current job should use. This doesn't require any privilege, so we can do this in every level below the system level.

In principle, as a start at least, I think we could actually just do (2) and it would work, but it wouldn't provide any protection against inappropriate cross-job/cross-user RDMAs.
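
As a rough illustration of the two parts above, the bookkeeping could look something like the following. This is only a sketch under stated assumptions: `VniRangeAllocator`, `child_environment`, and the `EXAMPLE_VNI_OFFSET` variable name are all hypothetical placeholders (the real environment variable is not named in this thread), and the privileged step of actually configuring the range on the NIC through the CXI library is not shown.

```python
import os


class VniRangeAllocator:
    """Hypothetical top-level allocator handing out disjoint VNI ranges."""

    def __init__(self, first_vni, last_vni):
        self.next_free = first_vni
        self.last_vni = last_vni

    def allocate(self, count):
        """Reserve `count` consecutive VNIs for one system-level job."""
        if count <= 0:
            raise ValueError("count must be positive")
        if self.next_free + count - 1 > self.last_vni:
            raise RuntimeError("VNI space exhausted")
        vni_range = range(self.next_free, self.next_free + count)
        self.next_free += count
        return vni_range


def child_environment(vni_range, job_index, base_env=None):
    """Build the environment for a nested job (needs no privilege).

    EXAMPLE_VNI_OFFSET is a placeholder name, not a real variable.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Wrap around so nested jobs reuse offsets once the range is used up.
    env["EXAMPLE_VNI_OFFSET"] = str(job_index % len(vni_range))
    return env


allocator = VniRangeAllocator(1024, 2047)
job_a = allocator.allocate(16)  # over-allocated range for one system job
job_b = allocator.allocate(16)  # disjoint from job_a
env = child_environment(job_a, job_index=17, base_env={})
print(job_a, job_b, env)
```

The sketch never returns ranges to the pool, so a real allocator would also need to reclaim a range when its system-level job ends. Doing only the offset half, as in `child_environment`, corresponds to option (2) alone and gives no isolation between jobs.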