jameshcorbett opened 1 year ago
Slingshot requires the use of VNIs (think VLANs). If you use the same VNI for everything, you eventually exhaust endpoints on the switches. Slurm will be using one VNI per job step.
For Flux:
- Day 0: pre-allocate X VNIs per sub-instance, then round-robin launches within that sub-instance across the VNIs
- Day 1: allow users to request extra VNIs per sub-instance
- Day 2: a bespoke setuid binary that, when run at the top level, can do anything, but when run as a user is limited to whatever the top-level instance has constrained it to
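The Day-0 scheme above can be sketched roughly as follows. This is a minimal illustration only, not Flux code: the `VniPool` class and its names are hypothetical, and it assumes a sub-instance is handed a fixed block of VNIs up front and hands one out per launch, wrapping around when the block is exhausted.

```python
from itertools import cycle

class VniPool:
    """Hypothetical sketch of Day-0 VNI handling: a sub-instance gets a
    pre-allocated block of VNIs and assigns them to launches round-robin."""

    def __init__(self, vnis):
        self._vnis = list(vnis)
        if not self._vnis:
            raise ValueError("need at least one pre-allocated VNI")
        self._next = cycle(self._vnis)  # endless round-robin iterator

    def assign(self):
        # Each launch within the sub-instance gets the next VNI in the
        # block; after the last VNI we wrap back to the first.
        return next(self._next)

# A sub-instance pre-allocated 4 VNIs, running 6 launches:
pool = VniPool(vnis=range(1024, 1028))
assignments = [pool.assign() for _ in range(6)]
```

With wrap-around, six launches over four VNIs yield `1024, 1025, 1026, 1027, 1024, 1025`; the point is that no single VNI is shared by everything, so switch endpoints are not concentrated on one VNI.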
VNI tagging was brought up again recently in a (not public) TOSS issue: https://lc.llnl.gov/jira/browse/TOSS-5932
This statement from the issue seemed like a good description of the problem:
> As part of the changes that implement VNI tagging on the HPE Slingshot NIC, as of Slingshot 2.0.1 the default CXI Service has been disabled. This means that deployments must implement additional host-side configuration (using the job scheduler plug-ins, for example) to implement VNI tagging, or explicitly re-enable the default service to have applications operate as in previous releases. (This also means that CXI diagnostics need to pass the VNI information on the command line.) HPE recommends fully implementing VNI tagging for isolating RDMA traffic to protect against memory writes from nodes not known to be part of the job. Refer to Section 8.3 of the HPE Slingshot Operations Guide - Customer for more information.
I don't see the context elsewhere, or in another issue, so I'll add it here. We need to implement VNI assignment, at least locally on each node. We don't need to deal with switch reconfiguration, which is the part with performance and interface concerns, but it's also possible to exhaust resources on the NIC if we don't do the node-local assignment. From what I understand, there are two parts to this.
In principle, as a start at least, I think we could just do (2) and it would work, but it wouldn't provide any protection against inappropriate cross-job/cross-user RDMA.
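The NIC-exhaustion concern above can be made concrete with a small bookkeeping sketch. Everything here is hypothetical: `NodeVniTable` and its limit stand in for whatever real CXI-service-backed mechanism tracks per-node VNI usage; the point is only that the NIC supports a finite number of distinct active VNIs, so node-local assignment has to account for them rather than silently over-allocate.

```python
class NodeVniTable:
    """Hypothetical node-local bookkeeping for VNIs active on one NIC.
    Reusing an already-active VNI is free; activating a new VNI beyond
    the NIC's (assumed) limit is an error instead of a silent failure."""

    def __init__(self, max_active):
        self.max_active = max_active  # assumed NIC limit on distinct active VNIs
        self.active = {}              # vni -> count of local jobs using it

    def acquire(self, vni):
        if vni not in self.active and len(self.active) >= self.max_active:
            raise RuntimeError("NIC VNI resources exhausted on this node")
        self.active[vni] = self.active.get(vni, 0) + 1

    def release(self, vni):
        self.active[vni] -= 1
        if self.active[vni] == 0:
            del self.active[vni]  # VNI no longer active on this NIC

# A NIC that (for illustration) supports only 2 distinct active VNIs:
table = NodeVniTable(max_active=2)
table.acquire(10)
table.acquire(11)
table.acquire(10)  # same VNI reused by another local job: no new NIC resource
try:
    table.acquire(12)  # third distinct VNI exceeds the limit
    exhausted = False
except RuntimeError:
    exhausted = True
```

This also illustrates why doing only the node-local part (option (2)) works functionally: resources are conserved, but nothing here prevents two different jobs from sharing a VNI, which is exactly the cross-job/cross-user RDMA exposure noted above.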