Currently, the way we manage switches is very disorganized. We have some Ansible helper scripts, but switches are managed from at least three different networks (e.g. MOC, CloudLab, etc.). We do not currently have a centralized "admin" network from which we can reach all the switches.
In addition, all the current management networks travel over data links, meaning a loss of a data link also results in a loss of the management link, which forces on-site intervention.
Proposal
Deploy a new machine (bare metal, VM, or container?) to act as the OCT "head node". Responsibilities of this node:
Host an identity server for switch users (all switches would keep a backup local user, but this would allow central management of users. Do we want to use an existing identity system?)
Create a new management network with an associated VLAN
Addressing in this network should be based on physical location. For example, if the address space is 10.220.0.0/18 and the switch is in R1-PC-C04, its address should be 10.220.13.41: in the third octet, the 1 corresponds to row R1 and the 3 corresponds to pod C (1 for pod A, 2 for B, 3 for C); the last octet combines the rack number with the switch ID within the rack (rack 04, switch 1 → 41). A minimal sketch follows this list.
The physical connections on this network should be entirely separate from the existing data fabric, which means new cable runs everywhere. The hope is that even if data links are failing, management will stay up.
Each row at MGHPCC that we are concerned with should have a 1000BASE-T distribution switch (these already exist in R1 and R6). This will be a hub-and-spoke model, where the distribution switches are all bonded together in a VLT trio (i.e. the R1 distribution switch has connections to R3 and R6, etc.).
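To make the location-based addressing above concrete, here is a minimal sketch (Python, standard library only) of how a switch's management address could be derived from its physical location. The function name, parameter names, and the pod-letter mapping are illustrative assumptions based on the example above, not an existing tool.

```python
import ipaddress

# Sketch of the proposed location-based addressing, assuming the 10.220.0.0/18
# space from the example above and pods A/B/C mapping to digits 1/2/3.
MGMT_SPACE = ipaddress.ip_network("10.220.0.0/18")
POD_DIGIT = {"A": 1, "B": 2, "C": 3}

def mgmt_address(row: int, pod: str, rack: int, switch_id: int) -> ipaddress.IPv4Address:
    """Derive a switch's management IP from its physical location.

    Example: row 1, pod C, rack C04, switch 1 -> 10.220.13.41
    (third octet = row digit then pod digit; last octet = rack*10 + switch id).
    """
    third_octet = row * 10 + POD_DIGIT[pod.upper()]
    # Note: packing rack*10 + switch_id into one octet only works while the
    # result stays below 256 (i.e. low rack numbers / few switches per rack).
    last_octet = rack * 10 + switch_id
    addr = ipaddress.IPv4Address(f"10.220.{third_octet}.{last_octet}")
    if addr not in MGMT_SPACE:
        raise ValueError(f"{addr} falls outside {MGMT_SPACE}")
    return addr

if __name__ == "__main__":
    # R1-PC-C04, first switch in the rack
    print(mgmt_address(row=1, pod="C", rack=4, switch_id=1))  # 10.220.13.41
```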
What would we need?
A distribution switch for R3 (we may have one lying around, but it would have to be a Dell to pair with the others); otherwise, this switch is ~$800.
For now, that's all we need. We have thousands of feet of bulk CAT6 for the new cables; all we need is the patience to terminate them all.
Other Considerations
Security
Putting everything on a single network raises security concerns. The switches themselves should use public-key SSH authentication wherever possible. The network itself should not be routable from the outside world and should only be reachable from a VPN or via the OCT head node.
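As a rough illustration of the head-node-as-only-entry-point idea, here is a minimal sketch (assuming Python with the paramiko library) of reaching a switch by tunnelling SSH through the head node with key-based authentication. The hostname, username, and key path are placeholders, not existing infrastructure, and host-key handling is simplified for brevity.

```python
import os
import paramiko

HEAD_NODE = "oct-head.example.org"   # placeholder head node address
SWITCH = "10.220.13.41"              # a management address from the scheme above
KEY_FILE = os.path.expanduser("~/.ssh/id_ed25519")  # public-key auth, no passwords

def connect_via_head_node(switch_addr: str) -> paramiko.SSHClient:
    """Open an SSH session to a switch, tunnelled through the OCT head node."""
    jump = paramiko.SSHClient()
    # In production, load and verify known host keys instead of auto-adding.
    jump.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    jump.connect(HEAD_NODE, username="admin", key_filename=KEY_FILE)

    # Open a direct-tcpip channel from the head node to the switch's SSH port,
    # so the switch itself never needs to be reachable from outside the network.
    channel = jump.get_transport().open_channel("direct-tcpip", (switch_addr, 22), ("", 0))

    target = paramiko.SSHClient()
    target.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    target.connect(switch_addr, username="admin", key_filename=KEY_FILE, sock=channel)
    return target

if __name__ == "__main__":
    client = connect_via_head_node(SWITCH)
    _, stdout, _ = client.exec_command("show version")
    print(stdout.read().decode())
    client.close()
```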
Collaboration
This OCT network is used by MOC, NERC, NET2, CloudLab, ESI, Chameleon, Operate First, Fabric, NESE, AL2S, the Unity Cluster (UMass), and probably more. We would need to properly document the use of the Ansible playbooks and the management of the network in general for the many teams involved.
In addition, the UMass Amherst networking team has an interest in taking part in the management of this infrastructure; they have their own SolarWinds setup for their switches, which are all Juniper.
Routing for Others
Many clusters that use this network have their own networks for managing their switches, and that can continue normally over the data fabric. However, we could institute a routing system where clusters are granted access to certain sections of the management network through a router.
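A minimal sketch of what "certain sections" could mean, assuming the location-based addressing above: each cluster would be routed only the per-row/per-pod subnets where its gear lives. The cluster-to-subnet mapping below is purely hypothetical.

```python
import ipaddress

MGMT_SPACE = ipaddress.ip_network("10.220.0.0/18")

# Hypothetical delegations: each cluster's router is only granted the
# management subnets for the pods it actually has equipment in.
CLUSTER_ROUTES = {
    "cloudlab": [ipaddress.ip_network("10.220.13.0/24")],  # row 1, pod C
    "nese":     [ipaddress.ip_network("10.220.61.0/24")],  # row 6, pod A
}

def allowed(cluster: str, switch_addr: str) -> bool:
    """Return True if the cluster should be able to reach switch_addr."""
    addr = ipaddress.ip_address(switch_addr)
    return addr in MGMT_SPACE and any(addr in net for net in CLUSTER_ROUTES.get(cluster, []))

if __name__ == "__main__":
    print(allowed("cloudlab", "10.220.13.41"))  # True: delegated row 1, pod C subnet
    print(allowed("cloudlab", "10.220.61.5"))   # False: not delegated to CloudLab
```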