Currently, the way we manage switches is very disorganized. We have some Ansible helper scripts, but switches are managed from at least three different networks (e.g. MOC, CloudLab, etc.). We do not currently have a centralized "admin" network from which we can reach all the switches.
In addition, all the current management networks travel over data links, meaning a loss of a data link also results in a loss of the management link, which forces on-site intervention.
Proposal
Deploy a new machine (bare metal, VM, or container?) to act as the OCT "head node". Responsibilities of this node:
Host an identity server for switch users (all switches would keep a backup local user, but this would allow central management of users. Do we want to use an existing identity system?)
Create a new management network with an associated VLAN
Addressing in this network should be based on physical location. For example, if the address space is 10.220.0.0/18 and the switch is in R1-PC-C04, its address should be 10.220.13.41: in the third octet, the 1 corresponds to row R1 and the 3 corresponds to pod C (1 for pod A, 2 for B, 3 for C); the last octet combines the rack number with the switch ID within the rack (rack 04, switch 1 → 41). A minimal sketch follows this list.
The physical connections on this network should be entirely separate from the existing data fabric, which means new cable runs everywhere. The hope is that even if data links are failing, management will stay up.
Each row at MGHPCC that we are concerned with should have a 1000BASE-T distribution switch (these already exist in R1 and R6). This will be a hub-and-spoke model, where the distribution switches are all bonded together in a VLT trio (i.e. the R1 distribution switch has connections to R3 and R6, etc.).
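To make the location-based addressing above concrete, here is a minimal sketch (Python, standard library only) of how a switch's management address could be derived from its physical location. The function name, parameter names, and the pod-letter mapping are illustrative assumptions based on the example above, not an existing tool.

```python
import ipaddress

# Sketch of the proposed location-based addressing, assuming the 10.220.0.0/18
# space from the example above and pods A/B/C mapping to digits 1/2/3.
MGMT_SPACE = ipaddress.ip_network("10.220.0.0/18")
POD_DIGIT = {"A": 1, "B": 2, "C": 3}

def mgmt_address(row: int, pod: str, rack: int, switch_id: int) -> ipaddress.IPv4Address:
    """Derive a switch's management IP from its physical location.

    Example: row 1, pod C, rack C04, switch 1 -> 10.220.13.41
    (third octet = row digit then pod digit; last octet = rack*10 + switch id).
    """
    third_octet = row * 10 + POD_DIGIT[pod.upper()]
    # Note: packing rack*10 + switch_id into one octet only works while the
    # result stays below 256 (i.e. low rack numbers / few switches per rack).
    last_octet = rack * 10 + switch_id
    addr = ipaddress.IPv4Address(f"10.220.{third_octet}.{last_octet}")
    if addr not in MGMT_SPACE:
        raise ValueError(f"{addr} falls outside {MGMT_SPACE}")
    return addr

if __name__ == "__main__":
    # R1-PC-C04, first switch in the rack
    print(mgmt_address(row=1, pod="C", rack=4, switch_id=1))  # 10.220.13.41
```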
What would we need?
A distribution switch for R3 (we may have one lying around, but it would have to be a Dell to pair with the others); otherwise, this switch is ~$800.
For now, that's all we need. We have thousands of feet of bulk CAT6 for the new cables; all we need is the patience to terminate them all.
Other Considerations
Security
Putting everything on a single network raises security concerns. The switches themselves should use public-key SSH authentication wherever possible. The network itself should not be routable from the outside world and should only be reachable from a VPN or via the OCT head node.
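As a rough illustration of the head-node-as-only-entry-point idea, here is a minimal sketch (assuming Python with the paramiko library) of reaching a switch by tunnelling SSH through the head node with key-based authentication. The hostname, username, and key path are placeholders, not existing infrastructure, and host-key handling is simplified for brevity.

```python
import os
import paramiko

HEAD_NODE = "oct-head.example.org"   # placeholder head node address
SWITCH = "10.220.13.41"              # a management address from the scheme above
KEY_FILE = os.path.expanduser("~/.ssh/id_ed25519")  # public-key auth, no passwords

def connect_via_head_node(switch_addr: str) -> paramiko.SSHClient:
    """Open an SSH session to a switch, tunnelled through the OCT head node."""
    jump = paramiko.SSHClient()
    # In production, load and verify known host keys instead of auto-adding.
    jump.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    jump.connect(HEAD_NODE, username="admin", key_filename=KEY_FILE)

    # Open a direct-tcpip channel from the head node to the switch's SSH port,
    # so the switch itself never needs to be reachable from outside the network.
    channel = jump.get_transport().open_channel("direct-tcpip", (switch_addr, 22), ("", 0))

    target = paramiko.SSHClient()
    target.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    target.connect(switch_addr, username="admin", key_filename=KEY_FILE, sock=channel)
    return target

if __name__ == "__main__":
    client = connect_via_head_node(SWITCH)
    _, stdout, _ = client.exec_command("show version")
    print(stdout.read().decode())
    client.close()
```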
Collaboration
This OCT network is used by MOC, NERC, NET2, CloudLab, ESI, Chameleon, Operate First, Fabric, NESE, AL2S, the Unity Cluster (UMass), and probably more. We would need to properly document the use of the Ansible playbooks and the management of the network in general for the many teams involved.
In addition, the UMass Amherst networking team has an interest in taking part in the management of this infrastructure; they have their own SolarWinds setup for their switches, which are all Juniper.
Routing for Others
Many clusters that use this network have their own networks for managing their switches, and that can continue normally over the data fabric. However, we could institute a routing system where clusters are granted access to certain sections of the management network through a router.
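A minimal sketch of what "certain sections" could mean, assuming the location-based addressing above: each cluster would be routed only the per-row/per-pod subnets where its gear lives. The cluster-to-subnet mapping below is purely hypothetical.

```python
import ipaddress

MGMT_SPACE = ipaddress.ip_network("10.220.0.0/18")

# Hypothetical delegations: each cluster's router is only granted the
# management subnets for the pods it actually has equipment in.
CLUSTER_ROUTES = {
    "cloudlab": [ipaddress.ip_network("10.220.13.0/24")],  # row 1, pod C
    "nese":     [ipaddress.ip_network("10.220.61.0/24")],  # row 6, pod A
}

def allowed(cluster: str, switch_addr: str) -> bool:
    """Return True if the cluster should be able to reach switch_addr."""
    addr = ipaddress.ip_address(switch_addr)
    return addr in MGMT_SPACE and any(addr in net for net in CLUSTER_ROUTES.get(cluster, []))

if __name__ == "__main__":
    print(allowed("cloudlab", "10.220.13.41"))  # True: delegated row 1, pod C subnet
    print(allowed("cloudlab", "10.220.61.5"))   # False: not delegated to CloudLab
```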