Closed hakasapl closed 3 weeks ago
Initial plan is covered in this diagram:
Updated date estimates:
There are 3 diagrams. The first is the MOCA network as it stands today. The second is the MOCA network with the NERC core switches once they are fully set up. The third shows only the H100s and how they connect back to the main diagram.
Some context for the existing network:
The existing MOCA network is split into 3 distinct "islands", where each island belongs to an entity, such as NERC. Because the MOC Alliance is involved in several projects, this approach has kept the largely L2 network organized as it grows. Each island is its own spanning tree instance with its own core. The cores connect to each other to carry any traffic that needs to traverse islands. This is designed as a hub-and-spoke topology. The H100 deployment is planned as a spine-leaf topology, which was raised as a requirement to improve inter-rack bandwidth between GPUs.
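To illustrate why a spine-leaf design helps inter-rack bandwidth, here is a minimal Python sketch that models a generic spine-leaf fabric and counts the equal-length shortest paths between two leaves. The switch counts (4 spines, 6 leaves) and the node names are hypothetical and are not taken from the H100 plan. The sketch assumes a textbook fabric in which every leaf links to every spine.

```python
from collections import deque


def build_spine_leaf(num_spines, num_leaves):
    """Return an adjacency dict in which every leaf links to every spine.

    Node names ("spine0", "leaf0", ...) are hypothetical labels.
    """
    spines = [f"spine{i}" for i in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    graph = {node: set() for node in spines + leaves}
    for leaf in leaves:
        for spine in spines:
            graph[leaf].add(spine)
            graph[spine].add(leaf)
    return graph


def shortest_path_count(graph, src, dst):
    """Return (hop count, number of distinct shortest paths) via BFS."""
    dist = {src: 0}
    paths = {src: 1}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                # First time we reach nbr: it inherits node's path count.
                dist[nbr] = dist[node] + 1
                paths[nbr] = paths[node]
                queue.append(nbr)
            elif dist[nbr] == dist[node] + 1:
                # Another equally short route into nbr.
                paths[nbr] += paths[node]
    return dist[dst], paths[dst]


# Assumed example sizes: 4 spines, 6 leaves.
fabric = build_spine_leaf(4, 6)
hops, path_count = shortest_path_count(fabric, "leaf0", "leaf5")
# hops == 2 and path_count == 4: one two-hop path per spine.
```

In this model any two leaves are exactly two hops apart, with one path per spine, so inter-rack capacity grows with the number of spines. In a hub-and-spoke layout, traffic between two access switches has a single path through the island core.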
To fit this topology into our existing network, the plan is to treat the NERC core switches as border leafs in the H100 spine-leaf topology. This way the H100 network can behave like a spine-leaf fabric while still interacting with the existing network. The main limitation of this approach is bandwidth to the core, which currently sits at 400G. This should be enough initially because the NESE (Northeast Storage Exchange) storage being offered externally does not consume very much bandwidth. It may become an issue later if new storage offerings consume more throughput, although at that point it is also possible to add any new external storage directly to the H100 spine-leaf topology.
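The bandwidth concern above can be sanity-checked with simple arithmetic. The sketch below is a rough Python helper that sums estimated external traffic and compares it with the 400G core link. Only the 400G figure comes from this issue. The function name, the service names, and the demand figures are hypothetical placeholders, not measurements.

```python
def uplink_headroom(demands_gbps, uplink_gbps=400.0):
    """Summarize how much of the core uplink a set of flows would use.

    demands_gbps: dict mapping a (hypothetical) service name to its
    estimated peak traffic toward the core, in Gbit/s.
    uplink_gbps: capacity of the link to the core (400G per this issue).
    """
    total = sum(demands_gbps.values())
    return {
        "total_gbps": total,
        "utilization": total / uplink_gbps,
        "headroom_gbps": uplink_gbps - total,
    }


# Placeholder estimates only; replace with measured values.
estimate = uplink_headroom({"nese_storage": 100, "other_external": 50})
# estimate["utilization"] == 0.375 and estimate["headroom_gbps"] == 250.0
```

A negative `headroom_gbps` would indicate that the estimated traffic exceeds the uplink, which is the point where attaching new external storage directly to the spine-leaf fabric becomes the more attractive option.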
@hakasapl is the plan developed or is there more needed for this issue?
I would like to complete the design review before closing this. @hpdempsey is scheduling it.
Design review is scheduled for 9AM EDT Thurs 10/3.
Design review was completed. Open questions from the review:
As part of the planning with Lenovo we need to confirm the bill of materials with them. This involves confirming the plan for networking.