Broader review and revision of ESI Network straw man proposal

msdisme commented 1 year ago

As the MOC alliance grows, and ESI approaches widespread deployment, a major prerequisite we need is forming a shared network at MGHPCC. Understanding that there are different network engineers responsible for different hardware at MGHPCC, we need to identify a network architecture that will work for a shared system like the one we are forming.

Straw-man proposal for ESI shared network - https://docs.google.com/document/d/1XkIHLCTTzF5Hbejln-o9UBScO_1ip5XqXD7ZWIA0wxo/edit?usp=sharing and give feedback

msdisme commented 1 year ago

From @okrieg on Nov 24: I read over this: https://docs.google.com/document/d/1XkIHLCTTzF5Hbejln-o9UBScO_1ip5XqXD7ZWIA0wxo/edit

Sorry to take me so long; am insecure/intimidated whenever I have to think about networking since my knowledge has been gained experientially rather than by really understanding it. I don’t think I get it enough to give detailed comments, and if nothing else that points out how we need something like this to get on a common set of assumptions/understanding. Thanks for starting on this.

Would it make sense to meet one-on-one for an hour, so I can ask lots of dumb questions and so the two of us can either get to the same set of assumptions that we can document for everyone else, or actually find we have different ones, which would be good to put down.

Feels like it might be good to get, in addition to our team (Naved, Lars…) also john Goodhue and Heidi/Mike. The networking guys at Harvard and BU tend to be religious about L3, and I want to make sure we have common perspective before adding them.

Let me jot down a few of my random thoughts, that its not clear are consistent with what you have. I am asserting these, but, doesn’t mean I am religious, happy to be wrong…

Each institution will continue to have substantial infrastructure that they fully control all networking of. I imagine we would have a high speed boundary switch (BS appropriately) that could do VLAN mapping that each of the institutional networks, and the ESI frabric is connected to. It is under full control of institution if any VLANs, including whatever public factings ones, are exposed to BS (I really like that name) If an institutional cluster wants to get more computers from ESI, they would: ask their institution to expose the VLANs they are using to BS allocate nodes and VLANs from ESI tell BS to map the internal ESI VLAN to the institutional VLAN. The mapping would be infrequent, and could be manual, since the decision of a cluster to choose to use ESI or not is a strategic one. If there is a conflict where two institutions are using the same VLAN that they both want to expose to BS, then they would need to do another layer of mapping on their side. There may be multiple ESI instances, e.g., for fault tolerance, or because one is HIPAA complaint, if so, they would both be connected to BS; below talking about ESI as a single thing though. ESI has a set of switches connected to each other by physical cables that we have strung between them, including both the uplinks between them, and the management network to the switches, and the management network for OBM… All VLANs go to all the different TOR switches, and ESI is basically managing the TOR switches to expose specific VLANs to ports of specific hosts The management network for OBM for all the hosts is a flat unmanaged network, which ESI talks to PXE booting goes over the data network,

Having written this down, now re-read your doc, and I think I get why I didn’t understand it first reads, and where the disconnect is. You are not talking about the mapping to institutional networks, or the BS between them; I was thinking the document was talking about shared network in general, but this is the ESI internal network. Also, I think you are thinking about multiple projects/zones/responsibilities within an ESI, and I am really allergic to that.

Anyhow, should be around working Friday and weekend, and a bit this morning. Feel free to reach out on slack.

msdisme commented 1 year ago

From @hakasapl on Nov 24, 2022: Probably captured in next commment, but leaving in for now Hi,

I agree, it would be great to assert some common points on the doc as well.

Happy to meet one on one, what times are you available on Monday?

Some of my thoughts after reading your points:

I really like the “boundary switch”/BS name as well, and I’ll integrate that into the doc
- Concerning the BS, we already have 4x core switches deployed in OCT that we can make use of. All of them are 100G switches.
The part I’m not 100% confident on is the VLAN translation. If we want to stick to L2 across all the ESI tenants, we at some point will need to do some VLAN translation on the BS switches. Our Dell Z9100-ON switches support QinQ/VLAN stacking, so that’s certainly doable, but I’m not 100% sure if there are any performance issues to note. The other option is to have a master ESI vlan list and any VLANs exposed to the BS switches should not have conflicts.
Because of how I initially set up OCT, the test ESI instance is already connected to the BS switches, and this model is how we have connected BU/MOC to these switches. It’s now NESE, BU, UMass, and MIT CSAIL can talk to each other in the context of the MOC alliance currently, so the thing to figure out is more the policy, since the actual configuration has already been done and proven to work.
The OBM network can go over the data plane as well through the BS switches. Traditionally, I’d prefer OBM to be a discrete fabric separate from the data plane, and we can certainly do that, but we’d need to include a BS switch which is 1G for the OBM stuff. Because of the resiliency of the current BS switch set up (4x switches, any 2 can go completely offline without anything going offline, just the bandwidth halves).

In general, I’m trying to avoid central management of all the tenant’s individual networks from one place. So, in my opinion multiple responsibilities and zones within the larger ESI network makes sense. This will make it much easier for tenants to be a part of ESI without worrying about giving other admins access to the control plane of their networks, and assuming ESI will be used outside of the MOC alliance at some point, it would be valuable to define a model where tenants keep control over their networks, and they only allow the ESI controller to set TORS VLANs. I’m happy to deliberate that more, I’m not 100% sold on that design theory.

Let me know about your availability and Happy Thanksgiving!

Hakan

msdisme commented 1 year ago

From @okrieg Nov 25, 2022: Embedded:

Hi,

I agree, it would be great to assert some common points on the doc as well.

Happy to meet one on one, what times are you available on Monday?

Some of my thoughts after reading your points:

I really like the “boundary switch”/BS name as well, and I’ll integrate that into the doc
- Concerning the BS, we already have 4x core switches deployed in OCT that we can make use of. All of them are 100G switches.
The part I’m not 100% confident on is the VLAN translation. If we want to stick to L2 across all the ESI tenants, we at some point will need to do some VLAN translation on the BS switches. Our Dell Z9100-ON switches support QinQ/VLAN stacking, so that’s certainly doable, but I’m not 100% sure if there are any performance issues to note. The other option is to have a master ESI vlan list and any VLANs exposed to the BS switches should not have conflicts.

I thought QinQ adds another layter of tagging, so will cost something, also, it doesn’t seem what we need, i.e., we want to use specific VLANs at one side and different VLANs at the other. I was assuming VLAN translation like this, that I assume would be free. IS this just a CISCO feature? Heidi mentioned to me it is pretty common, and getting one switch that does this seems worthwhile for our edge. I guess the challenge with VLAN translation is that the user if they want to use tagged ports, it will be visible to the OS that the VLANs inside the ESI side are different from the VLANs on the institutional side, but… that seems worth asking if it is a problem.

Because of how I initially set up OCT, the test ESI instance is already connected to the BS switches, and this model is how we have connected BU/MOC to these switches. It’s now NESE, BU, UMass, and MIT CSAIL can talk to each other in the context of the MOC alliance currently, so the thing to figure out is more the policy, since the actual configuration has already been done and proven to work.

Yes, what you have done works great for a set of collaborating/green field projects. However, if we want to allow projects that already exist, e.g., HPC clusters at BU…. To use this on-demand, I think we need to accommodate some mapping to their existing VLANs.

The OBM network can go over the data plane as well through the BS switches. Traditionally, I’d prefer OBM to be a discrete fabric separate from the data plane, and we can certainly do that, but we’d need to include a BS switch which is 1G for the OBM stuff. Because of the resiliency of the current BS switch set up (4x switches, any 2 can go completely offline without anything going offline, just the bandwidth halves).

Not sure I understand what the question/statement is?

In general, I’m trying to avoid central management of all the tenant’s individual networks from one place. So, in my opinion multiple responsibilities and zones within the larger ESI network makes sense. This will make it much easier for tenants to be a part of ESI without worrying about giving other admins access to the control plane of their networks, and assuming ESI will be used outside of the MOC alliance at some point, it would be valuable to define a model where tenants keep control over their networks, and they only allow the ESI controller to set TORS VLANs. I’m happy to deliberate that more, I’m not 100% sold on that design theory.

Don’t get it. My intuition goes the other way, but we might have a different vision on what a tenant/provider is. Here is the way I think about it. We have been prototyping ESI as a control system for machines connected over L2 networking that we have stiched together as we could. Longer term, I view ESI as being the L2 control plane for a group of tightly coupled machines in a single POD. If different insitutions want to loan out machines, they will create their own ESI PODs. No matter what we do to avoid single points of failure, a configuration bug, or software error, can bring down ESI, and we will have a single set of administrators for it; so I don’t view it long term as being something where an institution will just add a rack to an existing ESI. Instead, my thinking is that the institution will have another ESI, with a few racks in it. I get the reason why you are proposing what you are proposing, i.e., in your model multiple entities can just add a rack to ESI and have it overlayed over their existing networking so ESI just controls the TOR. However, I am tempted to go to the simplest solution possible, us directly wiring racks together and they are part of the ESI world… if we need to support dicontiguous ESI racks. That way, some mistake by a participant doesn’t screw up ESI, and there is no chance for ESI to screw up their internal institutional networking.

Hmmm, now that I get what you are thinking, I like it, and I see how it could work and would provide a more natural path to ESI growth than what I was/am thinking, but, my gut feeling is that we should not do it, but instead have a model of a strict boundary between ESI and other networking with BS isolating things, and if needed multiple ESI PODs. My gut is that there are two many failure modes in what you are thinking where a participant error will bring down all of ESI, or where the ESI admin error could bring down tenant, since you are stiching ESI over the tenant fabrics.

Also, long term, what I see is new networking designs giving super efficient L2 networking between a group of tightly coupled machines; lets call that a POD. There are new synchronous networks from google… I see that same grouping as being a natural failure domain, where we guarantee no correlated failures across PODs. So, user allocate machines on different PODs to be resilient. Then, I see a unit like that has being natural for scalability, since at the end L2 doesn’t scale. Finally, a unit like a POD is a natural boundary for an insittuiton to operate. So, I see ESI as a management unit for a group of closely coupled co-located machines, and users allocate machines from different PODs and stich them together by either VLAN mapping or, by just doing L3 on top of the L2 within POD.

Oops, sorry for this being so long, you made me think 😊

Let me know about your availability and Happy Thanksgiving!

Hakan

msdisme commented 1 year ago

@hakasapl is the doc here: https://docs.google.com/document/d/1XkIHLCTTzF5Hbejln-o9UBScO_1ip5XqXD7ZWIA0wxo/edit ready for new review?

hakasapl commented 1 year ago

Yes, this is ready for initial review

hakasapl commented 1 year ago

I will be seeking feedback from the ESI development team.

hakasapl commented 1 year ago

Yes, there are comments that I have replied to on the doc from the ESI team

msdisme commented 1 year ago

Working on MTC hardware requirement to begin gathering quotes. @syockel @jtriley please review the straw man this week if possible (NLT Jun 17 2023). Link to doc is: https://docs.google.com/document/d/1XkIHLCTTzF5Hbejln-o9UBScO_1ip5XqXD7ZWIA0wxo/edit?usp=sharing

msdisme commented 8 months ago

@hakasapl I think this is being implemented - ok to close?

hakasapl commented 8 months ago

Yes, I'll close

CCI-MOC / ops-issues

Broader review and revision of ESI Network straw man proposal #740