CCI-MOC / ops-issues


Install new FPGA machines in R3-PB-C05 #1166

Closed hakasapl closed 6 months ago

hakasapl commented 11 months ago

- 9x HP DL380 Gen 10
- 2x Dell R740

These nodes are located in the rack labeled "ITR-4" in the back of the machine room. This will not be a rack swap; each node will be installed into the existing R3-PB-C05 rack, which is keyed for the MOC keyring.

These will need to have FPGAs installed in them. There are 13 FPGAs total, but we can only install 11 of them for now. Each FPGA requires its own ATX power connector. The nodes listed above have GPUs in them already. If it's possible to run both the FPGA and the GPU in a single node, we'd prefer to keep the GPU. Otherwise, replace the GPU with the FPGA. (Let me know if we need to purchase ATX power connectors for the HPs in order to run both; we already have Dell ATX power connectors.) The FPGAs should be installed as follows:

The 13x FPGAs (only installing 11x for now) are located in the MOC cage. They are the boxes sitting above the plastic drawers (some overflow onto the top shelf, but the boxes all look the same size).

In R3-PB-C05, start with the HPs from U1, then put the Dells on top of the HPs.

cc: @imstof

imstof commented 11 months ago

@hakasapl Just starting to scout this out: There are no Dell nodes in ITR-4. The 9 HP boxes are the only things in there.

hakasapl commented 11 months ago

@imstof apologies, the 2x dell r740s will be in the rack labeled FXMOC-2

hakasapl commented 11 months ago

@imstof also note that FLAX put these in the racks back there without rails, so you might have to move the ones above first for the dells

imstof commented 11 months ago

@hakasapl Looks like there are not 3x V70 FPGAs in there. There are:

- 4x VCK5000
- 8x U280
- 1x V70

hakasapl commented 11 months ago

@imstof Ah, okay, I think we're expecting more V70s but they haven't arrived yet. In that case can you install 4x VCK5000s in the HPs, 5x U280s in the HPs, and 2x U280s in the dells?
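The counts in this plan can be sanity-checked with a few lines of Python. This is only a sketch of the arithmetic, not part of the deployment: the `summarize` helper and the dictionary layout are hypothetical, the U280 count assumes the 8 cards in the inventory comment above are Alveo U280s, and it assumes one FPGA per host.

```python
from collections import Counter

# Inventory reported from the MOC cage (assumption: the 8 cards are Alveo U280s).
inventory = Counter({"VCK5000": 4, "U280": 8, "V70": 1})

# Proposed placement: (host type, FPGA model) -> number of cards.
plan = {
    ("HP DL380", "VCK5000"): 4,
    ("HP DL380", "U280"): 5,
    ("Dell R740", "U280"): 2,
}

# Hosts available per type (assumption: one FPGA per host).
hosts = {"HP DL380": 9, "Dell R740": 2}


def summarize(inventory, plan, hosts):
    """Return (cards installed, leftover cards per model) or raise if the plan overcommits."""
    used_by_model = Counter()
    used_by_host = Counter()
    for (host, model), count in plan.items():
        used_by_model[model] += count
        used_by_host[host] += count

    for model, count in used_by_model.items():
        if count > inventory[model]:
            raise ValueError(f"plan uses {count}x {model}, only {inventory[model]} on hand")
    for host, count in used_by_host.items():
        if count > hosts.get(host, 0):
            raise ValueError(f"plan puts {count} cards in {hosts.get(host, 0)}x {host}")

    leftover = {model: inventory[model] - used_by_model[model] for model in inventory}
    return sum(used_by_model.values()), leftover


installed, leftover = summarize(inventory, plan, hosts)
print(installed, leftover)
```

Under those assumptions the plan installs 11 of the 13 cards, filling all 9 HPs and both Dells, and leaves one U280 and the single V70 in the cage.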

imstof commented 11 months ago

@hakasapl I got around to installing the FPGAs today, and there are problems with the power connections.

- The VCK5000s have hardwired cables with 1x 8-pin and 1x 6-pin connectors.
- The U280s have a port for an 8-pin connection, but it is keyed differently than any of the PCI cables in the HP or Dell boxes.
- The HP machines only have a mini 8-pin connection on the riser card.
- The Dell machines have separate cables with full-sized connections, but they are keyed differently than the FPGA connections.

I don't see anywhere on the system boards or PSUs to add additional power harnesses.

- The VCK5000 cables and HP mini-connection: [image: vck5000_connector]
- The U280 connection: [image: u280_connector]
- The Dell connections: [image: dell_pci_connector]
- HP system board with additional mini 10-pin connection: [image: HP_mobo]

hakasapl commented 11 months ago

@imstof I'll be at MGHPCC this friday. We may have some cables that work with the U280s in the cage that I can pull out for the dell machines at least. The HPs are a first for us so I'll have to look more into it and get back to you. Thanks for your help!

imstof commented 11 months ago

@hakasapl I won't be here tomorrow on account of Veteran's Day. I've left things laid out at the back of the machine room. On a table is one hp server, the 2 dell servers, and one of each fpga type. The hp's riser card and cable are beside the machine.

imstof commented 11 months ago

@hakasapl Were you able to get a look at this? Should I go ahead with installing cards and racking servers, and cable them later, or would you rather ensure that we have proper cables first? Also, are there any rails available for the Dell boxes, or are we just stacking them on top of the HPs?

hakasapl commented 11 months ago

I didn't end up going on Friday as I forgot it's a holiday for me as well. I'll be out there this week, if you could rack them in place and just leave the fpgas in the moc cage for now that would be great!

joachimweyl commented 11 months ago

@hakasapl Was this racked and put in place? Are you getting out to MGHPCC today? If not, let's push this out to the next sprint.

hakasapl commented 11 months ago

Machines are installed in place, but the FPGAs are not yet installed. I will extend this to next sprint

hakasapl commented 10 months ago

We have ordered a cable to try with the HP machines. @imstof I'll let you know once I've tried this and turn it back over to you. For now this is blocked on me.

joachimweyl commented 10 months ago

@hakasapl Is this still blocked on you? Do we need to push this out to the next sprint?

hakasapl commented 10 months ago

I will do some work on it today but it will need to get pushed.

hakasapl commented 10 months ago

We have found a cable that works with Alveo U280s and the HP DL380 machines. The part number is 869821-001, I've requested to order 9x more. When they come in I will let you know @imstof, and we can complete the installation. In addition, I've left two sets of Dell rails in that cabinet for the dell servers that are not on rails.

hakasapl commented 10 months ago

I spoke with @imstof today. Cables have come in. We are ready to deploy:

- U280s in HPs with the cables that came in today
- VCKs in the Dells (see if both the V100 GPU and the FPGA can fit together. If so, we'll look into power for running them both)

imstof commented 10 months ago

U280s are in place. There is another problem with the VCK5000s. Although they fit length-wise in the Dell boxes, the odd way the wires come out of the bottom of the housing causes a bulge that won't allow the lid to go back on the machine. I tried both risers, but there doesn't seem to be a way to squeeze the cabling in. (I suppose one could remove the plastic CPU air-flow shield, but I think that will likely lead to overheating.) @hakasapl I can show you when next we're both on site.

hakasapl commented 10 months ago

@imstof you can just remove the active cooler from that card. We got guidance from AMD that if the server has its own airflow we don't need the additional fan. When that is removed it should fit? No worries if this doesn't happen until the new year. Happy holidays!

imstof commented 10 months ago

@hakasapl that will shorten the length, but I'm not sure if that will allow the card to settle any deeper. It is the width where the cable bundle comes out of the card that was being an issue. I'll be back on site after the new year and will give it a try.

hakasapl commented 9 months ago

@imstof did you get a chance to try that? No worries if it doesn't fit, I'll look into a different solution.

imstof commented 9 months ago

@hakasapl sorry for late reply. that looks like it should work. I'll tinker with getting the fan off of one of the units and confirm.

hakasapl commented 9 months ago

Note for Hakan: Check with loading dock for new servers on Thursday

hakasapl commented 8 months ago

@imstof the remaining nodes that house fpgas are in the loading dock. Can you grab them and install them in the same rack when you have time? They should be 8x Dell R760s

[image: IMG_1680]

imstof commented 8 months ago

Looks like I neglected to submit the update I wrote out last Friday. We are here:

- Nodes are racked
- 2x VCK5000 installed, but cables are incorrect for the motherboard (see image)
- 4x V70 installed

I noticed a major issue when looking at cabling. The PSUs on all of these Dell machines use C19/C20 connectors. The PDUs in that rack have no C19/C20.

Re: VCK5000 cables. The cables we have are correct for the FPGA, but the motherboard on these Dell boxes does not use Molex. The image shows the motherboard connection and a sample of a cable that came with the motherboard for a standard GPU. [image: 20240215_143742]

hakasapl commented 7 months ago

Thanks @imstof - can you work on cabling the existing nodes (not the new ones) for now, but leave out the power cables? I will figure out if we can swap the PDU.

https://github.com/CCI-MOC/ops-issues/issues/1251

joachimweyl commented 6 months ago

@hakasapl what are the next steps for this issue?

hakasapl commented 6 months ago

I have contacted dell support for an internal pcie power cable that works for the FPGAs. They are trying to figure it out, so I don't consider these installed until those are available as well.

hakasapl commented 6 months ago

Let's move the 2x A100 and 2x VCK FPGAs in each R740xd up to one of the R760s (still waiting on power cables, but the one included with the R760 should work for the GPU at least)

U280s can replace the FPGAs in the R740xds

hakasapl commented 6 months ago

The initial round of FPGAs is installed. I will open another issue down the road for the hardware we are still looking for cables for.