Closed: @hakasapl closed this issue 6 months ago
@hakasapl Just starting to scout this out: There are no Dell nodes in ITR-4. The 9 HP boxes are the only things in there.
@imstof apologies, the 2x Dell R740s will be in the rack labeled FXMOC-2
@imstof also note that FLAX put these in the racks back there without rails, so you might have to move the ones above first for the Dells
@hakasapl Looks like there are not 3x V70 FPGAs in there. There are: 4x VCK5000, 8x U280, 1x V70
@imstof Ah, okay, I think we're expecting more V70s but they haven't arrived yet. In that case can you install 4x VCK5000s in the HPs, 5x U280s in the HPs, and 2x U280s in the Dells?
@hakasapl I got around to installing the FPGAs today, and there are problems with the power connections. The VCK5000s have hardwired cables with 1x 8-pin and 1x 6-pin connectors. The U280s have a port for an 8-pin connection, but it is keyed differently than any of the PCIe cables in the HP or Dell boxes. The HP machines only have a mini 8-pin connection on the riser card. The Dell machines have separate cables with full-sized connectors, but keyed differently than the FPGA connections. I don't see anywhere on the system boards or PSUs to add additional power harnesses.
(Images: the VCK5000 cables and HP mini-connection; the U280 connection; the Dell connections; the HP system board with an additional mini 10-pin connection.)
@imstof I'll be at MGHPCC this friday. We may have some cables that work with the U280s in the cage that I can pull out for the dell machines at least. The HPs are a first for us so I'll have to look more into it and get back to you. Thanks for your help!
@hakasapl I won't be here tomorrow on account of Veterans Day. I've left things laid out at the back of the machine room. On a table are one HP server, the 2 Dell servers, and one of each FPGA type. The HP's riser card and cable are beside the machine.
@hakasapl Were you able to get a look at this? Should I go ahead with installing cards and racking servers so we can cable them later, or would you rather ensure that we have proper cables first? Also, are there any rails available for the Dell boxes, or are we just stacking them on top of the HPs?
I didn't end up going on Friday as I forgot it's a holiday for me as well. I'll be out there this week; if you could rack them in place and just leave the FPGAs in the MOC cage for now, that would be great!
@hakasapl Was this racked and in place? Are you getting out to MGHPCC today? If not, let's push this out to the next sprint.
Machines are installed in place, but the FPGAs are not yet installed. I will extend this to next sprint
We have ordered a cable to try with the HP machines. @imstof I'll let you know once I've tried this and turn it back over to you. For now this is blocked on me.
@hakasapl is this still blocked on you, do we need to push this out to the next sprint?
I will do some work on it today but it will need to get pushed.
We have found a cable that works with Alveo U280s and the HP DL380 machines. The part number is 869821-001, I've requested to order 9x more. When they come in I will let you know @imstof, and we can complete the installation. In addition, I've left two sets of Dell rails in that cabinet for the dell servers that are not on rails.
I spoke with @imstof today. Cables have come in. We are ready to deploy:
- U280s in the HPs with the cables that came in today
- VCK5000s in the Dells (see if both the V100 GPU and the FPGA can fit together; if so, we'll look into power for running them both)
U280s are in place. There is another problem with the VCK5000s: although they fit length-wise in the Dell boxes, the odd way the wires come out of the bottom of the housing causes a bulge which won't allow the lid to go back on the machine. I tried both risers, but there doesn't seem to be a way to squeeze the cabling in. (I suppose one could remove the plastic CPU air-flow shield, but I suspect that will likely lead to overheating.) @hakasapl I can show you when next we're both on site.
@imstof you can just remove the active cooler from that card. We got guidance from AMD that if the server has its own airflow we don't need the additional fan. Once that is removed it should fit. No worries if this doesn't happen until the new year. Happy holidays!
@hakasapl that will shorten the length, but I'm not sure it will allow the card to settle any deeper. It is the width where the cable bundle comes out of the card that is the issue. I'll be back on site after the new year and will give it a try.
@imstof did you get a chance to try that? No worries if it doesn't fit, I'll look into a different solution.
@hakasapl Sorry for the late reply. That looks like it should work. I'll tinker with getting the fan off one of the units and confirm.
Note for Hakan: Check with loading dock for new servers on Thursday
@imstof the remaining nodes that house fpgas are in the loading dock. Can you grab them and install them in the same rack when you have time? They should be 8x Dell R760s
Looks like I neglected to submit the update I wrote out last Friday. Where we are:
- Nodes are racked
- 2x VCK5000 installed, but the cables are incorrect for the motherboard (see image)
- 4x V70 installed
I also noticed a major issue when looking at cabling: all of these Dell machines' PSUs use C19/C20 connectors, and the PDUs in that rack have no C19/C20 outlets.
Re: VCK5000 cables. The cables we have are correct for the FPGA, but the motherboard on these Dell boxes does not use Molex. The image shows the motherboard connection and a sample of a cable that came with the motherboard for a standard GPU.
Thanks @imstof - can you work on cabling the existing nodes (not the new ones) for now, but leave out the power cables? I will figure out if we can swap the PDU.
@hakasapl what are the next steps for this issue?
I have contacted dell support for an internal pcie power cable that works for the FPGAs. They are trying to figure it out, so I don't consider these installed until those are available as well.
Let's move the 2x A100 and 2x VCK FPGAs in each R740xd up to one of the R760s (still waiting on power cables, but the one included with the R760 should work for the GPU at least)
U280s can replace the FPGAs in the R740xds
Initial round of FPGAs are installed, I will open another issue down the road for the stuff we are looking for cables for
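Once cards are seated, a quick way to confirm each node actually sees them is to count the Xilinx devices on the PCIe bus. This is a hypothetical sanity check, not part of any tooling mentioned in this thread: it assumes `pciutils` (`lspci`) is installed on the nodes, and the `count_fpgas` helper name is mine.

```shell
# Hypothetical post-install sanity check (not from this thread's tooling).
# Alveo U280 / VCK5000 / V70 cards typically enumerate with "Xilinx" in
# lspci output, so counting those lines gives a per-node FPGA tally.
count_fpgas() {
  # Reads lspci-style output on stdin and prints the number of Xilinx devices.
  grep -ci 'xilinx' || true  # "|| true": zero matches shouldn't abort a script
}

# On a live node you would run:
#   lspci | count_fpgas
```

Comparing the printed count against the expected number per node (e.g. 2 for a Dell with two U280s) catches cards that are powered but not seated, before closing out the install.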
- 9x HP DL380 Gen 10
- 2x Dell R740
These nodes are located in the rack labeled "ITR-4" in the back of the machine room. This will not be a rack swap; each node will be installed into the existing R3-PB-C05 rack, which is keyed for the MOC keyring.
These will need to have FPGAs installed in them. There are 13 FPGAs total, but we can only install 11 of them for now. Each FPGA requires its own ATX power connector. The nodes listed above have GPUs in them already. If it's possible to run both the FPGA and the GPU in a single node, we'd prefer to keep the GPU. Otherwise, replace the GPU with the FPGA (Let me know if we need to purchase ATX power connectors if running both for the HPs, we already have Dell ATX power connectors). The FPGAs should be installed as follows:
The 13x FPGAs (only installing 11x for now) are located in the MOC cage. They are the boxes sitting above the plastic drawers (some of them overflowing to the top shelf, but they look the same size).
In R3-PB-C05, start with the HPs from U1, then the Dells on top of the HPs
cc: @imstof