CCI-MOC / ops-issues


Prepare Lenovo GPU management server to become a node in our cluster #1170

Open msdisme opened 10 months ago

msdisme commented 10 months ago

The management node for the Watercooled GPUs is already at the MGHPCC:

| ThinkSystem SR250 V2 | 7D7QCTO1WW | 10/16 | Shipped 10/10   1Z4E3W210337764423 |

From the Lenovo website above: 2x RJ45 Gigabit Ethernet ports, 1x 1GbE dedicated XCC port for remote management.

joachimweyl commented 10 months ago

@jtriley any update on a location for this server?

joachimweyl commented 10 months ago

2-B-20 is a possibility, looking into it.

joachimweyl commented 9 months ago

@aabaris I assigned you as it sounds like you will be looking into this tomorrow while you are at MGHPCC.

aabaris commented 9 months ago
msdisme commented 9 months ago

@jtriley should we also have a task for copying an image of the system?

msdisme commented 9 months ago

Feedback from Lenovo: they do need this set up and accessible. Please don't delete what is on there yet.

msdisme commented 9 months ago

Lenovo GPU delivery date is January 20; work is going on to appeal and make it sooner.

joachimweyl commented 8 months ago

@aabaris Do we know what cables are needed to connect this?

aabaris commented 8 months ago

> @aabaris Do we know what cables are needed to connect this?

Without the exact serial number, but going by the model number (7D7QCTO1WW), we can expect 3 x RJ45 1Gb connections:

- 1x MGMT (XCC) port, CAT5e cable
- 2x 1Gb/s data ports, CAT5e cables

According to my notes, there are 3 switches in r2-pb-c20:

- 2x MGHPCC-RC-2-B-20-SW1A+B (Cisco Nexus 9000 C93180YC-EX): these only have SFP+ interfaces, which are not compatible with RJ45/CAT5e connections without a special transceiver.
- 1x MGHPCC-RC-2-B-20-SW2 (Dell N3048) with 48x RJ45 (8P8C) ports: this switch can be used with a regular CAT5e cable, but we need to make sure it can provide access to all the VLANs we need to connect to.

@jtriley could we ask Nick or Christian whether MGHPCC-RC-2-B-20-SW2 (Dell N3048) can provide all the networking we need to bring this server online? Specifically: this is an OBM connection switch, so will we be able to connect the server's data ports to it?
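The port-matching reasoning above can be sketched as a small lookup. This is only an illustration: the port types are taken from the notes in this comment (not verified against the hardware), and the `Switch` class and `direct_attach_candidates` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Switch:
    name: str
    model: str
    port_type: str  # physical port type, as recorded in the notes above


# Switches in r2-pb-c20, transcribed from the notes above (assumed accurate).
SWITCHES = [
    Switch("MGHPCC-RC-2-B-20-SW1A", "Cisco Nexus 9000 C93180YC-EX", "SFP+"),
    Switch("MGHPCC-RC-2-B-20-SW1B", "Cisco Nexus 9000 C93180YC-EX", "SFP+"),
    Switch("MGHPCC-RC-2-B-20-SW2", "Dell N3048", "RJ45"),
]


def direct_attach_candidates(server_port_type: str) -> list[str]:
    """Return switches whose ports match the server port with no transceiver."""
    return [s.name for s in SWITCHES if s.port_type == server_port_type]


print(direct_attach_candidates("RJ45"))
```

A physical match says nothing about VLAN availability, which is the open question for the networking folks.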

aabaris commented 8 months ago

@msdisme @joachimweyl It would help us to have the actual serial # of this server in order to plan network connections. Any chance you guys have access to it?

aabaris commented 8 months ago

Lenovo Server ThinkSystem SR250 V2, serial # J1058LLP, has been installed in r2-pb-c20 U26

Its management interface is cabled via a CAT5/RJ45 link to switch MGHPCC-RC-2-B-20-SW2 -> port 11, and is reachable via IP address 10.30.0.124.

2x 10Gb data paths are plugged in from the PCI slot 1 Broadcom card to MGHPCC-RC-2-B-20-SW1A -> Ethernet1/19 and MGHPCC-RC-2-B-20-SW1B -> Ethernet1/19 via TwinAX (DAC) cables that I have stolen.

2x PSU ports are plugged into L1-P16 port of both PDUs in that rack.
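For record-keeping, the cabling above could be captured as data and sanity-checked. This is a minimal sketch: the dictionary layout, key names, and the `uses_distinct_switches` helper are hypothetical, and treating the two data paths as a redundant pair is an assumption, not something stated in this issue.

```python
# Cabling for serial J1058LLP in r2-pb-c20 U26, transcribed from this comment.
# The structure and key names are hypothetical, chosen only for illustration.
CABLING = {
    "mgmt": [("MGHPCC-RC-2-B-20-SW2", "port 11")],
    "data": [
        ("MGHPCC-RC-2-B-20-SW1A", "Ethernet1/19"),
        ("MGHPCC-RC-2-B-20-SW1B", "Ethernet1/19"),
    ],
}


def uses_distinct_switches(links):
    """Return True when every link in the group lands on a different switch."""
    switches = [switch for switch, _port in links]
    return len(set(switches)) == len(switches)


# Assumption: the two data paths are intended as a redundant pair.
print(uses_distinct_switches(CABLING["data"]))
```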

aabaris commented 8 months ago

lenvo-mgmt-node

msdisme commented 8 months ago

Sorry Augustine, I missed the earlier request for Serial Number - I did not have it, but will see about getting details for all of the Lenovo HW.

aabaris commented 8 months ago

@msdisme and @joachimweyl

Do we have any additional specifications from Lenovo on how this node should be networked? There are 6 network interfaces to choose from, and several networks they could be attached to. I will need to make requests for VLAN and IP allocations from networking folks via @jtriley as a proxy, but need further clarification on what is expected by Lenovo.

msdisme commented 8 months ago

I will include this question in the email I am sending them (later today or, more likely, tomorrow). Let me know if it needs to be higher priority.

joachimweyl commented 7 months ago

As this is not going to be used as a management server, we can choose what networking is needed. They are letting us use it until something changes in our system and a management server would be useful.

joachimweyl commented 6 months ago

Justin confirmed that there is nothing more we need others to do for this; he will prepare it to be a node for our cluster when he has some downtime.