CCI-MOC / ops-issues

Evaluate if our current hardware can support AMD GPUs #1344

Open msdisme opened 3 weeks ago

msdisme commented 3 weeks ago

AMD plans to provide MI100s initially. https://www.amd.com/en/products/accelerators/instinct/mi100.html

hakasapl commented 2 weeks ago

The MI100s support both PCIe 3.0 and 4.0. Question for AMD (will @hpdempsey relay this?): "Will we lose performance if we run this card off a PCIe 3.0 bus?"
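For context on what is at stake, here is a rough sketch of the theoretical link bandwidth for each PCIe generation. It assumes an x16 link, which is not stated in this thread, and accounts only for line encoding; real-world throughput will be lower, and whether the difference matters depends on how much host-to-device traffic the workload generates. The function name is just illustrative.

```python
# Theoretical per-direction bandwidth of a PCIe link.
# Assumption: x16 link width (not confirmed in this thread).

def pcie_bandwidth_gb_per_s(transfer_rate_gt_s: float, lanes: int = 16) -> float:
    """Return the theoretical bandwidth in GB/s for one direction of the link."""
    encoding_efficiency = 128 / 130  # 128b/130b line encoding (PCIe 3.0 and 4.0)
    bits_per_second = transfer_rate_gt_s * 1e9 * encoding_efficiency * lanes
    return bits_per_second / 8 / 1e9

gen3 = pcie_bandwidth_gb_per_s(8.0)   # PCIe 3.0: 8 GT/s per lane
gen4 = pcie_bandwidth_gb_per_s(16.0)  # PCIe 4.0: 16 GT/s per lane

print(f"PCIe 3.0 x16: {gen3:.2f} GB/s per direction")
print(f"PCIe 4.0 x16: {gen4:.2f} GB/s per direction")
```

On paper, a 3.0 bus halves the host-to-card bandwidth; it does not affect on-card memory bandwidth.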

We don't have any hardware that can run this card on a 4.0 bus. If it's okay to stick with a 3.0 bus, my recommendation is to:

  1. Decommission half of the V100 nodes in NERC OpenStack, take the V100 GPUs out of the decommissioned nodes, and install them in the remaining V100 nodes (those are Dell R740s that can support 2 GPUs each). That leaves the same number of V100s in NERC OpenStack on half the nodes, which is also a better use of rack space than a single GPU per 2U.
  2. Install the new AMD cards in the now-empty R740s and add those nodes to NERC OpenShift (2 cards per node).

If we'd prefer to run these at 4.0 speeds, we'd need to buy something new. I recommend something along the lines of a Dell R760 (the last quote I got for one of these was $12k each), which can run 3 GPUs simultaneously on a 4.0 bus. Or we can check with FLAX to see what whitebox solutions they have, which are probably a lot cheaper.
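A back-of-the-envelope sketch of what the buy-new option would cost, using the $12k quote and 3 GPUs per R760 from above. The card counts passed in are example values only, since the final count isn't settled, and the quote may be out of date; the helper name is hypothetical.

```python
import math

def r760_purchase(cards: int, gpus_per_node: int = 3, price_per_node: int = 12_000):
    """Return (nodes_needed, total_cost_usd) for hosting `cards` GPUs."""
    nodes = math.ceil(cards / gpus_per_node)  # partially filled nodes still cost full price
    return nodes, nodes * price_per_node

# Example card counts (assumptions, not confirmed numbers).
for cards in (4, 8):
    nodes, cost = r760_purchase(cards)
    print(f"{cards} cards -> {nodes} x R760, roughly ${cost:,}")
```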

msdisme commented 2 weeks ago

For our purposes, PCIe 3.0 is fine. Can the R740s handle 2 cards per system (300 watts peak for each card)?
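A minimal sketch of the per-node power check this question implies. Only the 300 W per-card peak comes from this thread; the non-GPU draw and the node power budget are inputs that would need to come from the actual R740 PSU configuration, so no values are assumed for them here.

```python
MI100_PEAK_W = 300  # peak draw per card, as quoted in this thread

def gpu_peak_watts(cards_per_node: int, watts_per_card: int = MI100_PEAK_W) -> int:
    """Peak GPU draw for one node."""
    return cards_per_node * watts_per_card

def fits_budget(cards_per_node: int, non_gpu_watts: int, node_budget_watts: int,
                watts_per_card: int = MI100_PEAK_W) -> bool:
    """True if peak GPU draw plus the rest of the system stays within the node budget.

    non_gpu_watts and node_budget_watts must be supplied from real measurements
    or the PSU spec; they are not known from this thread.
    """
    total = gpu_peak_watts(cards_per_node, watts_per_card) + non_gpu_watts
    return total <= node_budget_watts

print(f"2 cards per node: {gpu_peak_watts(2)} W peak for GPUs alone")
```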

@hpdempsey any reason not to double them up?

msdisme commented 2 weeks ago

@hakasapl I am not 100% sure how many cards we are getting yet, but I think it is 4 or 8.

  1. Is my read of 16 V100 nodes correct?
  2. Any problems with heat/power for the rack if we increase the power draw on those R740s?

hakasapl commented 2 weeks ago

@msdisme I think there are 19 R740xds in that rack (each with 1 V100). I don't see any issues with doubling up on the GPUs. There are already several air-cooled racks with even higher GPU density in the facility.
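To make the consolidation numbers concrete, here is a rough sketch assuming 19 nodes with one V100 each, 2 GPUs per node after doubling up, and 4 or 8 AMD cards (the counts floated above, not confirmed). The function and key names are illustrative only.

```python
import math

def consolidation_plan(v100_nodes: int, amd_cards: int, gpus_per_node: int = 2) -> dict:
    """Estimate how many nodes are freed by doubling up V100s and how many AMD cards fit."""
    v100_gpus = v100_nodes  # one V100 per node today
    nodes_kept = math.ceil(v100_gpus / gpus_per_node)
    nodes_freed = v100_nodes - nodes_kept
    amd_nodes_needed = math.ceil(amd_cards / gpus_per_node)
    return {
        "v100_nodes_kept": nodes_kept,
        "nodes_freed": nodes_freed,
        "amd_slots_available": nodes_freed * gpus_per_node,
        "amd_nodes_needed": amd_nodes_needed,
        "enough_freed_nodes": amd_nodes_needed <= nodes_freed,
    }

for cards in (4, 8):
    print(cards, consolidation_plan(19, cards))
```

With an odd node count the split is not exactly half: 10 nodes stay on V100s (one of them with a single card) and 9 are freed. Even 8 AMD cards would need only 4 of those freed nodes at 2 cards per node.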

hpdempsey commented 2 weeks ago

Are the V100s in OpenStack currently being used by anyone? (Asking because of the downtime impact on charging.)