CCI-MOC / ops-issues

A100 for testing the GA candidate release for RHELAI with InstructLab #1369

Open tssala23 opened 2 weeks ago

tssala23 commented 2 weeks ago

I will be testing InstructLab on the RHEL AI image provided by the RHEL AI team. Currently InstructLab only runs on GPU nodes. The A100 can be assigned to the existing ESI project research_rhelai. CC/ @hpdempsey @joachimweyl @tzumainn MOC-R8PAC23U26

joachimweyl commented 2 weeks ago

@hpdempsey confirmed via email that she will fund this research. @tzumainn please move MOC-R8PAC23U28 to the research_rhelai project. @tssala23 once it is moved, please confirm you are able to access it.

tzumainn commented 2 weeks ago

Done!

hpdempsey commented 2 weeks ago

@tssala23 the OCTO team working on RHOAI with InstructLab cluster just asked for a second A100. We don't have any more available in ESI now, so that means I will have to give this one to them on Tuesday. (Yay we get to test short-term leases in ESI! ;-) ) Hopefully you can get some useful work done today. I had already requested that more GPUs be moved into ESI, so @joachimweyl, this raises the priority for that. But we don't really want the legacy ones from OpenStack for this, because that won't be representative of anything we would deploy InstructLab on in production normally. Please investigate and let me know when we can get at least one more "regular" A100 server in ESI.

hpdempsey commented 2 weeks ago

P.S. The legacy ones will still be OK for researchers, but as far as I know none of them are currently requesting bare metal GPUs.

joachimweyl commented 2 weeks ago

When we do move the legacy A100s, they will go to testing OpenShift nodes, not BM. Two new A100SXM4 nodes will become available Sep 3rd. That should cover this one additional request and leave one for the next request. Do we know how many more will be needed for RH? @jtriley we should probably cordon 4 A100SXM4 nodes to be prepared for either more BM requests or an increase in OpenShift usage. Here is a GH issue for that.

joachimweyl commented 1 week ago

@tssala23 my understanding is that MOC-R8PAC23U26 is assigned to this and requires a Lenovo fix. Is that correct?

tssala23 commented 1 week ago

Yes, it is having the same issue that MOC-R8PAC23U28 was having.

joachimweyl commented 3 days ago

@hakasapl is U28 working after Lenovo's board replacement?

hakasapl commented 3 days ago

There are no errors anymore but we probably have to put the node under load to fully know. They should be added back into ESI today and Taj can test then.

joachimweyl commented 3 days ago

@tssala23 please confirm you are able to access MOC-R8PAC23U26 now that it is fixed by Lenovo. Can you remind me if you also need MOC-R8PAC23U28?

tzumainn commented 3 days ago

I can confirm - I provisioned both nodes with rhelai and waited 10 minutes (the restart issue would happen after one or two minutes).

tssala23 commented 3 days ago

MOC-R8PAC23U28 is the node I will be running RHELAI on. That node has been provisioned with the RHELAI GA and I am able to access it. I will be opening an issue for building another cluster, which will most likely have MOC-R8PAC23U26 attached to it. However, I am not sure whether it needs the GPU immediately; that is a question for Heidi and a discussion we can have when I create the issue.