tssala23 opened 2 weeks ago
@hpdempsey provided confirmation that she will fund this research via email.
@tzumainn please move MOC-R8PAC23U28 to the research_rhelai project
@tssala23 once moved please confirm you are able to access it.
Done!
@tssala23 the OCTO team working on the RHOAI with InstructLab cluster just asked for a second A100. We don't have any more available in ESI now, so that means I will have to give this one to them on Tuesday. (Yay we get to test short-term leases in ESI! ;-) ) Hopefully you can get some useful work done today. I had already requested that more GPUs be moved into ESI, so @joachimweyl, this raises the priority for that. But we don't really want the legacy ones from OpenStack for this, because that won't be representative of anything we would normally deploy InstructLab on in production. Please investigate and let me know when we can get at least one more "regular" A100 server in ESI.
P.S. The legacy ones will still be OK for researchers, but as far as I know none of them are currently requesting bare metal GPUs.
When we do move the legacy A100s, they will go to testing OpenShift nodes, not BM. Two new A100 SXM4 nodes will become available Sep 3rd. That should cover this one additional request and leave one for the next request. Do we know how many more will be needed for RH? @jtriley we should probably cordon 4 A100 SXM4 nodes to be prepared for either more BM requests or an increase in OpenShift usage. Here is a GH issue for that.
@tssala23 my understanding is that MOC-R8PAC23U26 is assigned to this and requires a Lenovo fix, is that correct?
Yes, it is having the same issue MOC-R8PAC23U28 was having.
@hakasapl is U28 working after Lenovo's board replacement?
There are no errors anymore, but we probably have to put the node under load to know for sure. The nodes should be added back into ESI today, and Taj can test then.
@tssala23 please confirm you are able to access MOC-R8PAC23U26 now that it has been fixed by Lenovo. Can you remind me if you also need MOC-R8PAC23U28?
I can confirm: I provisioned both nodes with rhelai and waited 10 minutes (the restart issue used to happen after one or two).
MOC-R8PAC23U28 is the node I will be running RHEL AI on; it has been provisioned with the RHEL AI GA and I am able to access it.
I will be making an issue for building another cluster, which will most likely have MOC-R8PAC23U26 attached to it. However, I am not sure whether it needs the GPU immediately; that is a question for Heidi and a discussion we can have when I create the issue.
I will be testing InstructLab on the RHEL AI image provided by the RHEL AI team. Currently InstructLab only runs on GPU nodes. The A100 can be assigned to the existing ESI project research_rhelai.
CC/ @hpdempsey @joachimweyl @tzumainn