Open chryswoods opened 2 years ago
Just to add to this, booting the GPU node on a BM.GPU4.8 fails with the error "Shape BM.GPU4.8 is not valid for image" (reported in /var/log/slurm/elastic.log). It seems that the base default Oracle 8 image is not set as being compatible with the bare-metal GPU4 nodes.
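To check whether a cluster is hitting this particular failure, the log can be searched for the message. This is only a minimal sketch: it assumes the log lives at the path mentioned above, and the function name find_shape_errors is made up for illustration.

```shell
# Print every line of a Slurm elastic log that reports an image/shape mismatch.
# Defaults to the log path mentioned above; pass a different path as $1 if needed.
find_shape_errors() {
    log_file="${1:-/var/log/slurm/elastic.log}"
    # -n prefixes each matching line with its line number in the log
    grep -n "is not valid for image" "$log_file"
}
```

Calling find_shape_errors with no arguments on the management node lists the matching lines. Like grep itself, it returns a non-zero exit status when nothing matches.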
The fix I have tried is going into the Oracle console, finding the image, clicking "edit" and then adding GPU4.8 to the list of supported shapes for this image. (I had already rebuilt the image with CUDA drivers etc., although the error also came up for the default CitC image created by packer at install time.)
(docs on how to see and change the compatible shapes for an image are here - search in page for "To view compatible shapes for a custom image")
There were warnings in the Oracle console that any incompatibilities between the image and shape could mean that the shape wouldn't boot. However, submitting a job does result in the node being provisioned, and a test job that ran nvidia-smi completed without issue via Slurm and showed the output I would expect (all 8 GPUs visible and usable).
This is more for info than anything else... plus it is some useful rubber ducking, and may help anyone else encountering the same issue.
The run_ansible script exits with errors when building clusters on Oracle Linux. The issue is that the ansible package was updated in May 2022, and it now uses Python 3.8. Somehow the packaging of Ansible on Oracle Linux fails to include various important modules, e.g. python-ldap and PyMySQL, amongst others. This means that run_ansible is unable to run and exits with an error. This is not caught by the finish script, which gives the impression that the cluster is still being built, and so installation appears to hang.

I tried downgrading ansible to the old version, but this appears to have disappeared from ol8-appstream, and my attempt to install from epel (by disabling ol8-appstream) failed because of unresolvable dependencies.

The solution I found was, as root, to run "yum remove ansible" and then "pip3.6 install ansible". This installed ansible against Python 3.6 (which has all of the required modules) and placed the result in /usr/local/bin. I then updated the path in the run_ansible script to /usr/local/bin/ansible-playbook and this completed every stage but one. The security-updates role failed with an error. I removed this role (it is just a security update..!) and re-ran...
This now progressed all the way to the end, and I am just waiting for packer to finish creating the image...
...and now it has all completed. The cluster appears to be working. I'll add more if I find any other issues.
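For anyone wanting to repeat the workaround, the steps above can be sketched roughly as follows. This is a hedged sketch, not a tested fix. The function names are made up for illustration, and the location of the run_ansible script and the way it invokes ansible-playbook (bare, or as /usr/bin/ansible-playbook) are assumptions that should be checked against your own install. Removing the security-updates role is not covered here.

```shell
# Sketch of the workaround described above. Run as root on the management node.

# Replace the packaged Ansible (built against Python 3.8) with a pip install
# against Python 3.6, which places ansible-playbook in /usr/local/bin.
reinstall_ansible() {
    yum remove -y ansible
    pip3.6 install ansible
}

# Point a run_ansible script at the pip-installed ansible-playbook.
# $1 is the path to the script (its location is an assumption; check your install).
# Assumes the script calls ansible-playbook bare or as /usr/bin/ansible-playbook.
# The original file is kept alongside it with a .bak suffix.
patch_ansible_path() {
    script="$1"
    sed -E -i.bak \
        's#(^|[[:space:]])(/usr/bin/)?ansible-playbook#\1/usr/local/bin/ansible-playbook#g' \
        "$script"
}
```

A possible invocation, assuming run_ansible is on root's PATH, would be: reinstall_ansible && patch_ansible_path "$(command -v run_ansible)". The sed expression leaves a line that already points at /usr/local/bin/ansible-playbook unchanged, so running the patch twice should be harmless.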