clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

CitC is broken on Oracle because Oracle Linux has updated to a broken ansible #45

Open chryswoods opened 2 years ago

chryswoods commented 2 years ago

This is more for info than anything else... plus it is some useful rubber ducking, and may help anyone else encountering the same issue.

The run_ansible script exits with errors when building clusters on Oracle Linux. The cause is that the ansible package was updated in May 2022 and now uses Python 3.8. Somehow the packaging of Ansible on Oracle Linux fails to include various important modules, e.g. python-ldap and PyMySQL, amongst others. This means that run_ansible is unable to run and exits with an error. This is not caught by the finish script, which gives the impression that the cluster is still being built, and so installation appears to hang.
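For anyone wanting to confirm which modules are missing on their own system, a minimal sketch is to try importing them with the interpreter that Ansible uses. The function name `missing_modules` is hypothetical, and the interpreter path in the example is an assumption. python-ldap is imported as `ldap` and PyMySQL as `pymysql`.

```shell
#!/usr/bin/env bash
# Print each module that the given Python interpreter cannot import.
# Usage: missing_modules <python-interpreter> <module>...
missing_modules() {
    local python="$1"
    shift
    local module
    for module in "$@"; do
        # "import X" exits non-zero when the module is not installed.
        if ! "$python" -c "import ${module}" >/dev/null 2>&1; then
            echo "$module"
        fi
    done
}

# Example (interpreter path is an assumption; adjust for your system):
#   missing_modules /usr/bin/python3.8 ldap pymysql
```

Any module names printed are the ones that interpreter cannot import; no output means all of them were found.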

I tried downgrading ansible to the old version, but this appears to have disappeared from ol8-appstream, and my attempt to install from epel (by disabling ol8-appstream) failed because of unresolvable dependencies.

The solution I found was, as root, to yum remove ansible and then run pip3.6 install ansible. This installed ansible against Python 3.6 (which has all of the required modules) and placed the result in /usr/local/bin. I then updated the path in the run_ansible script to /usr/local/bin/ansible-playbook, after which the run completed every stage but one. The security_updates role failed with the error:

TASK [security_updates : Install security updates] *****************************
Wednesday 06 July 2022  16:53:26 +0000 (0:00:00.746)       0:05:19.368 ******** 
fatal: [mgmt.subnet.sharpphoenix.oraclevcn.com]: FAILED! => changed=true 
  cmd:
  - dnf
  - update
  - -y
  - --security
  - --exclude
  - kernel*
  - --exclude
  - slurm*
  delta: '0:00:05.836997'
  end: '2022-07-06 16:53:32.741248'
  msg: non-zero return code
  rc: 1
  start: '2022-07-06 16:53:26.904251'
  stderr: |-
    Error:
     Problem 1: package shim-x64-15.6-1.0.3.el8.x86_64 requires oracle(kernel-sig-key) >= 202204, but none of the providers can be installed
      - cannot install the best update candidate for package shim-x64-15.3-1.0.3.x86_64
      - package kernel-4.18.0-372.13.1.0.1.el8_6.x86_64 is filtered out by exclude filtering
      - package kernel-uek-5.4.17-2136.307.3.6.el8uek.x86_64 is filtered out by exclude filtering
      - package kernel-uek-5.4.17-2136.308.7.el8uek.x86_64 is filtered out by exclude filtering
      - package kernel-uek-5.4.17-2136.308.9.el8uek.x86_64 is filtered out by exclude filtering
      - package kernel-uek-debug-5.4.17-2136.307.3.6.el8uek.x86_64 is filtered out by exclude filtering
     Problem 2: problem with installed package shim-x64-15.3-1.0.3.x86_64
      - package grub2-efi-x64-1:2.02-123.0.4.el8_6.8.x86_64 conflicts with shim-x64 <= 15.3-1.0.3 provided by shim-x64-15.3-1.0.3.x86_64
      - package shim-x64-15.6-1.0.3.el8.x86_64 requires oracle(kernel-sig-key) >= 202204, but none of the providers can be installed
      - cannot install the best update candidate for package grub2-efi-x64-1:2.02-123.0.1.el8.x86_64
      - package kernel-4.18.0-372.13.1.0.1.el8_6.x86_64 is filtered out by exclude filtering
      - package kernel-uek-5.4.17-2136.307.3.6.el8uek.x86_64 is filtered out by exclude filtering
      - package kernel-uek-5.4.17-2136.308.7.el8uek.x86_64 is filtered out by exclude filtering
      - package kernel-uek-5.4.17-2136.308.9.el8uek.x86_64 is filtered out by exclude filtering
      - package kernel-uek-debug-5.4.17-2136.307.3.6.el8uek.x86_64 is filtered out by exclude filtering
  stderr_lines: <omitted>
  stdout: |-
    Last metadata expiration check: 0:02:59 ago on Wed 06 Jul 2022 04:50:29 PM GMT.
    (try to add '--allowerasing' to command line to replace conflicting packages or '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
  stdout_lines: <omitted>

I removed this role (it is just a security update..!) and re-ran...

This now progressed all the way to the end, and I am just waiting for packer to finish creating the image...

...and now it has all completed. The cluster appears to be working. I'll add more if I find any other issues.
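The Ansible workaround described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the location of run_ansible is passed in as an argument rather than guessed, and the sed expression assumes the script calls either a bare `ansible-playbook` or `/usr/bin/ansible-playbook`. Check the script before editing it.

```shell
#!/usr/bin/env bash
# Replace the distribution Ansible with a pip install against Python 3.6.
# Must be run as root.
reinstall_ansible_with_pip() {
    # Remove the packaged Ansible that is built against Python 3.8.
    yum remove -y ansible
    # pip places ansible-playbook and friends in /usr/local/bin.
    pip3.6 install ansible
}

# Point a script at /usr/local/bin/ansible-playbook.
# Assumes the script calls "ansible-playbook" or "/usr/bin/ansible-playbook".
# A backup of the original is kept next to it with a .bak suffix.
use_local_ansible_playbook() {
    local script="$1"
    sed -i.bak -E \
        's#(^|[[:space:]])(/usr/bin/)?ansible-playbook#\1/usr/local/bin/ansible-playbook#' \
        "$script"
}

# Example, as root (substitute the real location of run_ansible):
#   reinstall_ansible_with_pip
#   use_local_ansible_playbook /path/to/run_ansible
```

The rewrite is safe to repeat, because a line that already points at /usr/local/bin/ansible-playbook no longer matches the pattern.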

chryswoods commented 2 years ago

Just to add to this, booting the GPU node on a BM.GPU4.8 fails with the error "Shape BM.GPU4.8 is not valid for image" (reported in /var/log/slurm/elastic.log). It seems that the default Oracle Linux 8 base image is not marked as compatible with the bare-metal GPU4 shapes.

The fix I have tried is going into the Oracle console, finding the image, clicking "edit" and then adding BM.GPU4.8 to the list of supported shapes for this image. (I had already rebuilt the image with CUDA drivers etc., although the error also came up for the default CitC image created by packer at install time.)

(docs on how to see and change the compatible shapes for an image are here - search in page for "To view compatible shapes for a custom image")
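The same change can probably be scripted rather than made through the console. The sketch below assumes the OCI CLI (`oci`) is installed and configured, and that its `compute image-shape-compatibility-entry add` subcommand takes `--image-id` and `--shape-name`; confirm this with `oci compute image-shape-compatibility-entry add --help` before relying on it. The function name and the image OCID placeholder are hypothetical.

```shell
#!/usr/bin/env bash
# Mark a shape as compatible with a custom image.
# Usage: allow_shape_for_image <image-ocid> <shape-name>
allow_shape_for_image() {
    local image_id="$1"
    local shape="$2"
    # Assumption: this subcommand and its flags match the installed OCI CLI.
    oci compute image-shape-compatibility-entry add \
        --image-id "$image_id" \
        --shape-name "$shape"
}

# Example (replace the placeholder with the OCID of your image):
#   allow_shape_for_image <image-ocid> BM.GPU4.8
```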

There were warnings in the Oracle console that any incompatibilities between the image and shape could mean that the shape wouldn't boot. However, submitting a job does result in the node being provisioned, and a test job that ran nvidia-smi ran without issue via Slurm and showed the output I would expect (all 8 GPUs visible and usable).
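A check like the one described can be put into a small job script. This is a sketch rather than the job that was actually run: the function names and the `NVIDIA_SMI` override variable are hypothetical, and any Slurm options needed to land the job on the GPU node are left out because they depend on the cluster configuration. It relies on `nvidia-smi -L` printing one line starting with "GPU " per device.

```shell
#!/usr/bin/env bash
# Command used to list GPUs; overridable, mainly for testing.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Count the devices listed by "nvidia-smi -L" (one "GPU <n>: ..." line each).
count_gpus() {
    "$NVIDIA_SMI" -L | grep -c '^GPU '
}

# Succeed only if the node exposes the expected number of GPUs.
# Usage: check_gpus <expected-count>
check_gpus() {
    local expected="$1"
    local found
    found="$(count_gpus)"
    if [ "$found" -eq "$expected" ]; then
        echo "OK: ${found} GPUs visible"
    else
        echo "FAIL: expected ${expected} GPUs, found ${found}" >&2
        return 1
    fi
}

# Example, at the end of a job script submitted to the GPU node:
#   check_gpus 8
```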