Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

AlmaLinux 8 GPG Key has been updated causing Slurm packages to no longer install #214

Closed garymansellricardo closed 5 months ago

garymansellricardo commented 6 months ago

The AlmaLinux package maintainer says that they updated their GPG keys last month, so we can no longer install dependent packages:

Hello. Long story short: rpm --import https://repo.almalinux.org/almalinux/RPM-GPG-KEY-AlmaLinux

Since January 15 2024 we sign AlmaLinux 8 packages with new GPG key. So it's not only about mysql.

Details: https://almalinux.org/blog/2023-12-20-almalinux-8-key-update/

Sorry for inconvenience.

--

Best Regards,

Andrew Lukoshko | AlmaLinux OS Architect

This is the error in CycleCloud UI when trying to build the Scheduler Node (and ultimately compute node):

Failed to execute cluster-init script '/mnt/cluster-init/slurm/scheduler/scripts/00-install.sh' in project 'slurm' (return code: 1)
Software Configuration
Edit and re-upload the script to correct the error and try again
Get more help on this issue
Detail:
Script output:
Last metadata expiration check: 0:08:52 ago on Fri 23 Feb 2024 11:42:41 AM EST.
Package epel-release-8-19.el8.noarch is already installed.
Dependencies resolved.
Nothing to do.
Complete!
Last metadata expiration check: 0:08:53 ago on Fri 23 Feb 2024 11:42:41 AM EST.
Package munge-0.5.13-2.el8.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
Last metadata expiration check: 0:08:47 ago on Fri 23 Feb 2024 11:42:48 AM EST.
Package perl-Switch-2.17-10.el8.noarch is already installed.
Dependencies resolved.
Nothing to do.
Complete!
Last metadata expiration check: 0:08:55 ago on Fri 23 Feb 2024 11:42:41 AM EST.
Dependencies resolved.

Package                      Arch    Version                                Repository    Size

Installing:
 slurm                       x86_64  23.02.6-1.el8                          @commandline  20 M
 slurm-contribs              x86_64  23.02.6-1.el8                          @commandline  21 k
 slurm-devel                 x86_64  23.02.6-1.el8                          @commandline  84 k
 slurm-example-configs       x86_64  23.02.6-1.el8                          @commandline  13 k
 slurm-libpmi                x86_64  23.02.6-1.el8                          @commandline  162 k
 slurm-openlava              x86_64  23.02.6-1.el8                          @commandline  13 k
 slurm-pam_slurm             x86_64  23.02.6-1.el8                          @commandline  170 k
 slurm-perlapi               x86_64  23.02.6-1.el8                          @commandline  794 k
 slurm-slurmctld             x86_64  23.02.6-1.el8                          @commandline  1.5 M
 slurm-slurmd                x86_64  23.02.6-1.el8                          @commandline  785 k
 slurm-slurmdbd              x86_64  23.02.6-1.el8                          @commandline  877 k
 slurm-slurmrestd            x86_64  23.02.6-1.el8                          @commandline  155 k
 slurm-torque                x86_64  23.02.6-1.el8                          @commandline  133 k
Installing dependencies:
 http-parser                 x86_64  2.8.0-9.el8                            appstream     41 k
 mariadb-connector-c-config  noarch  3.1.11-2.el8_3                         appstream     14 k
 mysql-common                x86_64  8.0.36-1.module_el8.9.0+3735+82bd6c11  appstream     136 k
 mysql-libs                  x86_64  8.0.36-1.module_el8.9.0+3735+82bd6c11  appstream     1.5 M
Enabling module streams:
 mysql                               8.0

Transaction Summary

Install 17 Packages

Total size: 26 M
Installed size: 114 M
Downloading Packages:
[SKIPPED] http-parser-2.8.0-9.el8.x86_64.rpm: Already downloaded
[SKIPPED] mariadb-connector-c-config-3.1.11-2.el8_3.noarch.rpm: Already downloaded
[SKIPPED] mysql-common-8.0.36-1.module_el8.9.0+3735+82bd6c11.x86_64.rpm: Already downloaded
[SKIPPED] mysql-libs-8.0.36-1.module_el8.9.0+3735+82bd6c11.x86_64.rpm: Already downloaded
AlmaLinux 8 - AppStream 3.3 MB/s | 3.4 kB 00:00
Importing GPG key 0xC21AD6EA:
 Userid     : "AlmaLinux <packager@almalinux.org>"
 Fingerprint: E53C F5EF 91CE B0AD 1812 ECB8 51D6 647E C21A D6EA
 From       : /etc/pki/rpm-gpg/RPM-GPG-KEY-AlmaLinux
Key imported successfully
Import of key(s) didn't help, wrong key(s)?
Public key for mysql-common-8.0.36-1.module_el8.9.0+3735+82bd6c11.x86_64.rpm is not installed. Failing package is: mysql-common-8.0.36-1.module_el8.9.0+3735+82bd6c11.x86_64
 GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-AlmaLinux
Public key for mysql-libs-8.0.36-1.module_el8.9.0+3735+82bd6c11.x86_64.rpm is not installed. Failing package is: mysql-libs-8.0.36-1.module_el8.9.0+3735+82bd6c11.x86_64
 GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-AlmaLinux
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'yum clean packages'.
Error: GPG check FAILED
An error occured during installation. See log file /var/log/azure-slurm-install.log for details.
2024-02-23 11:51:37,314 ERROR: An error occured during installation.
Traceback (most recent call last):
  File "install.py", line 582, in <module>
    main()
  File "install.py", line 558, in main
    run_installer(settings, os.path.abspath(f"{args.platform}.sh"), args.mode)
  File "install.py", line 145, in run_installer
    subprocess.check_call([path, mode, s.slurmver])
  File "/opt/cycle/jetpack/system/embedded/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/cycle/jetpack/system/bootstrap/azure-slurm-install/rhel.sh', 'scheduler', '23.02.6-1']' returned non-zero exit status 1.

As you can see, the issue is with the signing of the mysql RPM packages. These are dependencies of the azure-slurm-install-pkg-3.0.5.tar.gz packages (which I think are related to this GitHub repository?).
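For anyone hitting the same error, here is a minimal sketch for pulling the rejected package names out of the installer output. It assumes the dnf/yum messages use the "Failing package is:" wording seen in the output above; `failing_packages` is a hypothetical helper name, not part of cyclecloud-slurm.

```shell
#!/bin/sh
# Sketch: list the packages that dnf/yum rejected during a GPG check.
# Reads installer output on stdin and prints each failing package once.
# failing_packages is a hypothetical helper, not part of any project.
failing_packages() {
    grep -o 'Failing package is: [^ ]*' \
        | sed 's/^Failing package is: //' \
        | sort -u
}
```

If the dnf output is captured in the log file named in the error (an assumption), it could be used as `failing_packages < /var/log/azure-slurm-install.log`.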

garymansellricardo commented 5 months ago

For now, I have managed to get my AlmaLinux 8 Scheduler and Processing nodes to come up again by adding the new RPM GPG Key to the cloud-init script:

#cloud-config
runcmd:
  - rpm --import https://repo.almalinux.org/almalinux/RPM-GPG-KEY-AlmaLinux
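To check on a running node whether a given key is already in the rpm database before importing, something like the following sketch may help. The names `key_listed` and `ensure_key` are hypothetical, and the short ID of the new key is not shown in this thread, so it is left as a parameter; the log above only shows the old key, 0xC21AD6EA.

```shell
#!/bin/sh
# Sketch: import the AlmaLinux key only when it is missing from the rpm database.
KEY_URL="https://repo.almalinux.org/almalinux/RPM-GPG-KEY-AlmaLinux"

# Succeeds if stdin (in `rpm -q gpg-pubkey` format, e.g. gpg-pubkey-c21ad6ea-...)
# contains a key whose short ID matches $1, ignoring case.
key_listed() {
    grep -qi "^gpg-pubkey-$1-"
}

# Hypothetical helper: $1 is the short ID of the key you expect to be present.
ensure_key() {
    if rpm -q gpg-pubkey 2>/dev/null | key_listed "$1"; then
        echo "key $1 already imported"
    else
        rpm --import "$KEY_URL"
    fi
}
```

Calling `ensure_key` with the short ID of the new key (taken from the AlmaLinux announcement) would then skip the import when that key is already present.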

aditigaur4 commented 5 months ago

Hello, what you have described is the solution here; adding that command to cloud-init should work. We have asked the azhpc images team to include this in the images, and they will also be publishing new versions of the AlmaLinux HPC image, so this problem will then be solved permanently.

garymansellricardo commented 5 months ago

That's great news, thanks. As a further follow-up to my temporary fix above, it can be improved by changing it to this:

#cloud-config
bootcmd:
  - rpm --import https://repo.almalinux.org/almalinux/RPM-GPG-KEY-AlmaLinux

By using bootcmd, the key import runs early in the cloud-init process and, crucially, before any packages that the cloud-init config references are installed (runcmd runs after).

With runcmd, any packages in the cloud-init that are signed with the new key would themselves fail to install, causing cloud-init to fail and the nodes not to start.

Hope that helps someone (as it foxed me).