ULHPC / puppet-slurm

A Puppet module designed to configure and manage SLURM (see https://slurm.schedmd.com/), an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
Apache License 2.0

slurm updates #8

Closed: uvNikita closed this issue 6 years ago

uvNikita commented 6 years ago

It looks like the module downloads and builds the new packages, but doesn't install them. I changed version to 17.11.0-0rc3 and src_checksum to 51783493e95b839e6322fb95622adfa5. Relevant output from the puppet agent log:

(/Stage[main]/Slurm::Install/Slurm::Download[17.11.0-0rc3]/Archive[slurm-17.11.0-0rc3.tar.bz2]/ensure) download archive from https://www.schedmd.com/downloads/latest/slurm-17.11.0-0rc3.tar.bz2 to /usr/local/src/slurm-17.11.0-0rc3.tar.bz2  without cleanup
(/Stage[main]/Slurm::Install/Slurm::Build[17.11.0-0rc3]/Exec[build-slurm-17.11.0-0rc3]/returns) executed successfully

New packages appear in /root/rpmbuild/RPMS/x86_64, but installed version is still old:

$ slurmd -V
slurm 17.02.7

$ yum info slurm
Name        : slurm
Arch        : x86_64
Version     : 17.02.7
Release     : 1.el7.centos
Size        : 93 M
Repo        : installed
Summary     : Slurm Workload Manager
URL         : https://slurm.schedmd.com/
License     : GPL
Description : Slurm is an open source, fault-tolerant, and highly
            : scalable cluster management and job scheduling system for Linux clusters.
            : Components include machine status, partition management, job management,
            : scheduling and accounting modules
Falkor commented 6 years ago

Indeed -- I didn't want to force unexpected installations, so by default the installation is not forced if the package is already present: see manifests/install/packages.pp#L117
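In practice that corresponds to a package resource whose ensure only requires presence, roughly like this sketch (illustrative only, not the module's literal code; the RPM path and release suffix follow the rpmbuild layout shown above):

package { 'slurm':
  # 'installed' is satisfied by whatever slurm package is already present,
  # so an already-installed (older) version blocks the freshly built RPM
  ensure   => 'installed',
  provider => 'rpm',
  source   => '/root/rpmbuild/RPMS/x86_64/slurm-17.11.0-0rc3-1.el7.x86_64.rpm',
}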

uvNikita commented 6 years ago

Thanks for the answer! It sounds reasonable, but how do you perform updates on your cluster then? Just executing some script on the nodes?

Falkor commented 6 years ago

I do it manually: I stop the daemons, remove the packages, and re-run puppet to install the new version.
Indeed, I might add a manifest for that, or (simpler) a boolean to force the reinstall.
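Such a boolean could work roughly along these lines (a hypothetical sketch, not something the module provides today; the parameter name is made up):

# Hypothetical parameter (e.g. slurm::force_reinstall); the module has no such option yet.
$force_reinstall = false
$version         = '17.11.0-0rc3'   # target version, normally coming from hiera

package { 'slurm':
  # pinning to the exact version makes every version bump trigger an upgrade,
  # while 'installed' keeps the current non-forcing behaviour
  ensure => $force_reinstall ? {
    true    => $version,
    default => 'installed',
  },
}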

uvNikita commented 6 years ago

It would be great to have an option for this at some point. In the meantime, I'll do it manually as well.

Falkor commented 6 years ago

The new version was released to allow for the 17.11.3-2 installation.

I'm closing this issue for now.

Rovanion commented 5 years ago

I've just spent some time trying to debug why newer versions of the package weren't being installed. It never occurred to me that the default behaviour would be not to install them, and since this is such a niche piece of software I didn't think anyone else would have hit the same issue, so it took me a while to check the issue tracker. That's on me though.

My question now is: how would you modify the code so that a new version of the packages is installed each time the version string changes? I made an initial attempt by setting ensure => $version, but that ended up in dependency hell. Any tips appreciated.
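For reference, the attempt boiled down to roughly this change on the package resource (a sketch only; the module's actual resource carries more attributes):

package { 'slurm':
  # pin the package to the requested version so that a hiera bump
  # forces an upgrade; this is the change that pulled me into dependency hell
  ensure => $version,
}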

Falkor commented 5 years ago

Assuming you set the do_package_install and do_build variables to true (which is the default), it might still be a bug ;) I don't see why slurm::install::packages would not proceed with the installation.

To answer your question, maybe one idea would be to extract the currently installed version of the package through a custom fact, for instance using the command

rpm -qa | egrep '^slurm-[1-9]' | sort -n | xargs rpm -q --queryformat '%{VERSION}'

and compare it to $version. It might be easier to force the install systematically, but that is quite dangerous for such a sensitive piece of software. For this reason, on our side, I actually prefer to set do_package_install to false and perform the update manually, so that I can check the logs carefully.
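The comparison side could then look roughly like this in a manifest (a sketch only, assuming a hypothetical custom fact named slurm_version populated from the command above):

# 'slurm_version' is a hypothetical custom fact, not shipped with the module
$installed = $facts['slurm_version']
$wanted    = '17.11.12'             # i.e. the $version requested via hiera

if $installed and versioncmp($installed, $wanted) < 0 {
  # an older Slurm is installed: only in that case would the package
  # resources be switched over to the newly built RPMs
  notice("Slurm ${installed} installed, ${wanted} requested: upgrade needed")
}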

Here is how this is mostly done on our side (I should probably integrate it into this module's documentation):

  1. Make a reservation for the maintenance

scontrol create res Reservation=slurmupdate StartTime=<YYYY-MM-DD>T<HH>:00:00 Duration=1-00:00:00 Flags=Maint,Ignore_Jobs Accounts=ulhpc Nodes=ALL

  2. Back up the Slurm VMs and DB

  3. Put all nodes in 'drain' mode

scontrol update NodeName=<name>-[<start>-<end>] State=drain reason="Slurm update to <version>"

  4. Stop all Slurm daemons slurm{d,ctld,dbd}
    • Stop all slurmd daemons on the compute nodes and (if present) access nodes
    • On the Slurm controller / database server(s), stop all daemons (leaving slurmdbd for last); open a screen session with several tabs:
$> screen

# Tab 1: log slurmctld
$> tailf -n 100 /var/log/slurm/slurmctld.log
# Tab 2: logs slurmdbd
$> tailf -n 100 /var/log/slurm/slurmdbd.log
# Tab 3: work here ;)
# 1. Stop slurmctld first
$> systemctl status slurmctld.service
$> systemctl stop slurmctld.service && systemctl status slurmctld.service

# 2. Stop slurmDBD second
$> systemctl status slurmdbd.service
$> systemctl stop slurmdbd.service && systemctl status slurmdbd.service
  5. Stop the puppet agent
$> systemctl status puppet
$> systemctl stop puppet.service && systemctl status puppet.service

Finally, check the Slurm packages currently installed (as a sorted list):

$> rpm -qa | grep -i slurm | sort
  6. Prepare puppet to build the new version

Ex: update the hiera puppet profile to include the new version (collect the source checksum from the official Slurm download page):

 # SLURM general settings
-slurm::version: '17.11.11'
+slurm::version: '17.11.12'
 slurm::do_package_install: false
 slurm::uid: 900
 slurm::gid: 900
 slurm::src_archived: false
-slurm::src_checksum: '3f68c925a8b4a187b9d641b6f341022a'
+slurm::src_checksum: '94fb13b509d23fcf9733018d6c961ca9'
 slurm::service_manage: true

Apply puppet; it should build the new version but not install it (since slurm::do_package_install is false).

  7. Install the newly built packages
### Ex for 17.11.12 - ADAPT accordingly
cd /root/rpmbuild/RPMS/x86_64
# main controller/head + SlurmDBD node
ls --color=never slurm*17.11.12*.rpm
ls --color=never slurm*17.11.12*.rpm | xargs yum -y --nogpgcheck localinstall

# frontend nodes - no need for slurmd
ls --color=never slurm*17.11.12*.rpm | grep -vE '(sql|slurmdbd|slurmctld|slurmd)'
ls --color=never slurm*17.11.12*.rpm | grep -vE '(sql|slurmdbd|slurmctld|slurmd)' | xargs yum -y --nogpgcheck localinstall

# Compute node
ls --color=never slurm*17.11.12*.rpm | grep -vE '(sql|slurmdbd|slurmctld)'
ls --color=never slurm*17.11.12*.rpm | grep -vE '(sql|slurmdbd|slurmctld)' | xargs yum -y --nogpgcheck localinstall
  8. Check the installed version
# Check that everything is in order
$> rpm -qa | grep -i slurm | sort   # check current version
$> rpm -qa | grep -i slurm | sort | wc -l
  9. Start the daemons and carefully check the logs while restarting
    • Start with slurmdbd
# 1. slurmdbd 
$> tailf /var/log/slurm/slurmdbd.log   # if not yet done in one tab
$> systemctl start slurmdbd

Watch for the database table conversion, if any. It should end with

[2018-02-07T22:38:37.072] Conversion done: success!
[2018-02-07T22:38:53.918] slurmdbd version 17.11.12 started
Then start slurmctld, again watching the log:

$> tailf /var/log/slurm/slurmctld.log   # if not yet done in one tab
$> systemctl start slurmctld

It can take a while. At this stage, and assuming everything went fine, you probably want to update the slurm VM and reboot it.

Finally, on the compute nodes, start slurmd and watch its log:

$> tailf /var/log/slurm/slurmd.log   # if not yet done
$> systemctl start slurmd
Falkor commented 5 years ago

Note that I'll need to recheck the module to make sure it works with the 19.* releases, so this might come with some changes.

Rovanion commented 5 years ago

> Assuming you set the do_package_install and do_build variables to true (which is the default), it might still be a bug ;) I don't see why slurm::install::packages would not proceed with the installation.

Yes, I too found that strange. Looking at this issue again, though, it is obvious that slurm::install::packages does in fact proceed with the installation, but fails while doing so:

Error: Could not update: Execution of '/bin/rpm -U --oldpackage --nodeps /root/rpmbuild/RPMS/x86_64/slurm-19.05.2*.rpm' returned 2: file /usr/share/man/man3/Slurm.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
    file /usr/share/man/man3/Slurm::Bitstr.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
    file /usr/share/man/man3/Slurm::Constant.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
    file /usr/share/man/man3/Slurm::Hostlist.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
    file /usr/share/man/man3/Slurm::Stepctx.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
    file /usr/share/man/man3/Slurmdb.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
Error: /Stage[main]/Slurm::Install/Slurm::Install::Packages[19.05.2]/Package[slurm]/ensure: change from '19.05.0-0rc1.el7' to '19.05.2' failed: Could not update: Execution of '/bin/rpm -U --oldpackage --nodeps /root/rpmbuild/RPMS/x86_64/slurm-19.05.2*.rpm' returned 2: file /usr/share/man/man3/Slurm.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
    file /usr/share/man/man3/Slurm::Bitstr.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
    file /usr/share/man/man3/Slurm::Constant.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
    file /usr/share/man/man3/Slurm::Hostlist.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
    file /usr/share/man/man3/Slurm::Stepctx.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
    file /usr/share/man/man3/Slurmdb.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64

My memory tells me that the above was not present in the puppet agent -t runs I did a couple of weeks ago, though it is much more likely that my memory is failing me.

Either way, thank you for the very thorough guide on how you upgrade your cluster. I thought SLURM was robust enough to handle different minor versions running on different parts of the cluster, but I can imagine how having different versions running slurmctld and slurmdbd would cause issues.