uvNikita closed this issue 6 years ago.
Indeed -- I didn't want to force unexpected installations.
So by default the install does not force the installation if the package is already present: see manifests/install/packages.pp#L117
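For context, that "install only if absent" behaviour can be sketched in shell (an illustrative mock, not the module's actual code; `ensure_installed` and its query/install command arguments are hypothetical stand-ins):

```shell
#!/bin/sh
# Sketch of Puppet's "ensure => installed" semantics: act only when the
# package is absent; an already-present package is never upgraded.
# ensure_installed <pkg> <query-cmd> <install-cmd>
ensure_installed() {
  pkg=$1; query=$2; install=$3
  if $query "$pkg" >/dev/null 2>&1; then
    echo "$pkg already installed, nothing to do"   # never upgrades
  else
    $install "$pkg"
  fi
}

# On a real node this would be something like:
#   ensure_installed slurm "rpm -q" "yum -y install"
# Mocked here with true/false as the query command:
ensure_installed slurm true  "echo would-install"   # already present
ensure_installed slurm false "echo would-install"   # absent
```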
Thanks for the answer! It sounds reasonable, but how do you perform updates on your cluster then? Just executing some script on the nodes?
Manually: I stop the daemons, remove the packages, and re-run puppet to install the new version.
Indeed, I might add a manifest for that, or (simpler) a boolean to force the reinstall.
That would be great to have an option for this at some point. In the meantime, I'll do it manually as well.
The new version was released to allow for the 17.11.3-2 installation.
I close this issue for now.
I've just spent some time trying to debug why newer versions of the package weren't being installed. I never thought that the default behaviour would be not to install, and since it's such a niche piece of software I didn't think anyone would have had the same issue as me, so it took a while before I looked at the issue tracker. That's on me, though.
My question now is: how would you modify the code to install a new version of the packages each time the version string is changed? I made an initial attempt by setting `ensure => $version`, but that ended up in dependency hell. Any tips appreciated.
Assuming you set the `do_package_install` and `do_build` variables to true (which is the default), it might still be a bug ;) I don't see why in `slurm::install::packages` it would not proceed to the installation.
To answer your question: maybe one idea would be to extract the current version of the installed package through a custom fact, for instance using the command `rpm -qa | egrep '^slurm-[1-9]' | sort -n | xargs rpm -q --queryformat '%{VERSION}'`, and compare it to `$version`.
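The comparison part could be sketched with `sort -V` (a hedged sketch under the assumption of GNU sort; `version_lt` is a hypothetical helper, and the wiring into an actual Puppet fact is left out):

```shell
#!/bin/sh
# version_lt A B: succeeds when version string A sorts strictly before B
# (sort -V is GNU coreutils' version-aware sort).
version_lt() {
  [ "$1" != "$2" ] && \
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# On a real node, "installed" would come from the rpm command above, e.g.:
#   rpm -qa | egrep '^slurm-[1-9]' | sort -n | xargs rpm -q --queryformat '%{VERSION}'
installed=17.11.11
wanted=17.11.12
if version_lt "$installed" "$wanted"; then
  echo "upgrade needed: $installed -> $wanted"
fi
```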
It might be easier to force the install systematically, but that's quite dangerous for such sensitive software. For this reason, on our side, I actually prefer to set `do_package_install` to false and perform the update manually, to be able to check the logs carefully etc. Here is how this is mostly done (I should probably integrate it in the docs of this module):
scontrol create res Reservation=slurmupdate StartTime=<YYYY-MM-DD>T<HH>:00:00 Duration=1-00:00:00 Flags=Maint,Ignore_Jobs Accounts=ulhpc Nodes=ALL
Backup Slurm VMs and DB
Put all nodes in 'drain' mode
scontrol update NodeName=<name>-[<start>-<end>] State=drain reason="Slurm update to <version>"
Stop the `slurm{d,ctld,dbd}` daemons, starting with the `slurmd` daemons on the nodes and (if present) access nodes. Work from a `screen` session:
$> screen
# Tab 1: log slurmctld
$> tailf -n 100 /var/log/slurm/slurmctld.log
# Tab 2: logs slurmdbd
$> tailf -n 100 /var/log/slurm/slurmdbd.log
# Tab 3: work here ;)
# 1. Stop slurmctld first
$> systemctl status slurmctld.service
$> systemctl stop slurmctld.service && systemctl status slurmctld.service
# 2. Stop slurmDBD second
$> systemctl status slurmdbd.service
$> systemctl stop slurmdbd.service && systemctl status slurmdbd.service
# 3. Stop the puppet agent
$> systemctl status puppet
$> systemctl stop puppet.service && systemctl status puppet.service
Finally, check the slurm packages currently installed (as a sorted list).
$> rpm -qa | grep -i slurm | sort
Update the hiera puppet profile to include the new version (collect the checksum from the official Slurm download page). Ex:
# SLURM general settings
-slurm::version: '17.11.11'
+slurm::version: '17.11.12'
slurm::do_package_install: false
slurm::uid: 900
slurm::gid: 900
slurm::src_archived: false
-slurm::src_checksum: '3f68c925a8b4a187b9d641b6f341022a'
+slurm::src_checksum: '94fb13b509d23fcf9733018d6c961ca9'
slurm::service_manage: true
Apply puppet: it should build the new version but not install it (`slurm::do_package_install: false`).
### Ex for 17.11.12 - ADAPT accordingly
cd /root/rpmbuild/RPMS/x86_64
# main controller/head + SlurmDBD node
ls --color=never slurm*17.11.12*.rpm
ls --color=never slurm*17.11.12*.rpm | xargs yum -y --nogpgcheck localinstall
# frontend nodes - no need for slurmd
ls --color=never slurm*17.11.12*.rpm | grep -vE '(sql|slurmdbd|slurmctld|slurmd)'
ls --color=never slurm*17.11.12*.rpm | grep -vE '(sql|slurmdbd|slurmctld|slurmd)' | xargs yum -y --nogpgcheck localinstall
# Compute node
ls --color=never slurm*17.11.12*.rpm | grep -vE '(sql|slurmdbd|slurmctld)'
ls --color=never slurm*17.11.12*.rpm | grep -vE '(sql|slurmdbd|slurmctld)' | xargs yum -y --nogpgcheck localinstall
# Check that everything is in order
$> rpm -qa | grep -i slurm | sort # check current version
$> rpm -qa | grep -i slurm | sort | wc -l
Restart the daemons in order, starting with `slurmdbd`:
# 1. slurmdbd
$> tailf /var/log/slurm/slurmdbd.log # if not yet done in one tab
$> systemctl start slurmdbd
Check for the eventual conversion of the database tables. It should end with:
[2018-02-07T22:38:37.072] Conversion done: success!
[2018-02-07T22:38:53.918] slurmdbd version 17.11.12 started
# 2. slurmctld: check carefully the logs while restarting
$> tailf /var/log/slurm/slurmctld.log # if not yet done
$> systemctl start slurmctld
It can take a while. At this stage, and assuming everything went fine, you probably want to update the slurm VM and reboot it.
# 3. slurmd on the computing nodes
$> tailf /var/log/slurm/slurmd.log # if not yet done
$> systemctl start slurmd
Note that I'll need to recheck the module to ensure it works with the 19.* release so it might come with some changes.
> Assuming you set the `do_package_install` and `do_build` variables to true (which is the default), it might still be a bug ;) I don't see why in `slurm::install::packages` it would not proceed to the installation.
Yes, I too found that strange. Looking at this issue again, though, it is obvious that `slurm::install::packages` does in fact proceed with the installation, but fails while doing so:
Error: Could not update: Execution of '/bin/rpm -U --oldpackage --nodeps /root/rpmbuild/RPMS/x86_64/slurm-19.05.2*.rpm' returned 2: file /usr/share/man/man3/Slurm.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
file /usr/share/man/man3/Slurm::Bitstr.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
file /usr/share/man/man3/Slurm::Constant.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
file /usr/share/man/man3/Slurm::Hostlist.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
file /usr/share/man/man3/Slurm::Stepctx.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
file /usr/share/man/man3/Slurmdb.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
Error: /Stage[main]/Slurm::Install/Slurm::Install::Packages[19.05.2]/Package[slurm]/ensure: change from '19.05.0-0rc1.el7' to '19.05.2' failed: Could not update: Execution of '/bin/rpm -U --oldpackage --nodeps /root/rpmbuild/RPMS/x86_64/slurm-19.05.2*.rpm' returned 2: file /usr/share/man/man3/Slurm.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
file /usr/share/man/man3/Slurm::Bitstr.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
file /usr/share/man/man3/Slurm::Constant.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
file /usr/share/man/man3/Slurm::Hostlist.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
file /usr/share/man/man3/Slurm::Stepctx.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
file /usr/share/man/man3/Slurmdb.3pm.gz from install of slurm-19.05.2-1.el7.x86_64 conflicts with file from package slurm-perlapi-19.05.0-0rc1.el7.x86_64
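For what it's worth, every conflict above involves leftover `19.05.0-0rc1` sub-packages, so one possible way out (an assumption on my side, not something the module does) would be to remove the stale release-candidate packages before re-running puppet. A sketch with a mocked package list:

```shell
#!/bin/sh
# Mocked output of `rpm -qa 'slurm*'`; on a real node use the rpm command itself.
installed='slurm-perlapi-19.05.0-0rc1.el7.x86_64
slurm-19.05.2-1.el7.x86_64
slurm-slurmctld-19.05.0-0rc1.el7.x86_64'

# Select only the stale rc packages that cause the file conflicts
echo "$installed" | grep -- '-0rc1\.'

# On a real node this would become (dangerous, double-check the list first):
#   rpm -qa 'slurm*' | grep -- '-0rc1\.' | xargs -r rpm -e
```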
My memory wants to tell me that the above was not present in the `puppet agent -t` runs I did a couple of weeks ago, though it is much more probable that my memory is failing me.
Either way, thank you for the very thorough guide to how you upgrade your cluster. I thought SLURM was durable enough to handle having different minor versions running on different parts of the cluster. But I can imagine how having different versions of the software running slurmctld and slurmdbd would cause issues.
It looks like the module downloads and builds new packages, but doesn't install them. I've changed `version` to `17.11.0-0rc3` and `src_checksum` to `51783493e95b839e6322fb95622adfa5`. The output of the puppet agent log:

New packages appear in `/root/rpmbuild/RPMS/x86_64`, but the installed version is still the old one: