ComputeCanada / puppet-magic_castle

Puppet Environment repo for Magic Castle - https://github.com/ComputeCanada/magic_castle
MIT License

make sure that `/var/lib/slurm` exists on login and mgmt nodes #353

Closed: ostueker closed this issue 7 months ago

ostueker commented 7 months ago

When trying to run certain scontrol commands (e.g. creating a reservation or draining a node) a directory /var/lib/slurm is required on the host that's issuing the command.

It would be good if Puppet made sure that this directory exists.

cmd-ntrf commented 7 months ago

Can you provide a specific command that requires /var/lib/slurm and the error message that is output when the directory is missing? The Slurm version would also be useful.

I have called scontrol from a login node (which does not have /var/lib/slurm) to power up and down a node without issue in the past. Draining also appears to have no issue with /var/lib/slurm missing:

[centos@login1 ~]$ sudo /opt/software/slurm/bin/scontrol update nodename=node1 state=DRAIN reason="test"
[centos@login1 ~]$

ostueker commented 7 months ago

Sure. Here you go:

[centos@mgmt1 ~]$ sudo -iu slurm /opt/software/slurm/bin/scontrol update  nodename=nodecpu4 state=drain reason="test"
sudo: unable to change directory to /var/lib/slurm: No such file or directory
[centos@mgmt1 ~]$ sudo mkdir -p /var/lib/slurm
[centos@mgmt1 ~]$ sudo -iu slurm /opt/software/slurm/bin/scontrol update  nodename=nodecpu4 state=drain reason="test"
[centos@mgmt1 ~]$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
test                 slurm     2024-04-17T16:54:00 nodecpu4

and

[centos@login1 ~]$ sudo -iu slurm /opt/software/slurm/bin/scontrol create reservation user=root,user097,user100 starttime=now duration=7-00:00:00 flags=maint,ignore_jobs nodes=nodecpu4 
sudo: unable to change directory to /var/lib/slurm: No such file or directory
Reservation created: root_1
[centos@login1 ~]$ sudo mkdir -p /var/lib/slurm
[centos@login1 ~]$ sudo -iu slurm /opt/software/slurm/bin/scontrol create reservation user=root,user097,user100 starttime=now duration=7-00:00:00 flags=maint,ignore_jobs nodes=nodecpu4 
Reservation created: root_2

cmd-ntrf commented 7 months ago

The problem is not with scontrol per se. The home directory of the slurm user is /var/lib/slurm, and you are running sudo with the login flag (-i / --login), which is unnecessary.

sudo -u slurm /opt/software/slurm/bin/scontrol update nodename=nodecpu4 state=drain reason="test"

returns no error.
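
For anyone who wants to check this on their own cluster: the message comes from sudo, which with -i tries to change into the target user's home directory as recorded in the passwd database. In the reservation example above the command still ran despite the message. The following is a minimal sketch, not part of Magic Castle; the function names home_from_entry and check_login_home are hypothetical, and it assumes getent is available on the host.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Magic Castle or Slurm).

# Extract the home directory, the 6th colon-separated field,
# from a single passwd entry passed as the first argument.
home_from_entry() {
    printf '%s\n' "$1" | cut -d: -f6
}

# Report whether the home directory that `sudo -i -u USER` would
# change into exists on this host.
# Returns 0 if it exists, 1 if it is missing, 2 if the user is unknown.
check_login_home() {
    user="$1"
    # Assumption: getent is installed (true on typical glibc-based systems).
    entry=$(getent passwd "$user") || {
        echo "$user: no such user"
        return 2
    }
    home=$(home_from_entry "$entry")
    if [ -d "$home" ]; then
        echo "$user: home $home exists"
    else
        echo "$user: home $home is missing"
        return 1
    fi
}
```

Running `check_login_home slurm` on a login or mgmt node should show whether `sudo -i -u slurm` will print the "unable to change directory" message there. If it reports the directory as missing, dropping -i as suggested above avoids the message without creating anything.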