Closed by rburghol 1 year ago
Had a problem starting slurmd (slurmctld starts OK as long as slurm.conf is OK).
Needed to get cgroup.conf set up:
cp /usr/share/doc/slurmd/examples/cgroup.conf /etc/slurm-llnl/
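A quick way to spot this kind of problem before starting the daemon is to check that the expected files are present in the config directory. This is only a sketch: the function name `missing_slurm_configs` is made up for illustration, and the list of required files (slurm.conf and cgroup.conf) is an assumption based on what this issue ran into, not a complete list for every Slurm setup.

```shell
# Print the name of each expected config file that is absent from a directory.
# The file list below is an assumption drawn from this issue.
missing_slurm_configs() {
    conf_dir=$1
    for name in slurm.conf cgroup.conf; do
        [ -f "$conf_dir/$name" ] || echo "$name"
    done
}

# Usage on the node from this issue:
#   missing_slurm_configs /etc/slurm-llnl
```

No output means both files were found.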
Problem: slurmd will not start on deq2, so jobs never run.
Running sudo systemctl status slurmd.service yields:
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2022-02-23 19:03:00 UTC; 2min 1s ago
Docs: man:slurmd(8)
Process: 512228 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Feb 23 19:03:00 deq2 systemd[1]: Starting Slurm node daemon...
Feb 23 19:03:00 deq2 systemd[1]: slurmd.service: Control process exited, code=exited, status=1/FAILURE
Feb 23 19:03:00 deq2 systemd[1]: slurmd.service: Failed with result 'exit-code'.
Feb 23 19:03:00 deq2 systemd[1]: Failed to start Slurm node daemon.
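The status output above only reports exit status 1, not the reason. The sketch below does two things: a small helper (`unit_active_state`, a hypothetical name) pulls the state word out of `systemctl status` text, and the comments list two standard places to look for the actual error message. Which of them shows the useful message on a given install is not something this issue confirms.

```shell
# Read `systemctl status` text on stdin and print the word after "Active:".
unit_active_state() {
    awk '$1 == "Active:" { print $2; exit }'
}

# Usage:
#   systemctl status slurmd.service | unit_active_state
#
# To look for the underlying error message:
#   sudo journalctl -u slurmd.service -n 50 --no-pager   # recent log lines for the unit
#   sudo slurmd -D -vvv                                  # run slurmd in the foreground, verbose
```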
Overview
/etc/slurm-llnl/slurm.conf
slurm.conf options
cat /etc/slurm-llnl/slurm.conf |grep -v "#"
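Note that `grep -v "#"` drops every line containing a `#`, including setting lines that carry a trailing comment. A slightly stricter filter, sketched below with the hypothetical helper name `active_conf_lines`, skips only blank lines and lines that start with a comment.

```shell
# Print only the active settings in a config file: skip blank lines and
# lines whose first non-space character is '#'. Lines with a trailing
# comment are kept as they are.
active_conf_lines() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Usage:
#   active_conf_lines /etc/slurm-llnl/slurm.conf
```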
sudo apt install slurm-wlm
mkdir /opt/model/slurm
sudo chown slurm:modelers /opt/model/slurm
sudo mkdir /var/spool/slurmctld
sudo chown slurm:modelers /var/spool/slurmctld
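The mkdir/chown pairs above can be wrapped in one helper so the step is safe to rerun. A minimal sketch, assuming the same paths and the slurm:modelers ownership used in this issue; `make_owned_dir` is a hypothetical name.

```shell
# Create a directory (and any missing parents) and set its owner and group.
# Safe to rerun: mkdir -p does nothing if the directory already exists.
make_owned_dir() {
    dir=$1
    owner=$2
    group=$3
    mkdir -p "$dir" && chown "$owner:$group" "$dir"
}

# Usage with the values from this issue (run from a root shell):
#   make_owned_dir /opt/model/slurm slurm modelers
#   make_owned_dir /var/spool/slurmctld slurm modelers
```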
config files: paste the configurator output (https://slurm.schedmd.com/configurator.html) into slurm.conf
nano /etc/slurm-llnl/slurm.conf
copy basic cgroup.conf file
cp /usr/share/doc/slurmd/examples/cgroup.conf /etc/slurm-llnl/
get it all ready to run automatically
sudo systemctl enable munge
sudo systemctl enable slurmctld
sudo systemctl enable slurmd
start them up
sudo systemctl start munge
sudo systemctl start slurmctld
sudo systemctl start slurmd
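The enable and start steps can also be combined with `systemctl enable --now`, looping over the services in the same order as above. This is a sketch rather than the method used in this issue: `start_slurm_stack` is a hypothetical name, and the `SYSTEMCTL` variable is a hypothetical override added only so the loop can be dry-run without touching the system.

```shell
# Enable and start munge, slurmctld, and slurmd, in that order.
# SYSTEMCTL is a hypothetical override for dry runs; default is `sudo systemctl`.
start_slurm_stack() {
    for svc in munge slurmctld slurmd; do
        # Left unquoted on purpose so "sudo systemctl" splits into two words.
        ${SYSTEMCTL:-sudo systemctl} enable --now "$svc" || return 1
    done
}

# Dry run (prints the arguments instead of calling systemctl):
#   SYSTEMCTL=echo start_slurm_stack
# Real run:
#   start_slurm_stack
```

The loop stops at the first service that fails, which makes it clearer which daemon is the one to investigate.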
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug     up    infinite  1     unk   deq2
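The unk state for deq2 is probably just the controller not having heard from slurmd yet, which would fit the failed service above. To list only the nodes that need attention, something like the sketch below could work. `unhealthy_nodes` is a hypothetical helper, and treating idle, allocated, and mixed as the only healthy states is an assumption made for this example.

```shell
# Read "node state" pairs on stdin and print the ones whose state is not
# idle, allocated, or mixed. The choice of healthy states is an assumption.
unhealthy_nodes() {
    awk '$2 !~ /^(idle|allocated|mixed)$/ { print $1 " " $2 }'
}

# Usage: one line per node, no header, node name and long-form state.
#   sinfo -N -h -o "%N %T" | unhealthy_nodes
```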