HARPgroup / cbp_wsm

Install and configure slurm #52

Closed rburghol closed 1 year ago

rburghol commented 2 years ago

Overview

slurm.conf options


#### apt install slurm 
- old version

sudo apt install slurm-wlm
mkdir /opt/model/slurm
sudo chown slurm:modelers /opt/model/slurm
sudo mkdir /var/spool/slurmctld
sudo chown slurm:modelers /var/spool/slurmctld

Config files: paste the output of the configurator (https://slurm.schedmd.com/configurator.html) into slurm.conf:

nano /etc/slurm-llnl/slurm.conf

copy basic cgroup.conf file

cp /usr/share/doc/slurmd/examples/cgroup.conf /etc/slurm-llnl/
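For orientation, a single-node slurm.conf of the kind the configurator produces might look roughly like the sketch below. The keys are standard slurm.conf parameters, but every value (cluster name, CPU count) and the helper name `write_example_slurm_conf` are assumptions for illustration, not the configuration actually used on deq2. `StateSaveLocation` simply reuses the `/var/spool/slurmctld` directory created earlier, and the file is written to a scratch path rather than `/etc/slurm-llnl/`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an example single-node slurm.conf to the given path.
# All values are illustrative assumptions; compare them with the configurator
# output and with `slurmd -C` (prints the hardware slurmd detects) before use.
write_example_slurm_conf() {
    local out="${1:-./slurm.conf.example}"
    cat > "$out" <<'EOF'
ClusterName=cluster
SlurmctldHost=deq2
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
ProctrackType=proctrack/cgroup
# CPUs is a placeholder; set it to match the real hardware.
NodeName=deq2 CPUs=4 State=UNKNOWN
PartitionName=debug Nodes=deq2 Default=YES MaxTime=INFINITE State=UP
EOF
}
```

Calling `write_example_slurm_conf /tmp/slurm.conf.example` writes the file for review; it would only be copied into place after editing.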

Enable the services so they start automatically at boot:

sudo systemctl enable munge
sudo systemctl enable slurmctld
sudo systemctl enable slurmd

start them up

sudo systemctl start munge
sudo systemctl start slurmctld
sudo systemctl start slurmd
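The enable and start steps can also be folded into one loop. This is only a sketch: `slurm_services_up` and the `DRY_RUN` switch are hypothetical names, and `systemctl enable --now` is used as shorthand for enabling and starting a unit in one call. munge goes first because, with the default `auth/munge` setting, the Slurm daemons authenticate through it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: enable and start munge, slurmctld and slurmd in order.
# With DRY_RUN=1 the commands are only printed, not executed.
slurm_services_up() {
    local svc
    for svc in munge slurmctld slurmd; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "sudo systemctl enable --now $svc"
        else
            # Stop at the first failure so a broken unit is not masked by later ones.
            sudo systemctl enable --now "$svc" || return 1
        fi
    done
}
```

Running `DRY_RUN=1 slurm_services_up` first shows what would be executed without touching the system.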

#### apt install slurmdbd
- `sudo apt install slurmdbd`

##### Configure slurmdbd

- The install above seems to work; `sinfo` appears to generate reasonable info:

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug        up   infinite      1    unk deq2



- Bring the node online
  - `sudo scontrol update NodeName=deq2 State=resume`
- List active jobs
  - `squeue`
- Cancel a job (here, job ID 12)
  - `scancel 12`
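The `sinfo` output and the resume command can be tied together with a small filter. The sketch below reads default-format `sinfo` output and prints one `scontrol update` command per node that looks offline. The function name is hypothetical, and treating `down`, `drain` and `unk` as the states worth resuming is an assumption, not something established in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read default `sinfo` output on stdin and print one
# `scontrol update ... State=resume` command per node that looks offline.
# Which states count as "offline" (down, drain, unk) is an assumption.
resume_cmds_from_sinfo() {
    awk '
        $1 == "PARTITION" { next }       # skip the header row if present
        {
            state = $5                   # default columns: STATE is 5th, NODELIST 6th
            sub(/\*$/, "", state)        # a trailing * marks a non-responding node
            if (state == "down" || state == "drain" || state == "unk")
                print "sudo scontrol update NodeName=" $6 " State=resume"
        }
    ' | sort -u                          # a node in several partitions is listed once
}
```

`sinfo | resume_cmds_from_sinfo` prints the commands for review rather than running them, so they can be checked before being executed by hand.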

rburghol commented 2 years ago

Had a problem starting slurmd (slurmctld started OK as long as slurm.conf was OK). Needed to get cgroup.conf set up: `cp /usr/share/doc/slurmd/examples/cgroup.conf /etc/slurm-llnl/`
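Since a missing cgroup.conf was enough to keep slurmd from starting, a quick pre-flight check for the two config files mentioned here may save a restart cycle. This is a minimal sketch; `check_slurm_configs` is a hypothetical name, and it only tests that the files exist, not that their contents are valid.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: report which of the config files mentioned in
# this thread are missing from a Slurm config directory.
# Prints one "missing: <path>" line per absent file; returns non-zero if any is missing.
check_slurm_configs() {
    local dir="${1:-/etc/slurm-llnl}" rc=0 f
    for f in slurm.conf cgroup.conf; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            rc=1
        fi
    done
    return "$rc"
}
```

With no argument it checks `/etc/slurm-llnl`; for example, `check_slurm_configs && sudo systemctl start slurmd` would only attempt the start once both files are present.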

Problem: slurmd will not start on deq2, so jobs never run on it. Running `sudo systemctl status slurmd.service` yields:


● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-02-23 19:03:00 UTC; 2min 1s ago
       Docs: man:slurmd(8)
    Process: 512228 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)

Feb 23 19:03:00 deq2 systemd[1]: Starting Slurm node daemon...
Feb 23 19:03:00 deq2 systemd[1]: slurmd.service: Control process exited, code=exited, status=1/FAILURE
Feb 23 19:03:00 deq2 systemd[1]: slurmd.service: Failed with result 'exit-code'.
Feb 23 19:03:00 deq2 systemd[1]: Failed to start Slurm node daemon.
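The status output only reports that slurmd exited with status 1, not why. A few standard commands usually surface the underlying message; the sketch below prints them by default and runs them with `DRY_RUN=0`. The function name and the `DRY_RUN` switch are hypothetical, and none of this is taken from the original troubleshooting. `slurmd -D -vvv` runs the daemon in the foreground with extra verbosity, and `slurmd -C` prints the hardware configuration slurmd detects, which can be compared with the `NodeName` line in slurm.conf.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (default) or run commands that tend to reveal
# why slurmd refuses to start. Set DRY_RUN=0 to execute them.
slurmd_diagnose() {
    local cmd
    while IFS= read -r cmd; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd"
        else
            echo "+ $cmd"
            # </dev/null keeps the command from consuming the list being read.
            sh -c "$cmd" </dev/null
        fi
    done <<'EOF'
journalctl -u slurmd.service -n 50 --no-pager
slurmd -C
sudo slurmd -D -vvv
EOF
}
```

The foreground run is listed last because, if slurmd does start, it keeps running until interrupted with Ctrl-C.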