Open alexlenail opened 8 years ago
@YidingZhou Can you take a look at this?
@YidingZhou @singhkay any news?
@YidingZhou @singhkay still no news?
@singhkay @YidingZhou Why is this labeled question?
@zfrenchee Can you make sure the daemon started? The setup scripts run the following 3 steps.
sudo -u slurm /usr/sbin/slurmctld
sudo munged --force
sudo slurmd
Can you do a "ps" and see if all 3 daemons are running?
Hi @YidingZhou, glad you got back to me. Would you like me to run those sudo commands? From your comment I gather they should have already run, so I didn't...
alex@master:~$ ps a
PID TTY STAT TIME COMMAND
1226 tty1 Ss+ 0:00 /sbin/agetty --noclear tty1 linux
1227 ttyS0 Ss+ 0:00 /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
...
alex@master:~$ srun -N3 hostname
master
worker0
worker1
@zfrenchee
Sorry for not being clear. Can you do a "ps -ef | grep slurm" and see which slurm-related processes are running?
From the last comment it seems that everything is running?
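For anyone following along, the check requested above can be scripted instead of read off `ps` output by eye. This is only a sketch: the `check_daemons` helper is a made-up name, and it assumes `pgrep` (from procps) is installed. The three daemon names are the ones from the setup steps quoted earlier in the thread.

```shell
#!/bin/sh
# Report whether each named daemon currently has a running process.
# check_daemons is a hypothetical helper, not part of slurm or the template.
check_daemons() {
    missing=0
    for name in "$@"; do
        # pgrep -x matches the process name exactly
        if pgrep -x "$name" >/dev/null 2>&1; then
            echo "$name: running"
        else
            echo "$name: NOT running"
            missing=1
        fi
    done
    return "$missing"
}

# The three daemons the setup scripts are said to start.
check_daemons slurmctld munged slurmd || echo "at least one daemon is not running"
```

The function returns non-zero when any daemon is missing, so it can also gate a later step in a script.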
Hi @YidingZhou ,
slurm 3674 1 0 17:54 ? 00:00:01 /usr/sbin/slurmctld
root 3685 1 0 17:54 ? 00:00:00 slurmd
alex 4208 4190 0 18:43 pts/0 00:00:00 grep --color=auto slurm
If you read the repro steps in my original issue, you may understand better what I'm having trouble with.
@zfrenchee so, what you are saying is that,
Is this correct?
If you do a "systemctl status slurmctld.service"
, do you see something like below?
slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start: No such file or directory
I'm asking because I'm hitting this issue now. Just want to double check.
@YidingZhou Before restart:
alex@master:~$ srun -N3 hostname
master
worker0
worker1
alex@master:~$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
Active: inactive (dead)
Condition: start condition failed at Mon 2016-06-06 17:54:19 UTC; 19h ago
After restart:
alex@master:~$ srun -N3 hostname
srun: error: Unable to allocate resources: Unable to contact slurm controller (connect failure)
alex@master:~$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
Active: failed (Result: resources) since Tue 2016-06-07 13:53:14 UTC; 46s ago
Process: 745 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
Jun 07 13:53:14 master systemd[1]: Starting Slurm controller daemon...
Jun 07 13:53:14 master slurmctld[809]: error: chdir(/var/spool): Permission denied
Jun 07 13:53:14 master slurmctld[809]: chdir to /var/tmp
Jun 07 13:53:14 master systemd[1]: slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start: No such file or directory
Jun 07 13:53:14 master systemd[1]: Failed to start Slurm controller daemon.
Jun 07 13:53:14 master systemd[1]: slurmctld.service: Unit entered failed state.
Jun 07 13:53:14 master systemd[1]: slurmctld.service: Failed with result 'resources'.
@YidingZhou @singhkay Would you please change the label from question to bug? Has there been any progress on this? I'm blocked by this.
@zfrenchee Changed label
@YidingZhou is the person to take a look at this
@zfrenchee I'm looking at this. I can reproduce this issue on my side. Will let you know if I make a breakthrough.
@YidingZhou Any news?
@zfrenchee I think I have figured out the issue. It has something to do with StateSaveLocation in slurm.conf and the permissions on /var/spool. If you create a directory named slurm under /var/spool, chown it to slurm:slurm, and move last_config_lite into that directory, you should be able to start slurmctld.
I'm still working on the template to fix this issue.
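The workaround described above can be sketched as a small shell function. This is a sketch under assumptions, not the eventual template fix: `apply_workaround` is a hypothetical name, and the optional arguments exist only so the steps can be tried in a scratch directory first. It does not touch slurm.conf; whether StateSaveLocation also has to point at the new directory is not stated in the comment above.

```shell
#!/bin/sh
# Sketch of the workaround: create <spool>/slurm, hand it to the slurm
# user, and move last_config_lite into it.
# apply_workaround is a hypothetical helper name.
#   $1 = spool directory (defaults to /var/spool)
#   $2 = owner as user:group (defaults to slurm:slurm)
apply_workaround() {
    spool="${1:-/var/spool}"
    owner="${2:-slurm:slurm}"

    mkdir -p "$spool/slurm" || return 1
    chown "$owner" "$spool/slurm" || return 1

    # Only move the state file if it is actually there.
    if [ -f "$spool/last_config_lite" ]; then
        mv "$spool/last_config_lite" "$spool/slurm/" || return 1
    fi
}
```

Called with no arguments and run as root, it acts on /var/spool. Passing a temporary directory and your own user:group lets you rehearse the steps without root.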
I'm having the same issue. Where can I find last_config_lite?
Thanks @YidingZhou! Please let us know once you've fixed this!
One more ping @YidingZhou
Still hoping this can be completed @YidingZhou
last_config_lite is in /var/spool. The workaround, however, does not work for me, @YidingZhou
I've looked into this a bit, here are some issues I've discovered:
I tried a manual install of slurm on a Debian 8 VM due to the rebooting - /var/log group write permission issue above. I was able to get things much closer to running, but I still didn't manage to stand up a slurm installation that worked. I'm getting timeouts on the slurmd and slurmctld services :-(
@zfrenchee @YidingZhou
FYI, here is my set of steps for attempting to bring slurm up on a debian 8 vm:
Install Debian 8 VM
sudo apt-get update/upgrade
sudo apt-get install slurm-llnl
sudo mkdir /var/spool/slurmd
sudo chown slurm /var/spool/slurmd
sudo chgrp slurm /var/spool/slurmd
Create and download slurm.conf on desktop
Had to move StateSaveLocation to /var/spool/slurmd, /var/spool has permissions errors
from desktop, scp slurm.conf %vmhostname%:slurm.conf
sudo cp slurm.conf /etc/slurm-llnl/
sudo systemctl enable munge
sudo shutdown -r now
FYI, the issue with my attempt above is that the slurm-llnl package on debian (and ubuntu) sets up the services with:
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
which differs from the default PID file locations in slurm.conf. Once these two settings were adjusted, things started working properly with the above instructions.
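A quick way to spot this mismatch is to compare the two PID file settings in slurm.conf against the paths the packaged services expect. The sketch below assumes the config lives at /etc/slurm-llnl/slurm.conf (the directory used in the steps above); `conf_value` and `check_pidfiles` are hypothetical helper names, and the key match is case-sensitive for simplicity.

```shell
#!/bin/sh
# Print the value of a Key=Value setting from a slurm.conf-style file.
# Commented-out lines are ignored; if the key appears twice, the last wins.
conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p" "$1" | tail -n 1
}

# Compare both PID file settings with the paths quoted in the thread.
#   $1 = config file (defaults to /etc/slurm-llnl/slurm.conf)
check_pidfiles() {
    conf="${1:-/etc/slurm-llnl/slurm.conf}"
    status=0
    for pair in \
        "SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid" \
        "SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid"
    do
        key="${pair%%=*}"
        want="${pair#*=}"
        have="$(conf_value "$conf" "$key")"
        if [ "$have" = "$want" ]; then
            echo "$key: ok"
        else
            echo "$key: expected $want, found ${have:-nothing}"
            status=1
        fi
    done
    return "$status"
}
```

Running `check_pidfiles` with no argument checks the assumed default path. A non-zero return means at least one setting does not match what the service units wait for.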
Template: Slurm
Issue Details
Upon restarting a VM, slurm seems to break down. How might it be possible to prevent this?
Repro steps
srun -N3 hostname
-> succeeds
Restart the VM
srun -N3 hostname
-> fails with error message: