Azure / azure-quickstart-templates

Azure Quickstart Templates
https://aka.ms/azqst
MIT License

srun: error: Unable to allocate resources: Unable to contact slurm controller (connect failure) #1796

Open alexlenail opened 8 years ago

alexlenail commented 8 years ago

Template: Slurm

Issue Details

Upon restarting a VM, Slurm seems to break down. How might this be prevented?

Repro steps

  1. Use the button to launch a slurm cluster.
  2. ssh into master
  3. srun -N3 hostname -> succeeds
  4. Restart master
  5. ssh into master
  6. srun -N3 hostname -> fails with error message:
srun: error: Unable to allocate resources: Unable to contact slurm controller (connect failure)
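
A quick way to confirm that the controller itself is down, rather than a node or network problem, is something like the following (a minimal check; scontrol ping reports the controller state, and the unit name matches the slurm-llnl packaging referenced later in this thread):

scontrol ping
sudo systemctl status slurmctld.service
ps -ef | grep slurmctld | grep -v grep
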
singhkays commented 8 years ago

@YidingZhou Can you take a look at this?

alexlenail commented 8 years ago

@YidingZhou @singhkay any news?

alexlenail commented 8 years ago

@YidingZhou @singhkay still no news?

alexlenail commented 8 years ago

@singhkay @YidingZhou Why is this labeled question?

YidingZhou commented 8 years ago

@zfrenchee Can you make sure the daemons started? The setup scripts run the following 3 steps:

sudo -u slurm /usr/sbin/slurmctld
sudo munged --force
sudo slurmd

Can you do a "ps" and see if all 3 daemons are running?
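
For reference, a check along those lines might look like this (a sketch; the process names come from the three commands above):

ps -ef | grep -E 'slurmctld|slurmd|munged' | grep -v grep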

alexlenail commented 8 years ago

Hi @YidingZhou, glad you got back to me. Would you like me to run those sudo commands? From your comment I gather they should have already run, so I didn't...

alex@master:~$ ps a
  PID TTY      STAT   TIME COMMAND
 1226 tty1     Ss+    0:00 /sbin/agetty --noclear tty1 linux
 1227 ttyS0    Ss+    0:00 /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
...
alex@master:~$ srun -N3 hostname
master
worker0
worker1

YidingZhou commented 8 years ago

@zfrenchee

Sorry for not being clear. Can you do a "ps -ef |grep slurm" and see which slurm related processes are running?

From the last comment it seems that everything is running?

alexlenail commented 8 years ago

Hi @YidingZhou ,

slurm     3674     1  0 17:54 ?        00:00:01 /usr/sbin/slurmctld
root      3685     1  0 17:54 ?        00:00:00 slurmd
alex      4208  4190  0 18:43 pts/0    00:00:00 grep --color=auto slurm

If you read the repro steps in my original issue, you may understand better what I'm having trouble with.

YidingZhou commented 8 years ago

@zfrenchee so, what you are saying is that,

1. Right after installation, everything runs fine and you can see the slurmctld process running. But

2. After a reboot of the master node, srun complains that it is unable to contact the slurm controller, and you cannot see the slurmctld/slurmd processes.

Is this correct?

If you do a "systemctl status slurmctld.service", do you see something like below?

slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start: No such file or directory

I'm asking because I'm hitting this issue now. Just want to double check.
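
If that message does appear, it may also be worth checking whether the runtime directory systemd expects the PID file in still exists after the reboot (a minimal check, using the path from the error message above):

ls -ld /var/run/slurm-llnl
ls -l /var/run/slurm-llnl/slurmctld.pid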

alexlenail commented 8 years ago

@YidingZhou Before restart:

alex@master:~$ srun -N3 hostname
master
worker0
worker1
alex@master:~$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
Condition: start condition failed at Mon 2016-06-06 17:54:19 UTC; 19h ago

After restart:

alex@master:~$ srun -N3 hostname
srun: error: Unable to allocate resources: Unable to contact slurm controller (connect failure)
alex@master:~$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
   Active: failed (Result: resources) since Tue 2016-06-07 13:53:14 UTC; 46s ago
  Process: 745 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)

Jun 07 13:53:14 master systemd[1]: Starting Slurm controller daemon...
Jun 07 13:53:14 master slurmctld[809]: error: chdir(/var/spool): Permission denied
Jun 07 13:53:14 master slurmctld[809]: chdir to /var/tmp
Jun 07 13:53:14 master systemd[1]: slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start: No such file or directory
Jun 07 13:53:14 master systemd[1]: Failed to start Slurm controller daemon.
Jun 07 13:53:14 master systemd[1]: slurmctld.service: Unit entered failed state.
Jun 07 13:53:14 master systemd[1]: slurmctld.service: Failed with result 'resources'.
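
The chdir(/var/spool): Permission denied line in that log suggests slurmctld cannot use /var/spool as its state directory. A quick way to see what it is configured to use and who owns that directory (a sketch, assuming the config path /etc/slurm-llnl/slurm.conf mentioned later in the thread):

ls -ld /var/spool
grep -i StateSaveLocation /etc/slurm-llnl/slurm.conf
id slurm
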
alexlenail commented 8 years ago

@YidingZhou @singhkay Would you please change the label from question to bug? Has there been any progress on this? I'm blocked by this.

singhkays commented 8 years ago

@zfrenchee Changed label

@YidingZhou is the person to take a look at this

YidingZhou commented 8 years ago

@zfrenchee I'm looking at this. I can reproduce this issue on my side. Will let you know if I make a breakthrough.

alexlenail commented 8 years ago

@YidingZhou Any news?

YidingZhou commented 8 years ago

@zfrenchee I think I have figured out the issue. It has something to do with StateSaveLocation in slurm.conf and permissions on /var/spool. If you create a directory slurm under /var/spool, chown it to slurm:slurm, and move last_config_lite into that directory, you should then be able to start slurmctld.

I'm still working on the template to fix this issue.
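
For anyone who wants to try this before the template is fixed, the workaround described above would look roughly like the following sketch (it assumes last_config_lite currently sits directly under /var/spool and that slurm.conf lives at /etc/slurm-llnl/slurm.conf, as later comments suggest):

sudo mkdir -p /var/spool/slurm
sudo chown slurm:slurm /var/spool/slurm
sudo mv /var/spool/last_config_lite /var/spool/slurm/
# point StateSaveLocation in /etc/slurm-llnl/slurm.conf at the new directory,
# e.g. StateSaveLocation=/var/spool/slurm, then restart the controller
sudo systemctl restart slurmctld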

GlastonburyC commented 8 years ago

I'm having the same issue. Where can I find last_config_lite?

alexlenail commented 8 years ago

Thanks @YidingZhou! Please let us know once you've fixed this!

alexlenail commented 8 years ago

One more ping @YidingZhou

alexlenail commented 8 years ago

Still hoping this can be completed @YidingZhou

alexlenail commented 8 years ago

last_config_lite is in /var/spool. The workaround, however, does not work for me, @YidingZhou

JayBeavers commented 8 years ago

I've looked into this a bit, here are some issues I've discovered:

I tried a manual install of Slurm on a Debian 8 VM because of the rebooting / /var/log group-write-permission issue above. I was able to get things much closer to running, but I still didn't manage to stand up a Slurm installation that worked. I'm getting timeouts on the slurmd and slurmctld services :-(

@zfrenchee @YidingZhou

JayBeavers commented 8 years ago

FYI, here is my set of steps for attempting to bring slurm up on a debian 8 vm:

  1. Install Debian 8 VM
  2. sudo apt-get update/upgrade
  3. sudo apt-get install slurm-llnl
  4. sudo mkdir /var/spool/slurmd
  5. sudo chown slurm /var/spool/slurmd
  6. sudo chgrp slurm /var/spool/slurmd
  7. Create and download slurm.conf on the desktop (had to move StateSaveLocation to /var/spool/slurmd; /var/spool has permissions errors)
  8. From the desktop, scp slurm.conf %vmhostname%:slurm.conf
  9. sudo cp slurm.conf /etc/slurm-llnl/
  10. sudo systemctl enable munge
  11. sudo shutdown -r now

slurm.conf.txt

JayBeavers commented 8 years ago

FYI, the issue with my attempt above is that the slurm-llnl package on debian (and ubuntu) sets up the services with:

SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid

which is different from the default slurm.conf PID file locations. Once these two settings were adjusted, things started working properly with the above instructions.
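
Pulling these together, the relevant slurm.conf settings for the Debian/Ubuntu slurm-llnl packaging would look roughly like the sketch below (the state and spool paths are the ones mentioned earlier in this thread, not values verified against the current template):

# PID files where the packaged systemd units expect them
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
# state/spool directories that must exist and be owned by slurm:slurm
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd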