ChristianKniep / docker-compute

Docker image to spawn a compute node running slurmd (+munge), diamond (metric-gathering), sshd, supervisord, logstash-forwarder
MIT License

slurmctld enters a spawn-kill loop #1

Open ocramz opened 8 years ago

ocramz commented 8 years ago

I observe a loop in my fork (NB: I run docker-compute in a Travis instance): once all services have reached a steady state, slurmctld keeps restarting, and the following two lines appear continuously in the second part of the log:

slurmctld_1 | 2016-03-17 14:32:00,617 INFO spawned: 'slurmctld' with pid 9203
slurmctld_1 | 2016-03-17 14:32:00,652 INFO exited: slurmctld (exit status 0; not expected)

Full log, up to the beginning of the loop:

Pulling consul (qnib/consul:latest)...
latest: Pulling from qnib/consul
Digest: sha256:53b8ea7af183312ba70917f4b0f68d5631fced9ae3559d6e29923de78c7bdd52
Status: Downloaded newer image for qnib/consul:latest
Creating dockercompute_consul_1...
Pulling slurmctld (qnib/slurmctld:latest)...
latest: Pulling from qnib/slurmctld
Digest: sha256:81f8c2f2b8f07c92a2c1adca2bc2e2e70ef713ce2bee86cba845761e0254245a
Status: Downloaded newer image for qnib/slurmctld:latest
Creating dockercompute_slurmctld_1...
Pulling compute (qnib/compute:latest)...
latest: Pulling from qnib/compute
Digest: sha256:ce03ba5acd061dfa0aaaeeb48b2b72e9f802ef09df3dfda93c5f7f149ddc609a
Status: Downloaded newer image for qnib/compute:latest
Creating dockercompute_compute_1...
Attaching to dockercompute_consul_1, dockercompute_slurmctld_1, dockercompute_compute_1
consul_1    | 2016-03-17 14:20:15,886 CRIT Supervisor running as root (no user in config file)
consul_1    | 2016-03-17 14:20:15,887 WARN Included extra file "/etc/supervisord.d/consul.ini" during parsing
consul_1    | 2016-03-17 14:20:15,905 INFO RPC interface 'supervisor' initialized
consul_1    | 2016-03-17 14:20:15,905 CRIT Server 'unix_http_server' running without any HTTP authentication checking
consul_1    | 2016-03-17 14:20:15,905 INFO supervisord started with pid 13
consul_1    | 2016-03-17 14:20:16,907 INFO spawned: 'consul' with pid 16
consul_1    | 2016-03-17 14:20:22,447 INFO success: consul entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
compute_1   | 2016-03-17 14:21:39,568 CRIT Supervisor running as root (no user in config file)
compute_1   | 2016-03-17 14:21:39,568 WARN Included extra file "/etc/supervisord.d/slurmd.ini" during parsing
compute_1   | 2016-03-17 14:21:39,568 WARN Included extra file "/etc/supervisord.d/slurm_update.ini" during parsing
compute_1   | 2016-03-17 14:21:39,568 WARN Included extra file "/etc/supervisord.d/munged.ini" during parsing
compute_1   | 2016-03-17 14:21:39,568 WARN Included extra file "/etc/supervisord.d/watchpsutil.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/diamond.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/sensu-api.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/sensu-client.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/sensu-server.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/rsyslog_conf.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/rsyslog.ini" during parsing
compute_1   | 2016-03-17 14:21:39,569 WARN Included extra file "/etc/supervisord.d/consul.ini" during parsing
compute_1   | 2016-03-17 14:21:39,593 INFO RPC interface 'supervisor' initialized
compute_1   | 2016-03-17 14:21:39,593 CRIT Server 'unix_http_server' running without any HTTP authentication checking
compute_1   | 2016-03-17 14:21:39,593 INFO supervisord started with pid 13
slurmctld_1 | 2016-03-17 14:21:12,438 CRIT Supervisor running as root (no user in config file)
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/scratchsetup.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/slurmstats.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/slurmctld.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/slurm_update.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/munged.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/watchpsutil.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/diamond.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/sensu-api.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/sensu-client.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,438 WARN Included extra file "/etc/supervisord.d/sensu-server.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,439 WARN Included extra file "/etc/supervisord.d/rsyslog_conf.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,439 WARN Included extra file "/etc/supervisord.d/rsyslog.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,439 WARN Included extra file "/etc/supervisord.d/consul.ini" during parsing
slurmctld_1 | 2016-03-17 14:21:12,463 INFO RPC interface 'supervisor' initialized
slurmctld_1 | 2016-03-17 14:21:12,463 CRIT Server 'unix_http_server' running without any HTTP authentication checking
slurmctld_1 | 2016-03-17 14:21:12,464 INFO supervisord started with pid 13
slurmctld_1 | 2016-03-17 14:21:13,465 INFO spawned: 'diamond' with pid 16
slurmctld_1 | 2016-03-17 14:21:13,467 INFO spawned: 'slurmctld' with pid 17
slurmctld_1 | 2016-03-17 14:21:13,468 INFO spawned: 'slurmstats' with pid 18
slurmctld_1 | 2016-03-17 14:21:13,474 INFO spawned: 'consul' with pid 19
slurmctld_1 | 2016-03-17 14:21:13,478 INFO spawned: 'sratchsetup' with pid 21
slurmctld_1 | 2016-03-17 14:21:13,481 INFO spawned: 'rsyslog-conf' with pid 22
slurmctld_1 | 2016-03-17 14:21:13,487 INFO spawned: 'sensu-api' with pid 23
slurmctld_1 | 2016-03-17 14:21:13,488 INFO spawned: 'sensu-client' with pid 24
slurmctld_1 | 2016-03-17 14:21:13,509 INFO spawned: 'slurm_update' with pid 28
slurmctld_1 | 2016-03-17 14:21:13,515 INFO spawned: 'rsyslog' with pid 32
slurmctld_1 | 2016-03-17 14:21:13,524 INFO spawned: 'munged' with pid 34
slurmctld_1 | 2016-03-17 14:21:13,532 INFO spawned: 'watchpsutil' with pid 40
slurmctld_1 | 2016-03-17 14:21:13,534 INFO spawned: 'sensu-server' with pid 44
slurmctld_1 | 2016-03-17 14:21:13,543 INFO success: diamond entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:13,554 INFO exited: sratchsetup (exit status 1; not expected)
slurmctld_1 | 2016-03-17 14:21:13,610 INFO exited: sensu-server (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:13,615 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:14,519 INFO success: sensu-api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,519 INFO success: sensu-client entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,519 INFO success: slurm_update entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,519 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,519 INFO success: munged entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,529 INFO success: watchpsutil entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:14,537 INFO exited: diamond (exit status 0; expected)
slurmctld_1 | 2016-03-17 14:21:14,959 INFO spawned: 'slurmctld' with pid 412
slurmctld_1 | 2016-03-17 14:21:14,961 INFO spawned: 'sratchsetup' with pid 413
slurmctld_1 | 2016-03-17 14:21:14,962 INFO spawned: 'sensu-server' with pid 414
slurmctld_1 | 2016-03-17 14:21:15,030 INFO exited: sensu-server (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:15,057 INFO exited: sratchsetup (exit status 1; not expected)
slurmctld_1 | 2016-03-17 14:21:15,100 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:15,545 INFO exited: sensu-api (exit status 0; expected)
slurmctld_1 | 2016-03-17 14:21:17,847 INFO spawned: 'slurmctld' with pid 464
slurmctld_1 | 2016-03-17 14:21:17,876 INFO spawned: 'sratchsetup' with pid 465
slurmctld_1 | 2016-03-17 14:21:17,878 INFO spawned: 'sensu-server' with pid 466
slurmctld_1 | 2016-03-17 14:21:17,894 INFO exited: sratchsetup (exit status 1; not expected)
slurmctld_1 | 2016-03-17 14:21:17,905 INFO exited: sensu-server (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:17,907 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:18,543 INFO success: consul entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:20,983 INFO spawned: 'slurmctld' with pid 520
slurmctld_1 | 2016-03-17 14:21:20,984 INFO spawned: 'sratchsetup' with pid 523
slurmctld_1 | 2016-03-17 14:21:20,986 INFO spawned: 'sensu-server' with pid 524
slurmctld_1 | 2016-03-17 14:21:21,024 INFO exited: sratchsetup (exit status 1; not expected)
slurmctld_1 | 2016-03-17 14:21:21,026 INFO gave up: sratchsetup entered FATAL state, too many start retries too quickly
slurmctld_1 | 2016-03-17 14:21:21,030 INFO exited: sensu-server (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:21,042 INFO gave up: sensu-server entered FATAL state, too many start retries too quickly
slurmctld_1 | 2016-03-17 14:21:21,047 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:25,615 INFO spawned: 'slurmctld' with pid 594
slurmctld_1 | 2016-03-17 14:21:25,665 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:28,545 INFO success: slurmstats entered RUNNING state, process has stayed up for > than 15 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:28,545 INFO success: rsyslog-conf entered RUNNING state, process has stayed up for > than 15 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:21:31,479 INFO spawned: 'slurmctld' with pid 682
slurmctld_1 | 2016-03-17 14:21:31,520 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:37,581 INFO spawned: 'slurmctld' with pid 770
slurmctld_1 | 2016-03-17 14:21:37,615 INFO exited: slurmctld (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:40,596 INFO spawned: 'diamond' with pid 16
compute_1   | 2016-03-17 14:21:40,597 INFO spawned: 'consul' with pid 17
compute_1   | 2016-03-17 14:21:40,599 INFO spawned: 'rsyslog-conf' with pid 18
compute_1   | 2016-03-17 14:21:40,601 INFO spawned: 'sensu-api' with pid 19
compute_1   | 2016-03-17 14:21:40,603 INFO spawned: 'sensu-client' with pid 20
compute_1   | 2016-03-17 14:21:40,604 INFO spawned: 'slurm_update' with pid 21
compute_1   | 2016-03-17 14:21:40,614 INFO spawned: 'rsyslog' with pid 23
compute_1   | 2016-03-17 14:21:40,624 INFO spawned: 'slurmd' with pid 32
compute_1   | 2016-03-17 14:21:40,631 INFO spawned: 'munged' with pid 36
compute_1   | 2016-03-17 14:21:40,636 INFO spawned: 'watchpsutil' with pid 40
compute_1   | 2016-03-17 14:21:40,638 INFO spawned: 'sensu-server' with pid 44
compute_1   | 2016-03-17 14:21:40,639 INFO success: diamond entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
compute_1   | 2016-03-17 14:21:40,700 INFO exited: sensu-server (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:41,627 INFO success: sensu-api entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,627 INFO success: sensu-client entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,627 INFO success: slurm_update entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,628 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,628 INFO success: slurmd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,630 INFO success: munged entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,631 INFO exited: diamond (exit status 0; expected)
compute_1   | 2016-03-17 14:21:41,727 INFO success: watchpsutil entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
compute_1   | 2016-03-17 14:21:41,728 INFO spawned: 'sensu-server' with pid 693
compute_1   | 2016-03-17 14:21:41,742 INFO exited: sensu-server (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:42,656 INFO exited: sensu-api (exit status 0; expected)
compute_1   | 2016-03-17 14:21:43,812 INFO spawned: 'sensu-server' with pid 748
compute_1   | 2016-03-17 14:21:43,824 INFO exited: sensu-server (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:21:45,050 INFO spawned: 'slurmctld' with pid 866
slurmctld_1 | 2016-03-17 14:21:45,091 INFO exited: slurmctld (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:45,650 INFO success: consul entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
compute_1   | 2016-03-17 14:21:47,000 INFO spawned: 'sensu-server' with pid 800
compute_1   | 2016-03-17 14:21:47,018 INFO exited: sensu-server (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:47,047 INFO gave up: sensu-server entered FATAL state, too many start retries too quickly
slurmctld_1 | 2016-03-17 14:21:53,551 INFO spawned: 'slurmctld' with pid 981
slurmctld_1 | 2016-03-17 14:21:53,591 INFO exited: slurmctld (exit status 0; not expected)
compute_1   | 2016-03-17 14:21:55,652 INFO success: rsyslog-conf entered RUNNING state, process has stayed up for > than 15 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:22:03,534 INFO spawned: 'slurmctld' with pid 1119
slurmctld_1 | 2016-03-17 14:22:03,570 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:22:13,937 INFO spawned: 'slurmctld' with pid 1263
slurmctld_1 | 2016-03-17 14:22:13,978 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:22:25,045 INFO spawned: 'slurmctld' with pid 1415
slurmctld_1 | 2016-03-17 14:22:25,082 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:22:37,589 INFO spawned: 'slurmctld' with pid 1593
slurmctld_1 | 2016-03-17 14:22:37,629 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:22:50,951 INFO spawned: 'slurmctld' with pid 1782
slurmctld_1 | 2016-03-17 14:22:50,985 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:23:05,760 INFO spawned: 'slurmctld' with pid 1976
slurmctld_1 | 2016-03-17 14:23:05,798 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:23:16,647 INFO exited: sensu-client (exit status 1; not expected)
slurmctld_1 | 2016-03-17 14:23:17,649 INFO spawned: 'sensu-client' with pid 2134
slurmctld_1 | 2016-03-17 14:23:19,084 INFO success: sensu-client entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:23:21,320 INFO spawned: 'slurmctld' with pid 2173
slurmctld_1 | 2016-03-17 14:23:21,354 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:23:37,887 INFO spawned: 'slurmctld' with pid 2399
slurmctld_1 | 2016-03-17 14:23:37,922 INFO exited: slurmctld (exit status 0; not expected)
compute_1   | 2016-03-17 14:23:43,122 INFO exited: sensu-client (exit status 1; not expected)
compute_1   | 2016-03-17 14:23:43,477 INFO spawned: 'sensu-client' with pid 2395
compute_1   | 2016-03-17 14:23:44,824 INFO success: sensu-client entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
slurmctld_1 | 2016-03-17 14:23:55,099 INFO spawned: 'slurmctld' with pid 2642
slurmctld_1 | 2016-03-17 14:23:55,133 INFO exited: slurmctld (exit status 0; not expected)
slurmctld_1 | 2016-03-17 14:24:13,569 INFO spawned: 'slurmctld' with pid 2886
slurmctld_1 | 2016-03-17 14:24:13,604 INFO exited: slurmctld (exit status 0; not expected)
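
For a log of this length it can help to count the spawn/exit cycles per program instead of reading them by eye. The following is a minimal sketch, assuming the supervisord line format shown above; the function names `summarize` and `respawn_loops` and the threshold of 3 are hypothetical choices for illustration, not part of docker-compute or supervisord.

```python
import re
from collections import Counter

# Matches supervisord event lines as they appear in the docker-compose output,
# e.g. "slurmctld_1 | 2016-03-17 14:32:00,617 INFO spawned: 'slurmctld' with pid 9203"
LINE_RE = re.compile(
    r"^(?P<container>\S+)\s*\|\s+\S+ \S+ INFO "
    r"(?P<event>spawned|exited|gave up): '?(?P<program>[\w-]+)'?"
    r"(?P<rest>.*)$"
)


def summarize(log_text):
    """Return (spawn counts, unexpected-exit counts, programs in FATAL state)."""
    spawned = Counter()
    unexpected = Counter()
    fatal = set()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # "success:", CRIT, WARN and pull messages are ignored
        key = (match["container"], match["program"])
        if match["event"] == "spawned":
            spawned[key] += 1
        elif match["event"] == "exited" and "not expected" in match["rest"]:
            unexpected[key] += 1
        elif match["event"] == "gave up":
            fatal.add(key)
    return spawned, unexpected, fatal


def respawn_loops(log_text, threshold=3):
    """Programs that exited unexpectedly at least `threshold` times (an
    arbitrary cut-off) and that supervisord has not given up on."""
    _, unexpected, fatal = summarize(log_text)
    return sorted(
        key for key, count in unexpected.items()
        if count >= threshold and key not in fatal
    )
```

Calling `respawn_loops(open("compose.log").read())` on a saved copy of the output (the file name is an assumption) would list `(container, program)` pairs that are still being restarted, while programs such as sratchsetup and sensu-server that already entered FATAL are filtered out.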
ChristianKniep commented 8 years ago

I took the liberty of formatting your quotes. I'll have a look. This weekend I am on my way to a conference and I have to take care of the slides first. I hope to get to it by the end of next week. Please remind me if I haven't done so.

Thx for the feedback! I appreciate it...

EDIT: Could you access the slurmctld instance and run supervisorctl stop slurmctld, followed by /usr/local/sbin/slurmctld -D -v -c? Not sure if this is easy to do in Travis...
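
In a CI job without an interactive shell, the two commands can be issued through docker exec. The sketch below only assembles the argument lists; the container name dockercompute_slurmctld_1 is taken from the log above, while the helper name `debug_commands` is hypothetical. It deliberately omits `-ti`, on the assumption that the CI job has no TTY attached.

```python
import shlex


def debug_commands(container="dockercompute_slurmctld_1"):
    """Build the docker exec invocations for the two debugging steps."""
    return [
        # Stop the supervised instance so supervisord no longer respawns it.
        ["docker", "exec", container, "supervisorctl", "stop", "slurmctld"],
        # Run slurmctld by hand with the flags suggested in the comment above;
        # -D keeps it in the foreground so its output is visible.
        ["docker", "exec", container,
         "/usr/local/sbin/slurmctld", "-D", "-v", "-c"],
    ]


if __name__ == "__main__":
    for argv in debug_commands():
        # Print shell-safe command lines; pass argv to subprocess.run to execute.
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the commands first makes it easy to paste them into a CI script; to execute them directly, each list can be handed to `subprocess.run(argv)`.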

ChristianKniep commented 8 years ago

Hey @ocramz,

I renamed the fig.yml file to docker-compose.yml and fixed the Consul environment variables. The problem was that Consul was not running as a server, which broke the slurm.conf creation.

➜  docker-compute git:(master) docker-compose up -d
Creating dockercompute_consul_1
Creating dockercompute_slurmctld_1
Creating dockercompute_compute_1
➜  docker-compute git:(master) docker exec -ti dockercompute_compute_1 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1   idle 2cafd9b00079
odd          up   infinite      1   idle 2cafd9b00079
➜  docker-compute git:(master) docker-compose scale compute=5
Creating and starting 2 ... done
Creating and starting 3 ... done
Creating and starting 4 ... done
Creating and starting 5 ... done
➜  docker-compute git:(master) docker exec -ti dockercompute_compute_1 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      1   idle 2cafd9b00079
all*         up   infinite      1    unk e3586b65af05
odd          up   infinite      1   idle 2cafd9b00079
odd          up   infinite      1    unk e3586b65af05
➜  docker-compute git:(master) docker exec -ti dockercompute_compute_1 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle 0a0c3ede689e,ee478ce106b1
all*         up   infinite      3    unk 2cafd9b00079,6cb0e426299f,e3586b65af05
odd          up   infinite      2   idle 0a0c3ede689e,ee478ce106b1
odd          up   infinite      2    unk 2cafd9b00079,e3586b65af05
even         up   infinite      1    unk 6cb0e426299f
➜  docker-compute git:(master) docker exec -ti dockercompute_compute_1 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      5   idle 0a0c3ede689e,2cafd9b00079,6cb0e426299f,e3586b65af05,ee478ce106b1
odd          up   infinite      4   idle 0a0c3ede689e,2cafd9b00079,e3586b65af05,ee478ce106b1
even         up   infinite      1   idle 6cb0e426299f
➜  docker-compute git:(master) docker exec -ti dockercompute_compute_1 srun -N5 hostname
ee478ce106b1
0a0c3ede689e
6cb0e426299f
e3586b65af05
2cafd9b00079
➜  docker-compute git:(master)
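
The session above repeats sinfo by hand until all five nodes report idle. For a scripted check, for example in a CI job, that wait can be automated. This is a minimal sketch, assuming the six-column sinfo layout and plain comma-separated hostnames exactly as printed above; `parse_sinfo` and `all_idle` are hypothetical helper names.

```python
def parse_sinfo(text):
    """Parse default sinfo output into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()  # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            continue  # skip anything that is not a regular data row
        row = dict(zip(header, fields))
        row["NODES"] = int(row["NODES"])
        # Assumes plain hostnames; bracketed ranges are not handled here.
        row["NODELIST"] = row["NODELIST"].split(",")
        rows.append(row)
    return rows


def all_idle(rows, partition="all*", expected_nodes=None):
    """True if every node of `partition` is idle; if `expected_nodes` is
    given, the idle count must also equal that number."""
    part = [r for r in rows if r["PARTITION"] == partition]
    total = sum(r["NODES"] for r in part)
    idle = sum(r["NODES"] for r in part if r["STATE"] == "idle")
    if expected_nodes is not None:
        return idle == expected_nodes
    return total > 0 and idle == total
```

A polling loop could feed the output of `docker exec dockercompute_compute_1 sinfo` into `parse_sinfo` and sleep until `all_idle(rows, expected_nodes=5)` returns True before running srun.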

Please close the issue if it is solved for you as well. Thx again for the feedback - I depend on it!

ChristianKniep commented 8 years ago

I enhanced the README. It would be great if you could walk through it and check whether it's consistent; I am a bit biased. :)