docker-archive / for-azure


Azure Scale is missing nodes and has ghost nodes #59

Open · djeeg opened this issue 6 years ago

djeeg commented 6 years ago

Expected behavior

Swarm nodes match nodes in VMSS in Azure

Actual behavior

Missing nodes; ghost nodes

Information

Started with 3 nodes

$ docker node ls
ID                            HOSTNAME              STATUS              AVAILABILITY        MANAGER STATUS
n4ox0f13d *   swarm-manager000000   Ready               Active              Leader
hwovd2ti3     swarm-worker000000    Ready               Active
lkym8b4dz     swarm-worker000001    Ready               Active

Scale manager nodes in Azure from 1 to 3

[Screenshots: Azure portal scaling the manager VM scale set from 1 to 3 instances]
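
(For reference, the same scale operation can also be done from the Azure CLI; the resource group and scale set names below are placeholders, not taken from this deployment.)

$ az vmss scale --resource-group <resource-group> --name <manager-scale-set> --new-capacity 3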

Wait a while, and the first new manager is detected

$ docker node ls
ID                            HOSTNAME              STATUS              AVAILABILITY        MANAGER STATUS
i6kcjzg5wt                          Unknown             Active              Reachable
n4ox0f13d *   swarm-manager000000   Ready               Active              Leader
hwovd2ti    swarm-worker000000    Ready               Active
lkym8b4d     swarm-worker000001    Ready               Active

Wait a bit longer; something odd is going on here

$ docker node ls
ID                            HOSTNAME              STATUS              AVAILABILITY        MANAGER STATUS
3r2bep02jw                         Down                Active
bcksfpcam2                           Unknown             Active
dgcbalta3i                           Unknown             Active
lfxw2f00b5                           Down                Active
vp0r6wj5m                          Unknown             Active
n4ox0f13d *   swarm-manager000000   Ready               Active              Reachable
i6kcjzg5wt8     swarm-manager000002   Ready               Active              Leader
hwovd2ti36     swarm-worker000000    Ready               Active
lkym8b4dz0     swarm-worker000001    Ready               Active

Wait an hour; the swarm has still not formed and there are lots of extra nodes (I put manager000002 in Drain)

$ docker node ls
ID                            HOSTNAME              STATUS              AVAILABILITY        MANAGER STATUS
3r2bep02jw                          Down                Active
bcksfpcam2                         Down                Active
dgcbalta3if                           Down                Active
lfxw2f00b5m                        Down                Active
vp0r6wj5m                           Down                Active
n4ox0f13dt     swarm-manager000000   Ready               Active              Reachable
i6kcjzg5wt8 *   swarm-manager000002   Ready               Drain               Leader
hwovd2ti36     swarm-worker000000    Ready               Active
lkym8b4dz0     swarm-worker000001    Ready               Active
djeeg commented 6 years ago

If I now scale up to 4 managers

[Screenshot: Azure portal scaling the manager VM scale set to 4 instances]

At least I have 3 now.

$ docker node ls
ID                            HOSTNAME              STATUS              AVAILABILITY        MANAGER STATUS
3r2bep02jw                           Down                Active
bcksfpcam2                           Down                Active
dgcbalta3i                           Down                Active
vp0r6wj5m0                           Down                Active
lfxw2f00b5                           Down                Active
n4ox0f13dt     swarm-manager000000   Ready               Active              Reachable
i6kcjzg5wt *   swarm-manager000002   Ready               Drain               Leader
isy1lqj6b1     swarm-manager000003   Ready               Active              Reachable
hwovd2ti36     swarm-worker000000    Ready               Active
lkym8b4dz0     swarm-worker000001    Ready               Active

Possibly there is a restriction on how many nodes can be scaled at the same time, though I don't see any mention of it in the docs: https://docs.docker.com/docker-for-azure/why/

FrenchBen commented 6 years ago

For a reliable cluster, the ideal number of managers is 3. When you scale up, can you SSH into the machines and see what the init container's log output is? It'll hint at what's happening and why the machines aren't joining.
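
Roughly something like this, assuming the default Docker for Azure SSH setup (the docker user on the manager public IP, with agent forwarding to hop to other nodes):

$ ssh -A docker@<manager-public-ip>
swarm-manager000000:~$ ssh swarm-manager000004
swarm-manager000004:~$ sudo docker ps -a    # the init/agent system containers should show up here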

djeeg commented 6 years ago

Tried to scale up another manager today, and the new instance didn't join the swarm.

swarm-manager000004:~$ docker logs $(sudo docker ps -a | grep "agent-azure" | awk '{print $1}')

The only error I could see:

2018-03-24 11:33:06,951 CRIT Set uid to user 0
2018-03-24 11:33:06,953 INFO supervisord started with pid 2154
2018-03-24 11:33:07,955 INFO spawned: 'walinuxagent' with pid 2182
2018-03-24 11:33:07,958 INFO spawned: 'sshd' with pid 2183
2018-03-24 11:33:08,039 DEBG 'walinuxagent' stdout output:
2018/03/24 11:33:08.039162 INFO Azure Linux Agent Version:2.2.0

2018-03-24 11:33:08,220 DEBG 'walinuxagent' stdout output:
2018/03/24 11:33:08.039883 INFO OS: alpine 3.5.0
2018/03/24 11:33:08.039971 INFO Python: 2.7.13
2018/03/24 11:33:08.040162 INFO Run daemon
2018/03/24 11:33:08.040506 INFO No RDMA handler exists for distro='' version='3.5.0'
2018/03/24 11:33:08.040640 INFO Clean protocol
2018/03/24 11:33:08.040772 INFO Running provisioning handler
2018/03/24 11:33:08.040838 INFO Copying ovf-env.xml
2018/03/24 11:33:08.041388 VERBOSE run cmd 'mount'
2018/03/24 11:33:08.044895 VERBOSE run cmd 'mount -o ro -t udf,iso9660 /dev/sr0 /mnt/cdrom/secure'

...................

2018-03-24 11:34:15,351 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.351412 ERROR Command: 'iptables --version'

2018-03-24 11:34:15,351 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.351591 ERROR Return code: 127

2018-03-24 11:34:15,352 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.351763 ERROR Result: /bin/sh: iptables: not found

2018-03-24 11:34:15,352 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.352025 WARNING Unable to determine version of iptables
2018/03/24 11:34:15.352160 WARNING Unable to retrieve firewall packets droppedUnable to determine version of iptables
djeeg commented 6 years ago

Actually, probably ignore that previous error; I see the same error on instances that have correctly joined the swarm.

Comparing the last log entries of a failed instance to those of a successful instance, the failed instance's logs seem to just stop after the first ProcessGoalState entry.
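
(To pull out just the goal-state events for a quick side-by-side, a grep over the same agent-azure container works, e.g.:)

$ docker logs $(sudo docker ps -a | grep "agent-azure" | awk '{print $1}') 2>&1 | grep ProcessGoalState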

Failed

2018-03-24 11:34:15,369 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.368919 INFO WALinuxAgent-2.2.25 running as process 2537

2018-03-24 11:34:15,369 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.369102 INFO Purging disk cache, current incarnation is 1

2018-03-24 11:34:15,371 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.370697 INFO Event: name=WALinuxAgent, op=Partition, message=37, duration=0

2018-03-24 11:34:15,372 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.372531 INFO Event: name=WALinuxAgent, op=AutoUpdate, message=, duration=0

2018-03-24 11:34:15,373 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.373235 INFO Wire server endpoint:xxx.xx.xxx.xx

2018-03-24 11:34:15,420 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.420741 INFO Wire server endpoint:xxx.xx.xxx.xx

2018-03-24 11:34:15,464 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.464574 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 1, duration=43

Successful

2018-03-19 05:12:45,618 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.618820 INFO WALinuxAgent-2.2.23 running as process 2574

2018-03-19 05:12:45,619 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.619260 INFO Event: name=WALinuxAgent, op=Partition, message=61, duration=0

2018-03-19 05:12:45,622 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.622386 INFO Event: name=WALinuxAgent, op=AutoUpdate, message=, duration=0

2018-03-19 05:12:45,623 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.622930 INFO Wire server endpoint:xxx.xx.xxx.xx

2018-03-19 05:12:45,623 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.623571 INFO Wire server endpoint:xxx.xx.xxx.xx

2018-03-19 05:12:45,624 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.624104 INFO Purging disk cache, current incarnation is 1

2018-03-19 05:12:45,670 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.669835 INFO Wire server endpoint:xxx.xx.xxx.xx

2018-03-19 05:12:45,715 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.715651 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 1, duration=45

2018-03-19 06:02:17,117 DEBG 'walinuxagent' stdout output:
2018/03/19 06:02:17.117465 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 2, duration=147

2018-03-19 06:24:51,720 DEBG 'walinuxagent' stdout output:
2018/03/19 06:24:51.720614 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 3, duration=171

2018-03-19 06:33:42,730 DEBG 'walinuxagent' stdout output:
2018/03/19 06:33:42.730429 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 4, duration=146

2018-03-19 06:34:17,439 DEBG 'walinuxagent' stdout output:
2018/03/19 06:34:17.438830 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 5, duration=181

2018-03-19 07:52:31,639 DEBG 'walinuxagent' stdout output:
2018/03/19 07:52:31.639793 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 6, duration=167

2018-03-19 08:41:11,751 DEBG 'walinuxagent' stdout output:
2018/03/19 08:41:11.751718 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 7, duration=105

2018-03-19 08:42:14,127 DEBG 'walinuxagent' stdout output:
2018/03/19 08:42:14.127048 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 8, duration=157
djeeg commented 6 years ago

Though on another new scale set instance, the logs also end at op=ProcessGoalState, message=Incarnation 1, yet this instance has joined the swarm.

swarm-manager000005:~$ docker logs $(sudo docker ps -a | grep "agent-azure" | awk '{print $1}')

2018-03-24 12:01:29,438 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.437905 INFO WALinuxAgent-2.2.25 running as process 2952

2018-03-24 12:01:29,438 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.438284 INFO Event: name=WALinuxAgent, op=Partition, message=43, duration=0

2018-03-24 12:01:29,439 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.439633 INFO Event: name=WALinuxAgent, op=AutoUpdate, message=, duration=0

2018-03-24 12:01:29,440 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.440102 INFO Wire server endpoint:168.63.129.16

2018-03-24 12:01:29,441 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.440608 INFO Wire server endpoint:168.63.129.16
2018/03/24 12:01:29.441176 INFO Purging disk cache, current incarnation is 1

2018-03-24 12:01:29,494 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.494207 INFO Wire server endpoint:168.63.129.16

2018-03-24 12:01:29,537 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.537504 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 1, duration=43

swarm-manager000003$ docker node ls

ID                            HOSTNAME              STATUS              AVAILABILITY        MANAGER STATUS
isy1lqj6b1q *   swarm-manager000003   Ready               Active              Leader
jdmyr4b9ke     swarm-manager000005   Ready               Active              Reachable
hwovd2ti36     swarm-worker000000    Ready               Active
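
The node-local swarm state can also be checked directly on the instance itself, without going through a manager (a quick sketch using docker info's template output; it should print "active" plus a node ID once the join has completed):

swarm-manager000005:~$ docker info --format 'state={{.Swarm.LocalNodeState}} node={{.Swarm.NodeID}}'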
FrenchBen commented 6 years ago

@djeeg what about logs from the docker4x/init-azure containers? What do those show in the instances that failed?
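
The same lookup pattern as your agent-azure command should work for the init container, e.g.:

$ docker logs $(sudo docker ps -a | grep "init-azure" | awk '{print $1}')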

djeeg commented 6 years ago

Ahh, so there should be an init-azure container; I did wonder.

From what I recall, the only container in docker ps -a was agent-azure. Possibly that suggests that on the nodes that fail to join the swarm, init-azure doesn't start.

Let me do some more scaling to see if I can confirm that behaviour is happening on these orphan nodes. The first scaled node I tried today has correctly joined. =/
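
(A quick way to check whether the init container is even present on a newly scaled node, using the same grep-over-ps approach as before:)

$ sudo docker ps -a | grep -E "init-azure|agent-azure"
# if only agent-azure shows up here, init-azure never started on this node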

#===================================
Complete Swarm setup
djeeg commented 6 years ago

Okay, I have got it to create these ghost nodes again; it seems to happen more reliably if I scale more than 1 VM instance at a time.

swarm-manager000003:$ docker node ls
ID                            HOSTNAME              STATUS              AVAILABILITY        MANAGER STATUS
cf8gr7497g9                           Unknown             Active
ywmszmlrdy4                           Unknown             Active
d9udot67sm8                           Unknown             Active
m48raneect7                           Unknown             Active
vya5kkdbuvs                           Unknown             Active
ckvmtzv708u     swarm-manager00000B   Ready               Active              Reachable
isy1lqj6b1r *   swarm-manager000003   Ready               Active              Leader
hwovd2ti36e     swarm-worker000000    Ready               Active

And here are the init logs of the failing VM instance:

Setup Swarm Manager
   PRIVATE_IP=10.0.0.5
   LEADER_IP=10.0.0.9
   join as a swarm Manager
   Joining as Swarm Manager
   LEADER_IP=10.0.0.9
   MANAGER_TOKEN=SWMTKN-1-1bbbvs0ym2996yf479ejevc<removed>
Error response from daemon: Timeout was reached before node joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node.
SWARM_ID:
NODE:
Can't connect to leader, sleep and try again
Get Leader IP from Azure Table
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    85  100    85    0     0     85      0  0:00:01 --:--:--  0:00:01   261
MANAGER_TOKEN=SWMTKN-1-1bbbvs0ym2996yf479ejevc<removed>
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
SWARM_ID:
NODE:
Can't connect to leader, sleep and try again
Get Leader IP from Azure Table
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    85  100    85    0     0     85      0  0:00:01 --:--:--  0:00:01   302
MANAGER_TOKEN=SWMTKN-1-1bbbvs0ym2996yf479ejevc<removed>
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
SWARM_ID:
NODE:
Can't connect to leader, sleep and try again
Get Leader IP from Azure Table
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    85  100    85    0     0     85      0  0:00:01 --:--:--  0:00:01   175
MANAGER_TOKEN=SWMTKN-1-1bbbvs0ym2996yf479ejevc<removed>
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
SWARM_ID:
NODE:
Can't connect to leader, sleep and try again
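
In the meantime the only cleanup I can think of is manual (a rough sketch with standard swarm commands, not something from the Docker for Azure docs, so treat it as untested): remove the Down/Unknown ghost entries from a healthy manager, and on a stuck instance force it out of its half-joined state and re-join with the manager token.

# on a healthy manager: drop the ghost entries
$ docker node rm <ghost-node-id>            # add --force if the entry is not Down

# on the stuck instance: back out of the half-joined state and re-join the leader (10.0.0.9, default port 2377)
$ sudo docker swarm leave --force
$ sudo docker swarm join --token <manager-token> 10.0.0.9:2377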