djeeg opened this issue 6 years ago
If I now scale up to 4 managers, at least I have 3 of them:
$ docker node ls
ID             HOSTNAME              STATUS   AVAILABILITY   MANAGER STATUS
3r2bep02jw                           Down     Active
bcksfpcam2                           Down     Active
dgcbalta3i                           Down     Active
vp0r6wj5m0                           Down     Active
lfxw2f00b5                           Down     Active
n4ox0f13dt     swarm-manager000000   Ready    Active         Reachable
i6kcjzg5wt *   swarm-manager000002   Ready    Drain          Leader
isy1lqj6b1     swarm-manager000003   Ready    Active         Reachable
hwovd2ti36     swarm-worker000000    Ready    Active
lkym8b4dz0     swarm-worker000001    Ready    Active
Possibly there is a restriction on how many nodes can be scaled at the same time, though I don't see any mention of one in the docs: https://docs.docker.com/docker-for-azure/why/
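(For reference, the scale-up itself is just a VMSS capacity change; from the Azure CLI that would be something along the lines of the following, with placeholder resource group and scale set names:)
$ az vmss scale --resource-group <resource-group> --name <manager-scale-set> --new-capacity 4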
For a reliable cluster to exist, the ideal number of managers is 3. When you scale up, can you SSH into the machines and see what the init container log output is? It will hint at what's happening and why the machines aren't joining.
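For anyone following along, a rough way to do that check (a sketch, assuming the standard Docker for Azure layout where SSH to a manager goes through the load balancer's public IP, e.g. port 50000 for the first manager, as user docker) is to list the system containers and dump the init container's logs:
$ ssh -p 50000 docker@<lb-public-ip>
swarm-manager000000:~$ sudo docker ps -a
swarm-manager000000:~$ docker logs $(sudo docker ps -a | grep "init-azure" | awk '{print $1}')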
Tried to scale another manager today, and the new instance didn't join the swarm.
swarm-manager000004:~$ docker logs $(sudo docker ps -a | grep "agent-azure" | awk '{print $1}')
The only error I could see:
2018-03-24 11:33:06,951 CRIT Set uid to user 0
2018-03-24 11:33:06,953 INFO supervisord started with pid 2154
2018-03-24 11:33:07,955 INFO spawned: 'walinuxagent' with pid 2182
2018-03-24 11:33:07,958 INFO spawned: 'sshd' with pid 2183
2018-03-24 11:33:08,039 DEBG 'walinuxagent' stdout output:
2018/03/24 11:33:08.039162 INFO Azure Linux Agent Version:2.2.0
2018-03-24 11:33:08,220 DEBG 'walinuxagent' stdout output:
2018/03/24 11:33:08.039883 INFO OS: alpine 3.5.0
2018/03/24 11:33:08.039971 INFO Python: 2.7.13
2018/03/24 11:33:08.040162 INFO Run daemon
2018/03/24 11:33:08.040506 INFO No RDMA handler exists for distro='' version='3.5.0'
2018/03/24 11:33:08.040640 INFO Clean protocol
2018/03/24 11:33:08.040772 INFO Running provisioning handler
2018/03/24 11:33:08.040838 INFO Copying ovf-env.xml
2018/03/24 11:33:08.041388 VERBOSE run cmd 'mount'
2018/03/24 11:33:08.044895 VERBOSE run cmd 'mount -o ro -t udf,iso9660 /dev/sr0 /mnt/cdrom/secure'
...................
2018-03-24 11:34:15,351 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.351412 ERROR Command: 'iptables --version'
2018-03-24 11:34:15,351 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.351591 ERROR Return code: 127
2018-03-24 11:34:15,352 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.351763 ERROR Result: /bin/sh: iptables: not found
2018-03-24 11:34:15,352 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.352025 WARNING Unable to determine version of iptables
2018/03/24 11:34:15.352160 WARNING Unable to retrieve firewall packets droppedUnable to determine version of iptables
Actually, probably ignore that previous error; I see the same error on instances that have correctly joined the swarm.
Comparing the last log entries of a failed instance to a successful one, the failed instance's logs just seem to stop after the first ProcessGoalState entry.
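(A quick way to pull just those tails on each node, a sketch reusing the same grep as above with an arbitrary --tail count:)
$ docker logs --tail 20 $(sudo docker ps -a | grep "agent-azure" | awk '{print $1}')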
Failed
2018-03-24 11:34:15,369 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.368919 INFO WALinuxAgent-2.2.25 running as process 2537
2018-03-24 11:34:15,369 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.369102 INFO Purging disk cache, current incarnation is 1
2018-03-24 11:34:15,371 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.370697 INFO Event: name=WALinuxAgent, op=Partition, message=37, duration=0
2018-03-24 11:34:15,372 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.372531 INFO Event: name=WALinuxAgent, op=AutoUpdate, message=, duration=0
2018-03-24 11:34:15,373 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.373235 INFO Wire server endpoint:xxx.xx.xxx.xx
2018-03-24 11:34:15,420 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.420741 INFO Wire server endpoint:xxx.xx.xxx.xx
2018-03-24 11:34:15,464 DEBG 'walinuxagent' stdout output:
2018/03/24 11:34:15.464574 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 1, duration=43
Successful
2018-03-19 05:12:45,618 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.618820 INFO WALinuxAgent-2.2.23 running as process 2574
2018-03-19 05:12:45,619 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.619260 INFO Event: name=WALinuxAgent, op=Partition, message=61, duration=0
2018-03-19 05:12:45,622 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.622386 INFO Event: name=WALinuxAgent, op=AutoUpdate, message=, duration=0
2018-03-19 05:12:45,623 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.622930 INFO Wire server endpoint:xxx.xx.xxx.xx
2018-03-19 05:12:45,623 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.623571 INFO Wire server endpoint:xxx.xx.xxx.xx
2018-03-19 05:12:45,624 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.624104 INFO Purging disk cache, current incarnation is 1
2018-03-19 05:12:45,670 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.669835 INFO Wire server endpoint:xxx.xx.xxx.xx
2018-03-19 05:12:45,715 DEBG 'walinuxagent' stdout output:
2018/03/19 05:12:45.715651 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 1, duration=45
2018-03-19 06:02:17,117 DEBG 'walinuxagent' stdout output:
2018/03/19 06:02:17.117465 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 2, duration=147
2018-03-19 06:24:51,720 DEBG 'walinuxagent' stdout output:
2018/03/19 06:24:51.720614 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 3, duration=171
2018-03-19 06:33:42,730 DEBG 'walinuxagent' stdout output:
2018/03/19 06:33:42.730429 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 4, duration=146
2018-03-19 06:34:17,439 DEBG 'walinuxagent' stdout output:
2018/03/19 06:34:17.438830 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 5, duration=181
2018-03-19 07:52:31,639 DEBG 'walinuxagent' stdout output:
2018/03/19 07:52:31.639793 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 6, duration=167
2018-03-19 08:41:11,751 DEBG 'walinuxagent' stdout output:
2018/03/19 08:41:11.751718 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 7, duration=105
2018-03-19 08:42:14,127 DEBG 'walinuxagent' stdout output:
2018/03/19 08:42:14.127048 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 8, duration=157
Though on another new scale set instance the logs also end at op=ProcessGoalState, message=Incarnation 1, yet this instance has joined the swarm.
swarm-manager000005:~$ docker logs $(sudo docker ps -a | grep "agent-azure" | awk '{print $1}')
2018-03-24 12:01:29,438 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.437905 INFO WALinuxAgent-2.2.25 running as process 2952
2018-03-24 12:01:29,438 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.438284 INFO Event: name=WALinuxAgent, op=Partition, message=43, duration=0
2018-03-24 12:01:29,439 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.439633 INFO Event: name=WALinuxAgent, op=AutoUpdate, message=, duration=0
2018-03-24 12:01:29,440 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.440102 INFO Wire server endpoint:168.63.129.16
2018-03-24 12:01:29,441 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.440608 INFO Wire server endpoint:168.63.129.16
2018/03/24 12:01:29.441176 INFO Purging disk cache, current incarnation is 1
2018-03-24 12:01:29,494 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.494207 INFO Wire server endpoint:168.63.129.16
2018-03-24 12:01:29,537 DEBG 'walinuxagent' stdout output:
2018/03/24 12:01:29.537504 INFO Event: name=WALinuxAgent, op=ProcessGoalState, message=Incarnation 1, duration=43
swarm-manager000003$ docker node ls
ID              HOSTNAME              STATUS   AVAILABILITY   MANAGER STATUS
isy1lqj6b1q *   swarm-manager000003   Ready    Active         Leader
jdmyr4b9ke      swarm-manager000005   Ready    Active         Reachable
hwovd2ti36      swarm-worker000000    Ready    Active
@djeeg what about logs from the docker4x/init-azure containers? What do those show in the instances that failed?
Ahh, so there should be an init-azure container; I did wonder. From what I recall, the only container in docker ps -a was agent-azure. Possibly that suggests that on the nodes that fail to join the swarm, init-azure doesn't start.
Let me do some more scaling to see if I can confirm that behaviour is happening on these orphan nodes. The first node I scaled today joined correctly. =/
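A quick way to check for that on a suspect node (a sketch, reusing the grep pattern from above) is to list both system containers with their status:
$ sudo docker ps -a --format '{{.Image}}\t{{.Names}}\t{{.Status}}' | grep -E 'init-azure|agent-azure'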
#===================================
Complete Swarm setup
Okay, I've got it to create these ghost nodes again; it seems that if I scale up by more than 1 VM instance at a time, it's more likely to happen.
swarm-manager000003:$ docker node ls
ID              HOSTNAME              STATUS    AVAILABILITY   MANAGER STATUS
cf8gr7497g9                           Unknown   Active
ywmszmlrdy4                           Unknown   Active
d9udot67sm8                           Unknown   Active
m48raneect7                           Unknown   Active
vya5kkdbuvs                           Unknown   Active
ckvmtzv708u     swarm-manager00000B   Ready     Active         Reachable
isy1lqj6b1r *   swarm-manager000003   Ready     Active         Leader
hwovd2ti36e     swarm-worker000000    Ready     Active
And here are the init logs from the failing VM instance:
Setup Swarm Manager
PRIVATE_IP=10.0.0.5
LEADER_IP=10.0.0.9
join as a swarm Manager
Joining as Swarm Manager
LEADER_IP=10.0.0.9
MANAGER_TOKEN=SWMTKN-1-1bbbvs0ym2996yf479ejevc<removed>
Error response from daemon: Timeout was reached before node joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node.
SWARM_ID:
NODE:
Can't connect to leader, sleep and try again
Get Leader IP from Azure Table
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 85 100 85 0 0 85 0 0:00:01 --:--:-- 0:00:01 261
MANAGER_TOKEN=SWMTKN-1-1bbbvs0ym2996yf479ejevc<removed>
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
SWARM_ID:
NODE:
Can't connect to leader, sleep and try again
Get Leader IP from Azure Table
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 85 100 85 0 0 85 0 0:00:01 --:--:-- 0:00:01 302
MANAGER_TOKEN=SWMTKN-1-1bbbvs0ym2996yf479ejevc<removed>
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
SWARM_ID:
NODE:
Can't connect to leader, sleep and try again
Get Leader IP from Azure Table
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 85 100 85 0 0 85 0 0:00:01 --:--:-- 0:00:01 175
MANAGER_TOKEN=SWMTKN-1-1bbbvs0ym2996yf479ejevc<removed>
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
SWARM_ID:
NODE:
Can't connect to leader, sleep and try again
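For what it's worth, a manual way out of that "already part of a swarm" loop (a sketch of a workaround, not something the init script does itself) would be to force the stuck instance out of its half-joined state and rejoin it against the leader on the standard swarm port, then remove the leftover ghost entries from a healthy manager (docker node rm, adding --force if an entry refuses to go):
stuck-instance:~$ docker swarm leave --force
stuck-instance:~$ docker swarm join --token <manager-token> 10.0.0.9:2377
swarm-manager000003:~$ docker node rm <ghost-node-id>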
Expected behavior
Swarm nodes match the VM instances in the Azure VMSS
Actual behavior
Missing nodes and ghost nodes
Information
Started with 3 nodes
Scaled the manager nodes in Azure from 1 to 3
Waited a while; the first new manager was detected
Waited a bit longer; something odd was going on
Waited an hour; the swarm had still not formed, and there were lots of extra nodes (put manager000002 in drain)
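(For reference, the drain mentioned above is just the standard availability change, something like the following, reverted with --availability active once the node is healthy again:)
swarm-manager000003:~$ docker node update --availability drain swarm-manager000002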