Open StefanScherer opened 7 years ago
I can see that the Linux swarm manager is connected to the Windows swarm manager:
PS C:\Users\stefan> docker node ls
Error response from daemon: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
PS C:\Users\stefan> netstat -an | sls 237
TCP 0.0.0.0:2376 0.0.0.0:0 LISTENING
TCP 0.0.0.0:2377 0.0.0.0:0 LISTENING
TCP 10.0.2.6:58143 10.0.2.5:2377 ESTABLISHED
TCP [::]:2376 [::]:0 LISTENING
TCP [::]:2377 [::]:0 LISTENING
ping @aaronlehmann @aluzzardi @tiborvass PTAL
For reference, here are the docker version and docker info outputs of the Windows node:
PS C:\Users\stefan> docker version
Client:
Version: 17.06.0-ce-rc4
API version: 1.30
Go version: go1.8.3
Git commit: 29fcd5d
Built: Thu Jun 15 17:27:29 2017
OS/Arch: windows/amd64
Server:
Version: 17.06.0-ce-rc4
API version: 1.30 (minimum version 1.24)
Go version: go1.8.3
Git commit: 29fcd5d
Built: Thu Jun 15 17:39:44 2017
OS/Arch: windows/amd64
Experimental: true
PS C:\Users\stefan> docker info
Containers: 37
Running: 0
Paused: 0
Stopped: 37
Images: 32
Server Version: 17.06.0-ce-rc4
Storage Driver: windowsfilter
Windows:
Logging Driver: json-file
Plugins:
Volume: local
Network: l2bridge l2tunnel nat null overlay transparent
Log: awslogs etwlogs fluentd json-file logentries splunk syslog
Swarm: active
NodeID: luscn1z6tbh7ge814wttszqom
Error: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
Is Manager: true
Node Address: 10.0.2.6
Manager Addresses:
10.0.2.5:2377
10.0.2.6:2377
Default Isolation: process
Kernel Version: 10.0 14393 (14393.1198.amd64fre.rs1_release_sec.170427-1353)
Operating System: Windows Server 2016 Datacenter
OSType: windows
Architecture: x86_64
CPUs: 2
Total Memory: 7GiB
Name: win-01
ID: WBV4:ATMJ:BL3V:GE3W:EFOP:6CKE:XB3N:N4JL:QZS4:CBVX:7AHJ:VWZD
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
PS C:\Users\stefan>
Daemon logs would be helpful here.
My suspicion is that it's not actually a platform issue, but might be more of a connectivity problem between the two managers. Is there a NAT involved?
I'm successfully running an rc4 linux/windows swarm (1 linux manager, 1 windows worker). This is on Azure using this template: https://github.com/friism/azure-test
Joining a Windows worker is no problem. I'll look up the logs from both nodes, but I have to reconfigure Windows first, as they are buried in the event log.
@aaronlehmann I have started both docker engines manually with the -D option, created a swarm with lin-01 as manager, and tried to join win-01 as a manager node. Both logs are in this gist https://gist.github.com/StefanScherer/79cbf263a2060bd6ebf42f0042444f88
Both machines are in the same subnet, both running in Azure.
It seems like the nodes lose the ability to communicate with each other.
lin-01 gets timeouts when trying to communicate with win-01:
DEBU[0052] member which sent vote request failed health check error="failed to check health: rpc error: code = 4 desc = context deadline exceeded" from=32a30125ba8d5893 method="(*Node).ProcessRaftMessage" raft_id=14a516450d5929e2
DEBU[0053] failed to send message MsgVote error="rpc error: code = 4 desc = context deadline exceeded" peer_id=32a30125ba8d5893
DEBU[0054] failed to send message MsgVote error="rpc error: code = 4 desc = context deadline exceeded" peer_id=32a30125ba8d5893
win-01 gets timeouts when trying to communicate with lin-01:
time="2017-06-16T21:21:46.442563700Z" level=debug msg="failed to send message MsgVote" error="rpc error: code = 4 desc = context deadline exceeded" peer_id=14a516450d5929e2
and later on, isn't able to reestablish a connection:
time="2017-06-16T21:21:58.763817100Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 10.0.2.5:2377: connectex: No connection could be made because the target machine actively refused it.\"; Reconnecting to {10.0.2.5:2377 <nil>}" module=grpc
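For context on why the failed MsgVote exchanges above leave the swarm without a leader: Raft elects a leader only with votes from a strict majority of managers, so with two managers both have to be reachable. A small shell sketch of that arithmetic (the quorum helper is a name made up for this illustration, not a docker command):

```shell
# Raft needs a strict majority of managers to elect a leader.
quorum() {
  # $1 = total number of managers in the swarm
  echo $(( $1 / 2 + 1 ))
}

# Show how many manager failures each swarm size can tolerate.
for n in 1 2 3 5; do
  q="$(quorum "$n")"
  echo "managers=$n quorum=$q tolerated_failures=$(( n - q ))"
done
```

With two managers the quorum is two, so a second manager that has joined the Raft group but cannot exchange votes takes the whole swarm down, which matches the docker node ls error reported here.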
What happens if you try to telnet to 10.0.2.5 port 2377 from win-01?
I've done a fresh swarm init + swarm join and can run telnet 10.0.2.6 2377 from lin-01 to win-01 as well as telnet 10.0.2.5 2377 from win-01 to lin-01.
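A scripted variant of the telnet check, in case telnet is not installed. This is a rough sketch, not something from the thread: port_open is a hypothetical helper, and it assumes bash and the coreutils timeout command are available on the Linux node. It only tests TCP, so it says nothing about the UDP side of ports 7946 and 4789.

```shell
# port_open HOST PORT
# Succeeds only if a TCP connection to HOST:PORT opens within 3 seconds.
# Uses bash's /dev/tcp redirection; host and port are passed as arguments
# to the inner shell instead of being interpolated into the command string.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

From lin-01 this could be used as port_open 10.0.2.6 2377 && echo reachable, and repeated for TCP port 7946.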
Hm, that works for me too. I get the timeout, but it joins eventually:
PS C:\Users\docker> docker swarm join --token SWMTKN-1-<manager-token> 10.0.144.5:2377
Error response from daemon: Timeout was reached before node was joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node.
PS C:\Users\docker> docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
87oelpaxkdiwychr68w59vei9 ftest-mgr0 Ready Active Leader
cb4emf0w8l059fyp3vw5gv5g8 * ftest-wrk0 Ready Active Reachable
PS C:\Users\docker>
@StefanScherer can you try joining as a worker and then promoting it?
I'm guessing you already have the required ports open: https://docs.docker.com/engine/swarm/swarm-tutorial/#open-protocols-and-ports-between-the-hosts
(for the record, I don't think this should be a release blocker - if it ends up not working reliably in the release for some reason, we should document that managers need to be either all Windows or all Linux)
@friism yes all required ports open.
Joining as a worker works, promoting the win-01 node also works.
@aaronlehmann is it a clue that one can join-then-promote but not join as manager?
I'm not sure.
I tried the other way around: first run swarm init on the win-01 node, then join lin-01 as a manager node. This works.
I guess the problem with joining a Windows manager node has something to do with MAC spoofing or other network internals that Windows enables in swarm mode. When the Windows node enters swarm mode, all existing network connections to that node drop (for example, RDP freezes for some seconds and then refreshes, and an SSH/WinRM connection drops). When a Windows node joins as a worker, this network interruption happens while it becomes a worker, so Raft is not affected. Promoting it to a manager afterwards works because the network is stable again by then.
I just found out that joining win-01 as a manager node does work, but only after about 16 minutes. I've added the lin-01 logfile at https://gist.github.com/StefanScherer/79cbf263a2060bd6ebf42f0042444f88#file-dockerd-lin-01-long-log
After all the swarm manager nodes are available and ready I can see a log message every 30 seconds:
level=info msg="Node join event for win-01-3dfe7a99068a/10.0.2.5"
Is this normal?
Strange: I have added the following firewall exceptions on the win-01 node, but they seem to cause trouble when joining win-01 as a manager.
New-NetFirewallRule -Protocol TCP -LocalPort 2377 -Direction Inbound -Action Allow -DisplayName "Docker swarm-mode cluster management TCP"
New-NetFirewallRule -Protocol TCP -LocalPort 7946 -Direction Inbound -Action Allow -DisplayName "Docker swarm-mode node communication TCP"
New-NetFirewallRule -Protocol UDP -LocalPort 7946 -Direction Inbound -Action Allow -DisplayName "Docker swarm-mode node communication UDP"
New-NetFirewallRule -Protocol UDP -LocalPort 4789 -Direction Inbound -Action Allow -DisplayName "Docker swarm-mode overlay network UDP"
I have also combined the two port 7946 rules into a single one with "any" protocol, but the result is the same. When I turn off the firewall completely, joining the win-01 node as a manager works without a problem.
So the next thing I tried was to turn off the Windows firewall only for the internal network between the Linux and Windows swarm nodes.
In Azure my internal network is the interface with name "Ethernet 3".
Set-NetConnectionProfile -InterfaceAlias "Ethernet 3" -NetworkCategory Private
Set-NetFirewallProfile -Name Private -Enabled False
But when I then run docker swarm join with the manager token, it still doesn't work.
As it worked with Docker 17.03, I think only a small thing is missing, like resending a UDP packet or increasing a timeout.
Hi Stefan, I was wondering if you ever found a solution for the above? I'm seeing a similar issue on Windows Server 2016 1607 (with all updates) running Docker 17.06.2-ee-5 in AWS EC2. I opened up the ports in Windows Firewall as per the Docker install guidelines. The very second I run the 'docker swarm init' or 'docker swarm join' command, the Remote Desktop Connection is dropped and can never be reconnected (even after VM reboots).
Have you experienced this with Docker Swarm on Windows Server in a cloud provider and are you aware of any workarounds?
@guydavis I haven't used AWS with Windows Server 2016, so I don't know if that is a known issue there.
In Azure you can run docker swarm init: the RDP session or other network connections drop for a short time, but the RDP client reconnects after a few seconds and I can work with the swarm manager.
To join other Windows managers it is safer to just join them as workers and then promote them to managers.
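The join-as-worker-then-promote workaround could be scripted on the manager side roughly like this. It is a sketch under assumptions: promote_when_ready is a hypothetical helper name, the Windows node is assumed to have already joined with the worker token, and the retry count and 2 second pause are arbitrary. It waits until the node reports ready, so the promotion happens after the network interruption on the Windows side has passed.

```shell
# promote_when_ready NODE [TRIES]
# Run on an existing manager (lin-01 in this thread).
# Polls the node state and promotes the node once it reports "ready".
promote_when_ready() {
  node="$1"
  tries="${2:-30}"   # assumed default: up to 30 polls
  while [ "$tries" -gt 0 ]; do
    state="$(docker node inspect --format '{{ .Status.State }}' "$node" 2>/dev/null)"
    if [ "$state" = "ready" ]; then
      docker node promote "$node"
      return $?
    fi
    tries=$((tries - 1))
    # Pause between polls, but not after the final attempt.
    [ "$tries" -gt 0 ] && sleep 2
  done
  echo "node $node did not become ready" >&2
  return 1
}
```

For example, run promote_when_ready win-01 on lin-01, then docker node ls to confirm that win-01 shows up as Reachable in the MANAGER STATUS column.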
Expected behavior
Joining a Windows node as manager to a Linux swarm manager should work
Actual behavior
After I join a Windows node with the manager token to a Linux swarm running one manager and one worker, the swarm is broken and docker node ls no longer works on the Linux manager:
Information
WARNING: No swap limit support
stefan@lin-01:~$ docker version
Client:
Version: 17.06.0-ce-rc4
API version: 1.30
Go version: go1.8.3
Git commit: 29fcd5d
Built: Thu Jun 15 17:28:00 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.0-ce-rc4
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: 29fcd5d
Built: Thu Jun 15 17:25:54 2017
OS/Arch: linux/amd64
Experimental: false