17.06 RC4: docker swarm joining a Windows manager node does not work

StefanScherer commented 7 years ago

Expected behavior

Joining a Windows node as manager to a Linux swarm manager should work

Actual behavior

After I join a Windows node with the manager token to a Linux swarm running one manager and one worker, the swarm is broken. docker node ls no longer works on the Linux manager:

stefan@lin-01:~$ docker node ls
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
stefan@lin-01:~$ docker node ls
Error response from daemon: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
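
For context on the second error: swarm managers form a Raft group, and a leader can only be elected while a majority of managers can reach each other. The following is a minimal sketch of that arithmetic, not Docker's implementation; the function names are illustrative.

```python
def quorum(managers: int) -> int:
    """Smallest number of managers that forms a majority."""
    if managers < 1:
        raise ValueError("a swarm needs at least one manager")
    return managers // 2 + 1


def has_leader_quorum(managers: int, reachable: int) -> bool:
    """True if enough managers are reachable to elect a leader."""
    return reachable >= quorum(managers)


# One manager: it forms a majority on its own.
print(has_leader_quorum(managers=1, reachable=1))
# Two managers, only one reachable: no majority, so no leader.
print(has_leader_quorum(managers=2, reachable=1))
```

With two managers the quorum is two, so a second manager that has joined but cannot be reached leaves the swarm without a leader, which is consistent with the error shown above.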

Information

Linux distro, e.g. Ubuntu Xenial


stefan@lin-01:~$ docker info
Containers: 7
Running: 0
Paused: 0
Stopped: 7
Images: 2
Server Version: 17.06.0-ce-rc4
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 22
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: xk2zwsh6w2hs6wtfc48zllwbi
Error: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
Is Manager: true
Node Address: 10.0.2.5
Manager Addresses:
10.0.2.5:2377
10.0.2.6:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfb82a876ecc11b5ca0977d1733adbe58599088a
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-78-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 6.795GiB
Name: lin-01
ID: KUAV:IAU6:AMZH:JWLL:CAHT:KBS2:Z3IQ:EB74:SRQE:XF23:WKIH:QDJC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
provider=generic
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support


  - Docker CE version, can be found from output of `docker version`

stefan@lin-01:~$ docker version
Client:
 Version:      17.06.0-ce-rc4
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   29fcd5d
 Built:        Thu Jun 15 17:28:00 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce-rc4
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   29fcd5d
 Built:        Thu Jun 15 17:25:54 2017
 OS/Arch:      linux/amd64
 Experimental: false


  - A reproducible case if this is a bug, Dockerfiles FTW
  - Page URL if this is a docs issue or the name of a man page

### Steps to reproduce the behavior

  1. on lin-01 run `docker swarm init`
  2. on lin-02 run `docker swarm join` with worker token
  3. on lin-01 run `docker swarm join-token manager`
  4. on win-01 run `docker swarm join` with manager token
  5. on lin-01 run `docker node ls` -> it will hang

It works when I use 17.03 on the Linux swarm manager node and 17.06.0-ce-rc4 on the Windows node.
But it does not work when the Linux node also runs 17.06.0-ce-rc4.

StefanScherer commented 7 years ago

I can see that the Linux swarm manager is connected to the Windows swarm manager:

PS C:\Users\stefan> docker node ls
Error response from daemon: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
PS C:\Users\stefan> netstat -an | sls 237

  TCP    0.0.0.0:2376           0.0.0.0:0              LISTENING
  TCP    0.0.0.0:2377           0.0.0.0:0              LISTENING
  TCP    10.0.2.6:58143         10.0.2.5:2377          ESTABLISHED
  TCP    [::]:2376              [::]:0                 LISTENING
  TCP    [::]:2377              [::]:0                 LISTENING

thaJeztah commented 7 years ago

ping @aaronlehmann @aluzzardi @tiborvass PTAL

StefanScherer commented 7 years ago

For reference, here are docker info and docker version of the Windows node:

PS C:\Users\stefan> docker version
Client:
 Version:      17.06.0-ce-rc4
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   29fcd5d
 Built:        Thu Jun 15 17:27:29 2017
 OS/Arch:      windows/amd64

Server:
 Version:      17.06.0-ce-rc4
 API version:  1.30 (minimum version 1.24)
 Go version:   go1.8.3
 Git commit:   29fcd5d
 Built:        Thu Jun 15 17:39:44 2017
 OS/Arch:      windows/amd64
 Experimental: true
PS C:\Users\stefan> docker info
Containers: 37
 Running: 0
 Paused: 0
 Stopped: 37
Images: 32
Server Version: 17.06.0-ce-rc4
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: l2bridge l2tunnel nat null overlay transparent
 Log: awslogs etwlogs fluentd json-file logentries splunk syslog
Swarm: active
 NodeID: luscn1z6tbh7ge814wttszqom
 Error: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
 Is Manager: true
 Node Address: 10.0.2.6
 Manager Addresses:
  10.0.2.5:2377
  10.0.2.6:2377
Default Isolation: process
Kernel Version: 10.0 14393 (14393.1198.amd64fre.rs1_release_sec.170427-1353)
Operating System: Windows Server 2016 Datacenter
OSType: windows
Architecture: x86_64
CPUs: 2
Total Memory: 7GiB
Name: win-01
ID: WBV4:ATMJ:BL3V:GE3W:EFOP:6CKE:XB3N:N4JL:QZS4:CBVX:7AHJ:VWZD
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

PS C:\Users\stefan>

aaronlehmann commented 7 years ago

Daemon logs would be helpful here.

My suspicion is that it's not actually a platform issue, but might be more of a connectivity problem between the two managers. Is there a NAT involved?

friism commented 7 years ago

I'm successfully running an rc4 linux/windows swarm (1 linux manager, 1 windows worker). This is on Azure using this template: https://github.com/friism/azure-test

StefanScherer commented 7 years ago

Joining a Windows worker is no problem. I'll look up the logs from both nodes, but I have to reconfigure Windows first, as the logs are buried in the event log.

StefanScherer commented 7 years ago

@aaronlehmann I started both Docker engines manually with the -D option, created a swarm with lin-01 as manager, and tried to join win-01 as a manager node. Both logs are in this gist https://gist.github.com/StefanScherer/79cbf263a2060bd6ebf42f0042444f88

StefanScherer commented 7 years ago

Both machines are in the same subnet, both running in Azure.

aaronlehmann commented 7 years ago

It seems like the nodes lose the ability to communicate with each other.

lin-01 gets timeouts when trying to communicate with win-01:

DEBU[0052] member which sent vote request failed health check  error="failed to check health: rpc error: code = 4 desc = context deadline exceeded" from=32a30125ba8d5893 method="(*Node).ProcessRaftMessage" raft_id=14a516450d5929e2
DEBU[0053] failed to send message MsgVote                error="rpc error: code = 4 desc = context deadline exceeded" peer_id=32a30125ba8d5893
DEBU[0054] failed to send message MsgVote                error="rpc error: code = 4 desc = context deadline exceeded" peer_id=32a30125ba8d5893

win-01 gets timeouts when trying to communicate with lin-01:

time="2017-06-16T21:21:46.442563700Z" level=debug msg="failed to send message MsgVote" error="rpc error: code = 4 desc = context deadline exceeded" peer_id=14a516450d5929e2

and later on, isn't able to reestablish a connection:

time="2017-06-16T21:21:58.763817100Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 10.0.2.5:2377: connectex: No connection could be made because the target machine actively refused it.\"; Reconnecting to {10.0.2.5:2377 <nil>}" module=grpc

What happens if you try to telnet to 10.0.2.5 port 2377 from win-01?

StefanScherer commented 7 years ago

I've done a fresh swarm init + swarm join and can run telnet 10.0.2.6 2377 from lin-01 to win-01 as well as telnet 10.0.2.5 2377 from win-01 to lin-01.
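
The telnet check above can also be scripted. This is a minimal sketch using only the Python standard library; the helper name `can_connect` and the 3-second timeout are assumptions made for illustration. It tests TCP reachability only, so it says nothing about the UDP ports (7946 and 4789) that swarm mode also uses.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("10.0.2.5", 2377)` run on win-01 corresponds to the telnet from win-01 to lin-01. A successful check at one point in time does not rule out a connection drop while the node is joining.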

friism commented 7 years ago

Hm, that works for me too. I get the timeout, but it joins eventually:

PS C:\Users\docker> docker swarm join --token SWMTKN-1-<manager-token> 10.0.144.5:2377
Error response from daemon: Timeout was reached before node was joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node.
PS C:\Users\docker> docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
87oelpaxkdiwychr68w59vei9     ftest-mgr0          Ready               Active              Leader
cb4emf0w8l059fyp3vw5gv5g8 *   ftest-wrk0          Ready               Active              Reachable
PS C:\Users\docker>

@StefanScherer can you try joining as a worker and then promoting it?

I'm guessing you already have the required ports open: https://docs.docker.com/engine/swarm/swarm-tutorial/#open-protocols-and-ports-between-the-hosts

friism commented 7 years ago

(for the record, I don't think this should be a release blocker - if it ends up not working reliably in the release for some reason, we should document that managers need to be either all Windows or all Linux)

StefanScherer commented 7 years ago

@friism yes, all required ports are open.

Joining as a worker works, promoting the win-01 node also works.

friism commented 7 years ago

@aaronlehmann is it a clue that one can join-then-promote but not join as manager?

aaronlehmann commented 7 years ago

I'm not sure.

StefanScherer commented 7 years ago

Tried the other way: First run swarm init on win-01 node, then join lin-01 as manager node. This works.

I guess the problem with joining a Windows manager node has something to do with the MAC spoofing or other network internals that Windows enables in swarm mode. When the Windows node enters swarm mode, all existing network connections to that node drop (for example, RDP refreshes after freezing for some seconds, and an SSH/WinRM connection drops). When a Windows node joins as a worker, the network interruption happens while it becomes a worker, so there is no raft issue. Promoting it to a manager afterwards works because the network is working fine by then.

StefanScherer commented 7 years ago

Just found out that joining the win-01 manager node does work, but only after about 16 minutes. I've added the lin-01 logfile at https://gist.github.com/StefanScherer/79cbf263a2060bd6ebf42f0042444f88#file-dockerd-lin-01-long-log

After all the swarm manager nodes are available and ready I can see a log message every 30 seconds:

level=info msg="Node join event for win-01-3dfe7a99068a/10.0.2.5"

Is this normal?

StefanScherer commented 7 years ago

Strange: I have added the following firewall exceptions on the win-01 node, but they seem to cause trouble when joining win-01 as a manager.

New-NetFirewallRule -Protocol TCP -LocalPort 2377 -Direction Inbound -Action Allow -DisplayName "Docker swarm-mode cluster management TCP"
New-NetFirewallRule -Protocol TCP -LocalPort 7946 -Direction Inbound -Action Allow -DisplayName "Docker swarm-mode node communication TCP"
New-NetFirewallRule -Protocol UDP -LocalPort 7946 -Direction Inbound -Action Allow -DisplayName "Docker swarm-mode node communication UDP"
New-NetFirewallRule -Protocol UDP -LocalPort 4789 -Direction Inbound -Action Allow -DisplayName "Docker swarm-mode overlay network UDP"

I have also combined the two port 7946 rules into a single one with "any" protocol, but the result is still the same. When I turn off the firewall completely, joining the win-01 node as a manager works without a problem.

So next thing I tried is to turn off the Windows firewall for the internal network between the Linux and Windows swarm nodes.

In Azure my internal network is the interface with name "Ethernet 3".

Set-NetConnectionProfile -InterfaceAlias "Ethernet 3" -NetworkCategory Private
Set-NetFirewallProfile -Name Private -Enabled False

Even so, docker swarm join with the manager token still doesn't work.

Since it worked with Docker 17.03, I think only a small thing is missing, such as resending a UDP packet or increasing a timeout.

guydavis commented 7 years ago

Hi Stefan, I was wondering if you ever found a solution for the above? I'm seeing a similar issue on Windows Server 2016 1607 (with all updates) running Docker 17.06.2-ee-5 in AWS EC2. I opened the ports in Windows Firewall as per the Docker install guidelines. The very second I run the 'docker swarm init' or 'docker swarm join' command, the Remote Desktop connection is dropped and can never be re-established (even after VM reboots).

Have you experienced this with Docker Swarm on Windows Server in a cloud provider and are you aware of any workarounds?

StefanScherer commented 7 years ago

@guydavis I haven't used AWS with Windows Server 2016, so I don't know if that is a known issue there. In Azure you can do docker swarm init - the RDP session and other network connections drop for a short time, but the RDP client reconnects after a few seconds and I can work with the swarm manager.

To add other Windows managers, it is safer to join them as workers and then promote them to managers.
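
The join-then-promote sequence can be sketched as follows. The helper name is hypothetical and the token is a placeholder; the node name and address are the ones used earlier in this thread. The sketch only builds and prints the two docker CLI commands rather than running them.

```python
import shlex


def join_then_promote_commands(worker_token: str, manager_addr: str, node_name: str):
    """Build the docker CLI commands for the join-then-promote workaround.

    Returns a pair: the command to run on the new node, and the command to
    run afterwards on an existing manager.
    """
    join_cmd = ["docker", "swarm", "join", "--token", worker_token, manager_addr]
    promote_cmd = ["docker", "node", "promote", node_name]
    return join_cmd, promote_cmd


join_cmd, promote_cmd = join_then_promote_commands(
    worker_token="SWMTKN-1-<worker-token>",  # placeholder; see `docker swarm join-token worker`
    manager_addr="10.0.2.5:2377",            # lin-01 in this thread
    node_name="win-01",
)
print("on win-01:", shlex.join(join_cmd))
print("on lin-01:", shlex.join(promote_cmd))
```

Joining with the worker token first means the network interruption on the Windows node is over before it takes part in raft.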
