docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/

Unable to recover cluster with "swarm init --force-new-cluster" #79

Open tdterry opened 7 years ago

tdterry commented 7 years ago

Expected behavior

I was trying to move nodes from one swarm cluster to another, and I ended up with a broken manager quorum. To fix this, I did swarm init --force-new-cluster on one of the managers. I expected this to create a new cluster with the existing database that the other nodes could then rejoin.
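
For context, the recovery flow I was expecting looks roughly like this (a sketch based on the swarm disaster-recovery docs; host names and the --advertise-addr value are placeholders):

$ docker -H node1:port swarm init --force-new-cluster --advertise-addr node1:2377
# node1 becomes the single manager of a new cluster that keeps the old swarm state
$ docker -H node1:port swarm join-token manager
# prints a fresh manager join command
$ docker -H node2:port swarm join --token <manager-token> node1:2377
# re-add the remaining managers one at a time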

Actual behavior

When I executed the init, I got an error. After that, the swarm is completely broken.

$ docker -H host:port swarm init --force-new-cluster
Error response from daemon: context deadline exceeded

$ docker -H host:port node ls
Error response from daemon: rpc error: code = 14 desc = grpc: the connection is unavailable

Steps to reproduce the behavior

This is all rather complicated, and I am not entirely sure how it happened. I tried to reproduce it with a smaller test using fresh swarms and only a single node in each, but I wasn't able to trigger the error. Below are my original steps.

I started with two swarms, and I was trying to merge them. Node names have been shortened for readability.

Swarm A has 3 nodes, all managers (node1, node2, node3).

Swarm B has 5 nodes. node4 is a manager, node5 to node8 are workers. My plan was to join nodes 1-3 to the second swarm.
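
The per-node sequence I had in mind was roughly the following (a sketch; the join token is redacted, as elsewhere in this report):

$ docker -H node1:port node demote node2                        # demote on Swarm A first
$ docker -H node2:port swarm leave                              # leave Swarm A as a worker
$ docker -H node2:port swarm join --token REDACTED node4:2377   # then join Swarm B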

I attempted to join node2 to Swarm B, but I forgot to demote and remove it from Swarm A first.

$ docker -H node2:port swarm join --token REDACTED node4:2377
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.

$ docker -H node2:port swarm leave
Error response from daemon: You are attempting to leave the swarm on a node that is participating as a manager. The only way to restore a swarm that has lost consensus is to reinitialize it with `--force-new-cluster`. Use `--force` to suppress this message.

$ docker -H node1:port node demote node2
# I didn't save this message, but the command succeeded

$ docker -H node2:port swarm leave
Node left the swarm.

$ docker -H node2:port swarm join --token REDACTED node4:2377
Error response from daemon: Timeout was reached before node was joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node.

At this point, Swarm B shows two managers, but one is unreachable.

$ docker -H node4:port node ls
ID                            HOSTNAME   STATUS              AVAILABILITY        MANAGER STATUS
35es00emgn50nto56224cmsnl     node5      Ready               Active
9ev2v9qpbmrkbd5t5vt9acgok                Unknown             Active              Unreachable
hshl2ko07e6vmyk7rxco2xt6u *   node4      Ready               Active              Leader
pvab5mjjjtsq3qws6sb228e9m     node6      Ready               Active
xjri4xj715y9093jbo5b6m9pa     node8      Ready               Active
y7wdkp6b0n0rzz190mxx0n57q     node7      Ready               Active

Presumably, 9ev2v9qpbmrkbd5t5vt9acgok is the failed node2. I tried a few more join commands on node2, restarted it, etc. Nothing changed.
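
A cleanup I considered but did not get to try (a sketch only; with one of two managers already unreachable, the demote itself may fail for lack of quorum) was to drop the stale entry from the leader:

$ docker -H node4:port node demote 9ev2v9qpbmrkbd5t5vt9acgok
# stale managers usually have to be demoted before they can be removed
$ docker -H node4:port node rm --force 9ev2v9qpbmrkbd5t5vt9acgok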

Swarm B is broken because it is missing one of two managers, so I tried to recover by reinitializing the swarm, and that failed.

$ docker -H node4:port swarm init --force-new-cluster
Error response from daemon: context deadline exceeded

And now Swarm B is more broken, because it can't even list nodes anymore.

$ docker -H node4:port node ls
Error response from daemon: rpc error: code = 14 desc = grpc: the connection is unavailable
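
If there is a way back from this state, I assume it is something like the documented backup-and-retry path (a sketch only; the swarm directory follows the Docker Root Dir reported in the info output below, and I have not verified that this clears the "context deadline exceeded" error):

$ systemctl stop docker                                 # on node4
$ cp -a /local/docker/swarm /local/docker/swarm.bak     # back up the raft state before touching anything
$ systemctl start docker
$ docker -H node4:port swarm init --force-new-cluster   # retry once the daemon is back up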

Output of docker version:

Node 4 (Swarm B)

$ docker -H node4:port version
Client:
 Version:      17.06.0-ce
 API version:  1.27 (downgraded from 1.30)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:31:53 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:14:09 2017
 OS/Arch:      linux/amd64
 Experimental: false

Node 2 (Swarm A)

$ docker -H node2:port version
Client:
 Version:      17.06.0-ce
 API version:  1.27 (downgraded from 1.30)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:31:53 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:14:09 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Node 4 (Swarm B) Note: swarm still thinks there are two managers. The IP addresses are node4 and node2.

$ docker -H node4:port info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 15
Server Version: 17.03.1-ce
Storage Driver: aufs
 Root Dir: /local/docker/aufs
 Backing Filesystem: extfs
 Dirs: 96
 Dirperm1 Supported: true
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log:
Swarm: active
 NodeID: hshl2ko07e6vmyk7rxco2xt6u
 Error: rpc error: code = 14 desc = grpc: the connection is unavailable
 Is Manager: true
 Node Address: 10.29.22.204
 Manager Addresses:
  10.29.0.31:2377
  10.29.22.204:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-21-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.795GiB
Name: dev-reapp104.ihrcloud.net
ID: XBRI:7BL3:HWNH:SBO7:KM2C:HAQC:HLIE:A5FE:NFMG:GP2F:NM3W:BGQP
Docker Root Dir: /local/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 445
 Goroutines: 1166
 System Time: 2017-08-10T13:03:06.016913002-04:00
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 dev-radioedit-registry.ihrcloud.net:5000
 radioedit-registry.ihrprod.net:5000
 127.0.0.0/8
Live Restore Enabled: false

Node 2 (Swarm A)

$ docker -H node2:port info
Containers: 8
 Running: 0
 Paused: 0
 Stopped: 8
Images: 12
Server Version: 17.03.1-ce
Storage Driver: aufs
 Root Dir: /local/docker/aufs
 Backing Filesystem: extfs
 Dirs: 126
 Dirperm1 Supported: true
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: pending
 NodeID: 9ev2v9qpbmrkbd5t5vt9acgok
 Is Manager: true
 ClusterID: ml35o1ubjtq475m0p6lljl4fi
 Managers: 2
 Nodes: 6
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.29.0.31
 Manager Addresses:
  0.0.0.0:2377
  10.29.22.204:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-79-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.795 GiB
Name: dev-reapp102.ihrcloud.net
ID: HEIS:ZVMD:MY7J:UAWP:6QRW:WUYD:F364:W4C6:POBK:TQIM:OOB6:5LI6
Docker Root Dir: /local/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 28
 Goroutines: 85
 System Time: 2017-08-10T11:30:49.533356982-04:00
 EventsListeners: 0
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
Experimental: false
Insecure Registries:
 dev-radioedit-registry.ihrcloud.net:5000
 radioedit-registry.ihrprod.net:5000
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.): My local machine is a Mac running Docker CE 17.06.0-ce. My remote hosts are EC2 instances running Docker CE 17.03.1-ce.

zeh235 commented 2 years ago

I also have this problem: docker swarm init --force-new-cluster throws an error and leaves the cluster in an unusable state.

encryptblockr commented 2 years ago

This is why Docker Swarm is never to be taken seriously. If you use it in production, you have yourself to blame!!!

andersonphiri commented 1 year ago

For a single-node cluster: after restarting the node, everything is broken, and docker swarm init --force-new-cluster throws an error saying the address is in use.
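
What I would try here (a sketch only; it assumes nothing else owns port 2377, and ss/systemctl availability depends on the distro):

$ ss -tlnp | grep 2377          # check what is still bound to the swarm port
$ systemctl restart docker      # restart the daemon so it releases the old listener
$ docker swarm init --force-new-cluster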