docker-archive / for-aws


Master nodes unable to communicate or reach quorum after version upgrade, resulting in swarm failure #178

Open mateodelnorte opened 5 years ago

Expected behavior

Simple and easy swarm upgrades. ™

Actual behavior

Yesterday we successfully upgraded our production swarm from 17.12.0 to 17.12.1. After doing so, we attempted to upgrade the same swarm to 18.03.0. Early in this second migration we noticed that the newly created manager was unable to connect to the swarm, so we immediately halted and rolled back the migration.

Despite the rollback, our swarm was left in a state where new managers could not reach quorum. The existing managers then decided there were not enough managers participating and stopped responding to docker node, docker service, and similar commands. Any new manager brought online would consider itself the single manager of a brand-new swarm. Multiple --force-new-cluster attempts on the remaining managers that still retained remnants of service state were unsuccessful, and eventually our last manager with retained state lost that state after a --force-new-cluster command. In some cases managers could not communicate at all and only saw themselves as single-instance swarms; in others, one manager's IP would appear in another manager's list of manager addresses, yet the two would still behave as if they were not communicating.

We noticed that most of the managers brought online after the rollback had empty /var/lib/docker/swarm directories.
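
For reference, the recovery attempts above followed the standard force-new-cluster procedure, roughly as sketched here (the address is illustrative; a manager that has held state keeps its raft data under /var/lib/docker/swarm/raft):

# check whether a manager actually has raft state to recover from
ls /var/lib/docker/swarm/raft
# on a manager that still has state, re-form a single-manager cluster from that state
docker swarm init --force-new-cluster --advertise-addr 172.31.2.115
# then generate a fresh join token and re-join the other managers
docker swarm join-token manager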

Information

Note: we've not had these issues with our staging swarm, only production. We had a previous issue with our production swarm, and it can't be ruled out that the two issues are related: https://github.com/docker/for-aws/issues/176

Below is the response from manager node commands after we initiated the upgrade and then rolled back, once the upgraded manager had failed to connect. At this point there were 4 managers online: 3 that had been in working order before the upgrade, and 1 that was the newly spun-up manager from the upgrade:

docker service ls
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
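
With 4 managers online, raft quorum requires a majority of 3 to be reachable. A quick way to see what each manager currently believes (a generic check, not output from our cluster) is to run the following on every manager:

# a manager with quorum can list nodes; one without returns the rpc error above
docker node ls
# the Swarm section of docker info shows the local error state and known manager addresses
docker info --format '{{json .Swarm}}'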

example docker info on manager node before losing state:

Containers: 19
 Running: 19
 Paused: 0
 Stopped: 0
Images: 17
Server Version: 17.12.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: pvjkpiwuwo4pk13ap1oegmuw9
 Error: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
 Is Manager: true
 Node Address: 172.31.2.115
 Manager Addresses:
  172.31.11.173:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.59-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785GiB
Name: ip-172-31-2-115.ec2.internal
ID: JSV7:GSSN:QRUE:SU7L:5VCQ:W2T7:N2K7:QKFZ:CHAP:XQ2M:NURT:SOBZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 237
 Goroutines: 331
 System Time: 2018-11-01T02:02:55.147615892Z
 EventsListeners: 15
Registry: https://index.docker.io/v1/
Labels:
 availability_zone=us-east-1a
 instance_type=m4.large
 node_type=manager
 os=linux
 region=us-east-1
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

example docker info after --force-new-cluster:

Containers: 11
 Running: 8
 Paused: 0
 Stopped: 3
Images: 17
Server Version: 17.12.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID:
 Is Manager: false
 Node Address: 172.31.2.115
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.59-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785GiB
Name: ip-172-31-2-115.ec2.internal
ID: JSV7:GSSN:QRUE:SU7L:5VCQ:W2T7:N2K7:QKFZ:CHAP:XQ2M:NURT:SOBZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 163
 Goroutines: 174
 System Time: 2018-11-01T04:53:18.286400483Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
 availability_zone=us-east-1a
 instance_type=m4.large
 node_type=manager
 os=linux
 region=us-east-1
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

docker-diagnose from the above instance:

docker-diagnose
OK hostname=ip-172-31-33-147-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-44-136-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-11-173-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-2-115-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-18-10-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-47-234-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-8-9-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-25-80-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-7-90-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-23-20-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-6-7-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-26-88-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-40-143-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
Done requesting diagnostics.
Your diagnostics session ID is 1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
Please provide this session ID to the maintainer debugging your issue.

example docker info on manager node, having lost state, after subsequent --force-new-cluster:

Welcome to Docker!
~ $ docker info
Containers: 7
 Running: 6
 Paused: 0
 Stopped: 1
Images: 6
Server Version: 17.12.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: jxxovyqfxjj5pwf6ntndg9eil
 Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
 Is Manager: true
 Node Address: 172.31.11.173
 Manager Addresses:
  172.31.11.173:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.59-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785GiB
Name: ip-172-31-11-173.ec2.internal
ID: U2VY:DYYQ:BPFN:6M2K:FUXI:A52Q:FNFF:53CT:54YS:P4WN:YSFJ:OQLE
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 137
 Goroutines: 214
 System Time: 2018-11-01T04:09:58.146358428Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
 node_type=manager
 os=linux
 region=us-east-1
 availability_zone=us-east-1a
 instance_type=m4.large
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

docker-diagnose from the above instance:

docker-diagnose
OK hostname=ip-172-31-33-147-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-44-136-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-11-173-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-2-115-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-18-10-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-47-234-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-8-9-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-25-80-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-7-90-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-23-20-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-6-7-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-26-88-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-40-143-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
Done requesting diagnostics.
Your diagnostics session ID is 1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
Please provide this session ID to the maintainer debugging your issue.

We have a backup of /var/lib/docker/swarm from a manager that had the desired service state, but since we cannot access the host of one of our manager nodes, we have no way of restarting the Docker daemon and attempting that method of recovery.
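
For context, the restore route we cannot currently take would look roughly like the following on the target manager host (the backup path is hypothetical, and the way the daemon is stopped and started will differ on the Moby-based Docker for AWS hosts):

# stop the daemon, swap in the backed-up raft state, then restart
systemctl stop docker                      # or the host's equivalent service command
rm -rf /var/lib/docker/swarm
cp -a /backup/swarm /var/lib/docker/swarm  # /backup/swarm is our saved copy (hypothetical path)
systemctl start docker
# re-form the cluster from the restored state as a single manager
docker swarm init --force-new-cluster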

We would love any insight into what might have led to this. We're currently standing up a new production swarm by hand and reinstating EFS and EBS backups for our services. If anyone has insight into how we could more easily save the existing swarm, we'd love to hear it.