Expected behavior
Simple and easy swarm upgrades. ™
Actual behavior
Yesterday we successfully upgraded our production swarm from 17.12.0 to 17.12.1. We then attempted to upgrade the same swarm to 18.03.0. Early in this second migration we noticed that the newly created manager was unable to connect to the swarm, so we immediately halted and rolled back the migration.
Despite the rollback, our swarm was left in a state where new managers could not reach quorum. The existing managers then decided there were not enough managers participating and stopped responding to `docker node`, `docker service`, and similar commands. Any new manager brought online would consider itself the single manager of a brand-new swarm. Multiple `--force-new-cluster` attempts on the remaining managers that still retained remnants of service state were unsuccessful, and eventually our last manager with retained state lost it after a `--force-new-cluster` command. In some cases managers could not communicate at all and only saw themselves as single-instance swarms; in others, the IP of one manager would end up in another's manager list, yet the two still did not regard each other as reachable.
We noticed that most of the managers we brought online afterwards had empty `/var/lib/docker/swarm` directories.
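For reference, a minimal sketch of how the per-node state could be checked, assuming shell access to each host (these are standard Docker commands and paths, but treat the snippet as illustrative rather than the exact procedure we ran):

```sh
# Does this node believe it is (or was) a manager, and does it still have a
# functional control plane?
docker info --format 'state={{.Swarm.LocalNodeState}} manager={{.Swarm.ControlAvailable}} error={{.Swarm.Error}}'

# Is there any swarm/Raft state left on disk? An empty directory means the
# node has nothing local to recover from.
sudo ls -la /var/lib/docker/swarm/
sudo ls -la /var/lib/docker/swarm/raft/ 2>/dev/null
```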
Information
Note: we have not had these issues with our staging swarm, only with production. We had a previous issue with our production swarm, and it can't be ruled out that the two issues are related: https://github.com/docker/for-aws/issues/176
Response from manager node commands, after initiating our upgrade and after the rollback, once the upgraded manager had failed to connect. At this time there were 4 managers online: 3 that had been in working order before the upgrade, and 1 that was the newly spun-up manager from the upgrade:
docker service ls
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
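For context on that error: a swarm's Raft quorum requires a strict majority of its manager members, i.e. ⌊N/2⌋ + 1 of the N managers the cluster knows about (for example, 3 of 4, or 4 of 6). Since managers replaced during the upgrade and rollback may still have been counted in the member list, the 4 managers that were online may well have been fewer than a majority of what the cluster expected, which would be consistent with the error above.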
Example `docker info` on a manager node before losing state:
Containers: 19
Running: 19
Paused: 0
Stopped: 0
Images: 17
Server Version: 17.12.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: pvjkpiwuwo4pk13ap1oegmuw9
Error: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
Is Manager: true
Node Address: 172.31.2.115
Manager Addresses:
172.31.11.173:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.59-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785GiB
Name: ip-172-31-2-115.ec2.internal
ID: JSV7:GSSN:QRUE:SU7L:5VCQ:W2T7:N2K7:QKFZ:CHAP:XQ2M:NURT:SOBZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 237
Goroutines: 331
System Time: 2018-11-01T02:02:55.147615892Z
EventsListeners: 15
Registry: https://index.docker.io/v1/
Labels:
availability_zone=us-east-1a
instance_type=m4.large
node_type=manager
os=linux
region=us-east-1
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
`docker-diagnose` from the above instance:
OK hostname=ip-172-31-33-147-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-44-136-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-11-173-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-2-115-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-18-10-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-47-234-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-8-9-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-25-80-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-7-90-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-23-20-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-6-7-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-26-88-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
OK hostname=ip-172-31-40-143-ec2-internal session=1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
Done requesting diagnostics.
Your diagnostics session ID is 1541090209-Fn4XS5neJp8RWUefOqSIldrRWBz4Fjq5
Please provide this session ID to the maintainer debugging your issue.
Example `docker info` on a manager node, having lost state, after a subsequent `--force-new-cluster`:
`docker-diagnose` from the above instance:
OK hostname=ip-172-31-33-147-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-44-136-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-11-173-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-2-115-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-18-10-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-47-234-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-8-9-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-25-80-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-7-90-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-23-20-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-6-7-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-26-88-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
OK hostname=ip-172-31-40-143-ec2-internal session=1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
Done requesting diagnostics.
Your diagnostics session ID is 1541090273-8VenRkzE5xKUsaInKZjVOxwXtLNdIit0
Please provide this session ID to the maintainer debugging your issue.
We have a backup of `/var/lib/docker/swarm` from a manager with the desired service state, but without being able to access the host of one of our manager nodes, we have no way of restarting the Docker daemon and attempting that method of recovery.
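For completeness, the recovery we would attempt if we did have host access is the documented restore-from-backup flow, roughly as sketched below (the daemon stop/start mechanism differs on Docker for AWS / Moby Linux, and the paths and address are placeholders):

```sh
# 1. Stop the Docker daemon on the target node so the swarm directory is not
#    in use (e.g. `systemctl stop docker` on a systemd host; Moby Linux on
#    Docker for AWS manages the daemon differently).

# 2. Restore the backed-up swarm state over the node's current state.
rm -rf /var/lib/docker/swarm
cp -a /path/to/backup/swarm /var/lib/docker/swarm    # backup path is a placeholder

# 3. Start the daemon again, then re-initialize a single-manager cluster from
#    the restored state. --force-new-cluster keeps services and cluster state
#    but drops all other managers, which then have to be re-joined.
docker swarm init --force-new-cluster --advertise-addr <this-node-ip>

# 4. Verify, then re-join the remaining managers and workers with fresh tokens.
docker node ls
docker service ls
docker swarm join-token manager
```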
We would love any insight anyone has into what might have led to this. We're currently standing up a new production swarm by hand and reinstating EFS and EBS backups for our services; if anyone has ideas on how we could more easily save the existing swarm, we'd love to hear them.
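For what it's worth, the manual rebuild boils down to something like the following, assuming services are re-created from compose/stack files (file, stack, and address names below are placeholders):

```sh
# On the first manager of the replacement swarm:
docker swarm init --advertise-addr <manager-ip>

# Re-create each application stack from its backed-up compose file.
docker stack deploy --compose-file docker-compose.yml my_stack

# Confirm the tasks are converging.
docker stack services my_stack
docker service ls
```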