JeremyHutchings opened 3 years ago
I'm seeing this on a 3-node swarm (3 managers), all VMs. Intermittently, all connectivity to one node drops and no further ingress traffic is possible.
Running Debian Bullseye.
$ docker version
Client:
Version: 20.10.5+dfsg1
API version: 1.41
Go version: go1.15.9
Git commit: 55c4c88
Built: Wed Aug 4 19:55:57 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.5+dfsg1
API version: 1.41 (minimum version 1.12)
Go version: go1.15.9
Git commit: 363e9a8
Built: Wed Aug 4 19:55:57 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.5~ds1
GitCommit: 1.4.5~ds1-2+deb11u1
runc:
Version: 1.0.0~rc93+ds1
GitCommit: 1.0.0~rc93+ds1-5+b2
docker-init:
Version: 0.19.0
GitCommit:
$ docker info
Client:
Context: default
Debug Mode: false
Server:
Containers: 8
Running: 7
Paused: 0
Stopped: 1
Images: 24
Server Version: 20.10.5+dfsg1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: ifg4shc3fhlegb83nk6gjtoc5
Is Manager: true
ClusterID: pskiot9vwjp10zazx0jumybmr
Managers: 3
Nodes: 3
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Data Path Port: 4789
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 192.168.34.46
Manager Addresses:
192.168.34.46:2377
192.168.34.47:2377
192.168.34.48:2377
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 1.4.5~ds1-2+deb11u1
runc version: 1.0.0~rc93+ds1-5+b2
init version:
Security Options:
apparmor
seccomp
Profile: default
cgroupns
Kernel Version: 5.10.0-9-amd64
Operating System: Debian GNU/Linux 11 (bullseye)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 2.849GiB
Name: mediabox
ID: AFL7:KR2U:SOTJ:YWLM:MI5G:Z4L6:2IEF:GSLX:C2QN:BMSU:MNYI:56RR
Docker Root Dir: /var/lib/docker
Debug Mode: false
Username: jdeluyck
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Expected behaviour
That services in the swarm will always be able to accept routed requests and connections.
Actual behaviour
Intermittently, services on nodes within the swarm will not receive requests and will time out.
Steps to reproduce the behaviour
As per:
Without any logged errors, a service on a node will simply stop accepting routed internal requests and the node has to be drained. Restoring the service to that node doesn't help, so it's an ever-decreasing pool of resources that is available in the Docker swarm.
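Since the failure looks like the overlay/ingress data path silently dropping, one basic check is whether the swarm ports are still reachable between nodes. Below is a minimal Python sketch of that check; the node IPs are taken from the manager addresses in the docker info output above and are assumptions for any other environment. Note that for the UDP ports a probe can only confirm the packet left the local stack (UDP gives no delivery guarantee), so a capture such as tcpdump on the receiving node is needed to confirm arrival.

```python
import socket

# Node addresses assumed from the report's "Manager Addresses" list.
NODES = ["192.168.34.46", "192.168.34.47", "192.168.34.48"]

# Swarm uses 2377/tcp (cluster management), 7946/tcp+udp (gossip) and
# 4789/udp (VXLAN data path, matching "Data Path Port: 4789" above).
PORTS = [(2377, "tcp"), (7946, "tcp"), (7946, "udp"), (4789, "udp")]

def tcp_port_open(host, port, timeout=1.0):
    """True if a TCP connect to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def udp_send_ok(host, port, timeout=1.0):
    """Best-effort UDP probe: only confirms the datagram was handed to
    the local network stack, not that the remote side received it."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            s.sendto(b"probe", (host, port))
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host in NODES:
        for port, proto in PORTS:
            ok = tcp_port_open(host, port) if proto == "tcp" else udp_send_ok(host, port)
            print(f"{host}:{port}/{proto} -> {'ok' if ok else 'FAILED'}")
```

Running this from each node against the others when the problem occurs would show whether the gossip/VXLAN ports are still open, or whether the affected node has dropped off the data path entirely.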
Output of docker version: see above.
Output of docker info: see above.
Additional environment details (AWS, VirtualBox, physical, etc.)
Physical machines running Ubuntu 20.04.1 LTS