docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/

Intermittently not accepting connections in docker swarm #1231

Open JeremyHutchings opened 3 years ago

JeremyHutchings commented 3 years ago

Expected behaviour

Services in the swarm should always be able to accept routed requests and connections.

Actual behaviour

Intermittently, services on nodes within the swarm will not receive requests, and connections will time out.

Steps to reproduce the behaviour

As per:

Without any logged errors, a service on a node will simply stop accepting routed internal requests and the node has to be drained. Restoring the service to that node doesn't help, so the pool of resources available in the docker swarm keeps shrinking.
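For reference, the drain/restore cycle described above uses the standard `docker node` commands; `worker-1` is a placeholder for the affected node's hostname:

```shell
# Stop routing new tasks to the affected node; its running tasks
# are rescheduled onto the remaining nodes
docker node update --availability drain worker-1

# Confirm the node's scheduling availability
docker node ls --format '{{.Hostname}}: {{.Availability}}'

# Later, return the node to the scheduling pool
# (per the report above, this does not restore routed connectivity)
docker node update --availability active worker-1
```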

Output of docker version:

Docker version 19.03.8, build afacb8b7f0

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 51
  Running: 31
  Paused: 0
  Stopped: 20
 Images: 362
 Server Version: 19.03.8
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: 3y7l7c5dl33wmcabojcn470s3
  Is Manager: true
  ClusterID: r4gu75a9dus7zzxvwpdh1zjll
  Managers: 5
  Nodes: 6
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.0.3.11
  Manager Addresses:
   10.0.1.11:2377
   10.0.1.12:2377
   10.0.2.11:2377
   10.0.3.11:2377
   10.0.3.12:2377
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-29-generic

Additional environment details (AWS, VirtualBox, physical, etc.)

Physical machines running Ubuntu 20.04.1 LTS

jdeluyck commented 2 years ago

I'm seeing this on a 3-node swarm (3 managers), all VMs. Intermittently all connectivity to one node drops and no more ingress traffic is possible.

Running Debian Bullseye.
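When ingress to one node drops like this, a quick sanity check is whether the swarm networking ports are still reachable between nodes. This is only a diagnostic sketch; the peer address is a placeholder taken from the `docker info` output below, and the port list is the standard swarm requirement (2377/tcp management, 7946/tcp+udp gossip, 4789/udp VXLAN data path — matching `Data Path Port: 4789` reported here):

```shell
NODE=192.168.34.47   # placeholder: address of a peer node

# TCP ports can be probed directly
nc -z -w 2 "$NODE" 2377 && echo "2377/tcp reachable"
nc -z -w 2 "$NODE" 7946 && echo "7946/tcp reachable"

# UDP is connectionless, so instead capture VXLAN traffic on this node
# while generating cross-node service traffic from another shell
tcpdump -ni any udp port 4789 -c 5
```

If gossip (7946) or the VXLAN data path (4789) is blocked or silently dropped between two nodes, the overlay/ingress network fails in exactly this intermittent, per-node way without errors in the daemon log.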

$ docker version 
Client:
 Version:           20.10.5+dfsg1
 API version:       1.41
 Go version:        go1.15.9
 Git commit:        55c4c88
 Built:             Wed Aug  4 19:55:57 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.5+dfsg1
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.15.9
  Git commit:       363e9a8
  Built:            Wed Aug  4 19:55:57 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.5~ds1
  GitCommit:        1.4.5~ds1-2+deb11u1
 runc:
  Version:          1.0.0~rc93+ds1
  GitCommit:        1.0.0~rc93+ds1-5+b2
 docker-init:
  Version:          0.19.0
  GitCommit:        
$ docker info
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 8
  Running: 7
  Paused: 0
  Stopped: 1
 Images: 24
 Server Version: 20.10.5+dfsg1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: ifg4shc3fhlegb83nk6gjtoc5
  Is Manager: true
  ClusterID: pskiot9vwjp10zazx0jumybmr
  Managers: 3
  Nodes: 3
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 192.168.34.46
  Manager Addresses:
   192.168.34.46:2377
   192.168.34.47:2377
   192.168.34.48:2377
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1.4.5~ds1-2+deb11u1
 runc version: 1.0.0~rc93+ds1-5+b2
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.10.0-9-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 2.849GiB
 Name: mediabox
 ID: AFL7:KR2U:SOTJ:YWLM:MI5G:Z4L6:2IEF:GSLX:C2QN:BMSU:MNYI:56RR
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: jdeluyck
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false