swarms created on 19.03 can not be accessed from machines on same network

jscottnz commented 4 years ago

[x] This is a bug report
[ ] This is a feature request
[x] I searched existing issues before opening this one

Expected behavior

A service published by the swarm should be accessible (curl 10.0.0.50:8080) from another machine on the same network.

Actual behavior

The service published by the swarm is not accessible (curl 10.0.0.50:8080) from another machine on the same network.

Steps to reproduce the behavior

This problem involves two machines: a Docker host (10.0.0.50) and any other machine on the 10.0.0.x network, e.g. a load balancer or jumphost.

On a fully updated and patched CentOS 7 VM on a cloud platform, follow the Docker installation guide for version 18.

Run nginx as a swarm service:

docker swarm init
docker service create --name nginx --publish published=8080,target=80 nginx

Test and note that nginx is accessible (curl 10.0.0.50:8080) from another host on the same network.

Upgrade docker to version 19.

Test and note that nginx is accessible (curl 10.0.0.50:8080) from another host on the same network.

Destroy the swarm:

docker swarm leave --force

Run nginx as a swarm service:

docker swarm init
docker service create --name nginx --publish published=8080,target=80 nginx

Test and note that nginx is NOT accessible (curl 10.0.0.50:8080) from another host on the same network.

This behaviour can also be reproduced with a fresh installation of version 19.

You can also uninstall docker-ce 19 and install 18. The swarm created in 19 is still not accessible. If you remove the swarm and create it (in version 18) it is accessible.
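When repeating the test from the second machine, it can help to record how curl fails, not only that it fails. The sketch below is an illustration, not part of the original report: `describe_curl_status` and `probe` are hypothetical helper names, and the 5 second timeout is an arbitrary choice. It relies only on curl's documented exit codes (7 for a failed connection, 28 for a timeout).

```shell
#!/bin/sh
# Hypothetical helper: map a curl exit status to a short diagnosis.
describe_curl_status() {
    case "$1" in
        0)  echo "reachable" ;;
        7)  echo "connection failed" ;;    # e.g. refused: nothing answered on the port
        28) echo "timed out" ;;            # e.g. packets silently dropped on the way
        *)  echo "other curl error ($1)" ;;
    esac
}

# Hypothetical helper: probe a URL and print the diagnosis.
# The 5 second limit is an assumption; adjust it for slow networks.
probe() {
    status=0
    curl --silent --output /dev/null --max-time 5 "$1" || status=$?
    describe_curl_status "$status"
}
```

From the jumphost, `probe http://10.0.0.50:8080` would then print one of the labels above; a timeout and a failed connection usually point at different causes, so the distinction is worth including in a report.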

Output of docker version:

Client: Docker Engine - Community
 Version:           19.03.11
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        42e35e61f3
 Built:             Mon Jun  1 09:13:48 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.11
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       42e35e61f3
  Built:            Mon Jun  1 09:12:26 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 19.03.11
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: ccngtgfpe0quhr0qhe9r9x1id
  Is Manager: true
  ClusterID: t3utljatelbhs3up8p783658j
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.0.0.50
  Manager Addresses:
   10.0.0.50:2377
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1062.9.1.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 7.638GiB
 Name: swarm4.novalocal
 ID: P4WR:IS7N:G47I:HEM4:XPHM:DM63:7SBP:7OM5:DHTA:GFS5:5H6K:H6OD
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: bridge-nf-call-ip6tables is disabled

Additional environment details (AWS, VirtualBox, physical, etc.)

Running in a data centre on VMs.

thaJeztah commented 4 years ago

10.0.0.3 looks to be the address of the container on the internal container-container network; are you able to access the container using the IP address of the host (10.0.0.50:8080)?

jscottnz commented 4 years ago

Hi there. Sorry, that was a copy and paste error. I've updated the issue. 10.0.0.50 is the docker host. 10.0.0.3 is another non-docker host on the 10.0.0.x network, such as a load balancer or jumphost.

(jumphost) 10.0.0.3 -> (docker host) 10.0.0.50:8080 -> (container) nginx:80

letal1609 commented 4 years ago

I have tried to create a service with a published port on all "19.03" docker minor versions. The service created is reachable on docker 19.03.04. But from 19.03.05, the service is not reachable.
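To check whether a node runs a version at or after the first one reported as broken here, a version comparison along these lines may help. This is a minimal sketch: `version_ge` and `at_or_after_19_03_5` are hypothetical names, the comparison assumes GNU `sort -V`, and it expects versions in the form the engine prints (for example 19.03.11, without a leading zero in the last field). The thread does not establish which later release, if any, fixes the problem, so the check has no upper bound.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Assumes GNU sort with -V (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Hypothetical helper: succeed if the given engine version is 19.03.5 or later,
# the first version reported as unreachable in this thread.
at_or_after_19_03_5() {
    version_ge "$1" "19.03.5"
}
```

For example, `at_or_after_19_03_5 "$(docker version --format '{{.Server.Version}}')" && echo "in the reported range"` prints the message on the 19.03.11 host from the original report.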

rwdim commented 4 years ago

@thaJeztah Any progress here? Destroying and rebuilding the swarm really isn't something I want to do, and the fact that this is a clear stopper and has not been assigned is making me feel like the internal expectation is that Mirantis will solve this... (not holding breath on that one)

I'm happy to jump into the code and see if I can find the issue if it's not looking like it will get solved soon..

R

rwdim commented 4 years ago

Found it... give me a few minutes to confirm the fix (it IS working, but not sure why the patch didn't fix the real issue)..

R

rwdim commented 4 years ago

This appears to be a result of https://github.com/docker/for-linux/issues/810, although the original was closed as "known"??? The workaround presented by @andrewhsu works, but leaves me scratching my head... as it should be part of the stack config when docker comes up (or shortly after)..

The original release note is as follows:

## 19.03.3 (2019-10-07)
### Known Issues
- `DOCKER-USER` iptables chain is missing [docker/for-linux#810](https://github.com/docker/for-linux/issues/810). Users cannot perform additional container network traffic filtering on top of this iptables chain. You are not affected by this issue if you are not customizing iptables chains on top of `DOCKER-USER`.
  Workaround is to insert the iptables chain after docker daemon starts.

  iptables -N DOCKER-USER
  iptables -I FORWARD -j DOCKER-USER
  iptables -A DOCKER-USER -j RETURN

If you run this as root on all your nodes, the issue is resolved, but it may expose you to other issues... The notes around it imply there may be a firewall bypass issue if you do this...

  iptables -N DOCKER-USER
  iptables -I FORWARD -j DOCKER-USER
  iptables -A DOCKER-USER -j RETURN
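Running the three commands a second time fails on `-N` (the chain already exists) and adds a duplicate FORWARD jump. A rerunnable variant might look like the sketch below. It is an illustration only: `ensure_docker_user` is a hypothetical function name, the `IPTABLES` variable is an assumption added so the binary can be swapped for a dry run, and the same caveat about possible firewall side effects applies.

```shell
#!/bin/sh
# Assumption: IPTABLES names the iptables binary; override it for a dry run.
IPTABLES=${IPTABLES:-iptables}

# Hypothetical helper: apply the release-note workaround without duplicating rules.
ensure_docker_user() {
    # -L fails when the chain is missing, so -N only runs in that case.
    "$IPTABLES" -L DOCKER-USER -n >/dev/null 2>&1 || "$IPTABLES" -N DOCKER-USER
    # -C checks whether an identical rule exists, so reruns add nothing.
    "$IPTABLES" -C FORWARD -j DOCKER-USER 2>/dev/null ||
        "$IPTABLES" -I FORWARD -j DOCKER-USER
    "$IPTABLES" -C DOCKER-USER -j RETURN 2>/dev/null ||
        "$IPTABLES" -A DOCKER-USER -j RETURN
}
```

Run `ensure_docker_user` as root after the docker daemon has started, as the release note suggests for the original commands.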

Maybe it's my eyes, but it looks like the fix (https://github.com/moby/libnetwork/pull/2464) done by @arkodg and PR'd was closed and never accepted?

I briefly reviewed the code in https://github.com/moby/libnetwork/pull/2470/commits/8cdd5a34cf0d31c3d0b18442ff7cd745386da612#diff-e30be89bfd41a0c219178028b9971a32 which appears to be an attempt to integrate the functionality of @andrewhsu's PR, but it appears to me (and I'm no Go expert) that the check falls short: it looks to see if the DOCKER-USER entry is there, but does not ensure the other two entries are also there... The first without the others doesn't solve the problem.

If this IS the case, any test of this code must check for both cases: DOCKER-USER not present, and DOCKER-USER present but misconfigured.
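The two cases can be told apart from the output of `iptables -S`. The following is a rough sketch, not the libnetwork check itself: `docker_user_state` is a hypothetical name, and it matches the three rules exactly as the workaround creates them, so rules carrying extra match options would be reported as misconfigured.

```shell
#!/bin/sh
# Hypothetical helper: read `iptables -S` output on stdin and print one of
# "missing", "misconfigured" or "ok" for the DOCKER-USER chain.
docker_user_state() {
    rules=$(cat)
    # Case 1: the chain itself was never created.
    if ! printf '%s\n' "$rules" | grep -qxF -e '-N DOCKER-USER'; then
        echo "missing"
        return
    fi
    # Case 2: the chain exists; it also needs the FORWARD jump and the RETURN rule.
    if printf '%s\n' "$rules" | grep -qxF -e '-A FORWARD -j DOCKER-USER' &&
       printf '%s\n' "$rules" | grep -qxF -e '-A DOCKER-USER -j RETURN'; then
        echo "ok"
    else
        echo "misconfigured"
    fi
}
```

As root, `iptables -S | docker_user_state` reports the state of the current node.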

Please... correct me if I'm wrong (and I probably am, since Docker's guts aren't my thing)...

R

rwdim commented 4 years ago

The results above ended up being pretty spotty... sometimes it works, sometimes it doesn't, and I couldn't find a common reason why it did or didn't work. Ultimately, it looks pretty arbitrary.

As an alternative, I did revert back to 19.03.04 and everything works perfectly as far as I can see..

Didn't have to leave the swarm or re-create it. Just ran this and everything seems to be working fine..

sudo apt remove -y docker-ce docker-ce-cli
sudo apt install -y docker-ce=5:19.03.4~3-0~ubuntu-bionic docker-ce-cli=5:19.03.4~3-0~ubuntu-bionic
sudo apt autoremove -y
sudo reboot

Since it looks like @thaJeztah merged the change (not certain) that fixed the DOCKER-USER issue in 19.03.04, perhaps it is worth reviewing subsequent builds to see if the changes pushed between 19.03.04 and 19.03.05-beta1 were removed or obviated for some reason.
