Open kylewuolle opened 6 years ago
One thing I've discovered via debugging is that a change introduced in this commit might be responsible: https://github.com/docker/libnetwork/commit/5008b0c26d107917fb74aaaefe2e8e938358cd10
If line 259 of controller.go is changed to simply be

```go
if provider == nil {
	return
}
```

then the problem goes away. This is because at some point the agent is stopped, and in the case of a swarm init with force new cluster it is never restarted. Maybe there could be some other way to prevent this race condition, such as checking whether the agent is really active? I will do more digging and add what I find.
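For context, this is approximately the code path in question, paraphrased from libnetwork's controller.go around the linked commit (exact field and method names are from memory and should be checked against the source):

```go
// (*controller).SetClusterProvider, approximately as of the linked commit.
func (c *controller) SetClusterProvider(provider cluster.Provider) {
	var sameProvider bool
	c.Lock()
	// The commit added this check so that setting the same provider twice
	// does not spawn a second event-listener goroutine.
	if c.cfg.Daemon.ClusterProvider == provider {
		sameProvider = true
	} else {
		c.cfg.Daemon.ClusterProvider = provider
	}
	c.Unlock()

	// After "swarm init --force-new-cluster" the provider is unchanged, so
	// sameProvider is true and we return here even though the agent was
	// stopped and needs to be restarted; dropping "|| sameProvider" is the
	// change described above.
	if provider == nil || sameProvider {
		return
	}
	c.AgentStopWait()
	go c.clusterAgentInit()
}
```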
The libnetwork fix was included in Docker 18.09.4 through https://github.com/docker/engine/pull/169; should this one be closed?
oh, sorry, it was not yet in 18.09; cherry-picking now
Hello everyone, I'm wondering if this issue is really resolved, as I seem to be facing the same kind of name resolution problem after issuing a "docker swarm init --force-new-cluster" on an "isolated" manager.
One big difference in my scenario is that I'm NOT deploying services through Swarm; I'm deploying containers through classic docker-compose and just making use of an overlay network managed by Swarm, onto which I attach the containers in docker-compose. Basically my setup is 2 nodes joined to the swarm, both with the manager role, and an overlay network created manually with the "--attachable" flag. Then on the 2 nodes I start some containers using a simple docker-compose deployment (no swarm/service deploy), but attach them to the overlay network I've created.
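For illustration, the setup is roughly the following (the network and container names match the session below; the image and commands are placeholders):

```
# On one of the managers, once:
docker network create --driver overlay --attachable my-overlay-network
```

with a docker-compose.yml on each node that attaches plain containers to that network by declaring it external:

```yaml
version: "3.4"
services:
  one-container:
    image: ubuntu:18.04      # placeholder image
    command: sleep infinity
    networks:
      - my-overlay-network
  another-container:
    image: ubuntu:18.04
    command: sleep infinity
    networks:
      - my-overlay-network
networks:
  my-overlay-network:
    external: true
```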
Things are working fine and the containers are able to communicate. But now, of the 2 manager nodes, let's say one fails. On the one that survives, all containers seem to still be running fine, even though docker swarm is in a "no quorum/isolated" state (error message "The swarm does not have a leader" in reply to swarm commands).
At this point I have to "docker swarm init --force-new-cluster" on the survivor, but as soon as I issue the command, I can see in the container logs that they become unable to resolve each other's names (I get "Name or service not known" errors). The DNS resolution seems to be broken permanently, even if the previously failed node gets restored and joins the swarm again; at this point the only solution is to restart the whole docker-compose stack on the survivor. The weird thing is that on the new node that joined in replacement of the previously failed one, things work fine.
Based on my tests, name resolution only works again once I restart the container I'm trying to resolve on the surviving node. It looks as if, on startup, the container somehow registers itself again on the "new" swarm overlay network that was recreated when I issued the "force-new-cluster" command.
Here's an example of the issue. Just after issuing the "force-new-cluster" command on the survivor, from within the surviving containers I can't resolve any of the other containers' names:
```
[root]# docker-compose exec -u root one-container bash
root@one-container:/# ping another-container
ping: another-container: Name or service not known
```
Now if I just restart "another-container":
```
[root]# docker-compose restart another-container
Restarting another-container ... done
```
From the first container, name resolution works again:
```
[root]# docker-compose exec -u root one-container bash
root@one-container:/# ping another-container
PING another-container (172.20.0.15) 56(84) bytes of data.
64 bytes from awq02-master-another-container.my-overlay-network (172.20.0.15): icmp_seq=1 ttl=64 time=0.034 ms
64 bytes from awq02-master-another-container.my-overlay-network (172.20.0.15): icmp_seq=2 ttl=64 time=0.046 ms
64 bytes from awq02-master-another-container.my-overlay-network (172.20.0.15): icmp_seq=3 ttl=64 time=0.034 ms
^C
--- another-container ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.034/0.038/0.046/0.005 ms
```
Any idea whether this issue could be related to the fact that I'm just attaching containers to the overlay network using docker-compose, rather than actually managing them through Swarm services?
Thanks for your time!
Expected behavior
After a --force-new-cluster and subsequently adding a new node to the cluster, the tasks.servicename endpoint should be resolved by the internal Docker DNS, and containers on the same overlay network should be able to reach each other.
Actual behavior
On the node on which --force-new-cluster was executed, the tasks.servicename endpoint does not resolve. On the newly added node, tasks.servicename does resolve, but only to the container on that one node. Also, containers on the same overlay network cannot reach each other by their IPs.
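Concretely, the asymmetry looks something like this (illustrative output, not a captured session; 127.0.0.11 is Docker's embedded DNS resolver, the task address is made up):

```
# Inside a task on the node where --force-new-cluster was run:
root@task-on-node1:/# nslookup tasks.demo
Server:    127.0.0.11
Address:   127.0.0.11#53
** server can't find tasks.demo: NXDOMAIN

# Inside a task on the newly added node: resolves, but only to the local task
root@task-on-node2:/# nslookup tasks.demo
Server:    127.0.0.11
Address:   127.0.0.11#53
Name:      tasks.demo
Address:   10.0.1.5
```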
Steps to reproduce the behavior
```dockerfile
FROM ubuntu:18.04    # base image assumed; the apt commands imply Debian/Ubuntu
RUN apt update
RUN apt install dnsutils -y
CMD /bin/bash -c "while true; do nslookup tasks.demo; sleep 2; done"
```
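A plausible command sequence around that image (a sketch: the service name demo is implied by tasks.demo above; node names, image tag, network name, and replica count are illustrative):

```
# Build the image above on both nodes (tag is illustrative)
docker build -t demo-image .

# node1: create the swarm; join node2 as a second manager
docker swarm init
docker swarm join-token manager      # run the printed join command on node2

# Create an overlay network and a two-replica service on it
docker network create -d overlay demo-net
docker service create --name demo --replicas 2 --network demo-net demo-image

# Stop/isolate node2, then recover the remaining manager on its own
docker swarm init --force-new-cluster     # on node1

# Join a fresh node; the nslookup loop in the service logs now fails to
# resolve tasks.demo on node1's task, as described above.
```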
Output of docker version:

```
Client:
 Version:       18.03.1-ce
 API version:   1.37
 Go version:    go1.9.5
 Git commit:    9ee9f40
 Built:         Wed Jun 20 21:43:51 2018
 OS/Arch:       linux/amd64
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Wed Jun 20 21:42:00 2018
  OS/Arch:      linux/amd64
  Experimental: false
```
Output of docker info:

```
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 8
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: itrdsuwlqi234atk1nwc8foha
 Is Manager: true
 ClusterID: ysq5qap98z4gbilfi4z3o60j3
 Managers: 2
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.138.0.16
 Manager Addresses:
  10.138.0.11:2377
  35.227.182.132:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-1024-gcp
Operating System: Ubuntu 18.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.607GiB
Name: instance-9
ID: TEPO:ELY7:EYOT:LPCS:OQ4B:DKKA:FK2U:XJ52:RXF7:7CGN:GEXO:YLAN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support
```