docker-archive / classicswarm

Swarm Classic: a container clustering system. Not to be confused with Docker Swarm which is at https://github.com/docker/swarmkit
Apache License 2.0

Every second request fails to route within Docker swarm #2855

Closed (chrissound closed this issue 4 years ago)

chrissound commented 6 years ago

I'm not sure how to investigate this issue. I deployed a set of services to a swarm using a command like:

docker-compose -f docker-compose-swarm.yml config | docker stack deploy -c - testing1

This is a single node cluster (on a single machine).

Why is every second request failing to route?

telnet 127.0.0.1 8091
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
^C^C^CConnection closed by foreign host.

And on a second attempt:

telnet 127.0.0.1 8091
Trying 127.0.0.1...
telnet: Unable to connect to remote host: No route to host
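
The alternating pattern in the two telnet sessions can also be recorded in a loop rather than by hand. A minimal sketch, assuming Python 3 is available on the host; `probe` and `summarize` are hypothetical helper names, not part of Docker or any tool mentioned in this thread:

```python
import socket
from typing import List


def probe(host: str, port: int, attempts: int = 10, timeout: float = 2.0) -> List[bool]:
    """Open `attempts` fresh TCP connections and record which ones succeed."""
    results = []
    for _ in range(attempts):
        try:
            # One new connection per attempt, like re-running telnet by hand.
            with socket.create_connection((host, port), timeout=timeout):
                results.append(True)
        except OSError:
            # Covers "No route to host", refused connections, and timeouts.
            results.append(False)
    return results


def summarize(results: List[bool]) -> str:
    """Render the results as a compact string of OK / FAIL markers."""
    return " ".join("OK" if ok else "FAIL" for ok in results)
```

Calling `summarize(probe("127.0.0.1", 8091))` targets the same address and port as the telnet sessions above. A strict OK/FAIL alternation would match the "every second request" description, while a different pattern would suggest another cause.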

Within the docker-compose-swarm.yml I have a network used and defined as:

networks:
  navigator-imagegallery-network:
    driver: overlay

tcpdump on first request (successful):

sudo tcpdump  -v -i any tcp port 8091
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
16:13:17.573503 IP (tos 0x0, ttl 64, id 1247, offset 0, flags [DF], proto TCP (6), length 60)
    archamd.60974 > 172.30.0.2.jamlink: Flags [S], cksum 0x586e (incorrect -> 0x078a), seq 831371862, win 43690, options [mss 65495,sackOK,TS val 2928057139 ecr 0,nop,wscale 7], length 0
16:13:17.573516 IP (tos 0x0, ttl 64, id 1247, offset 0, flags [DF], proto TCP (6), length 60)
    archamd.60974 > 172.30.0.2.jamlink: Flags [S], cksum 0x586e (incorrect -> 0x078a), seq 831371862, win 43690, options [mss 65495,sackOK,TS val 2928057139 ecr 0,nop,wscale 7], length 0
16:13:17.573603 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.30.0.2.jamlink > archamd.60974: Flags [S.], cksum 0x586e (incorrect -> 0x44e0), seq 623836494, ack 831371863, win 27960, options [mss 1410,sackOK,TS val 1733128342 ecr 2928057139,nop,wscale 7], length 0
16:13:17.573603 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.30.0.2.jamlink > localhost.localdomain.60974: Flags [S.], cksum 0x2b50 (incorrect -> 0x71fe), seq 623836494, ack 831371863, win 27960, options [mss 1410,sackOK,TS val 1733128342 ecr 2928057139,nop,wscale 7], length 0
16:13:17.573621 IP (tos 0x0, ttl 64, id 1248, offset 0, flags [DF], proto TCP (6), length 52)
    archamd.60974 > 172.30.0.2.jamlink: Flags [.], cksum 0x5866 (incorrect -> 0xdf5c), ack 1, win 342, options [nop,nop,TS val 2928057139 ecr 1733128342], length 0
16:13:17.573623 IP (tos 0x0, ttl 64, id 1248, offset 0, flags [DF], proto TCP (6), length 52)
    archamd.60974 > 172.30.0.2.jamlink: Flags [.], cksum 0x5866 (incorrect -> 0xdf5c), ack 1, win 342, options [nop,nop,TS val 2928057139 ecr 1733128342], length 0
16:13:17.573711 IP (tos 0x0, ttl 64, id 1249, offset 0, flags [DF], proto TCP (6), length 138)
    archamd.60974 > 172.30.0.2.jamlink: Flags [P.], cksum 0x58bc (incorrect -> 0xe5f0), seq 1:87, ack 1, win 342, options [nop,nop,TS val 2928057139 ecr 1733128342], length 86

tcpdump on second request (failure):

sudo tcpdump  -v -i any tcp port 8091
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
16:14:33.896311 IP (tos 0x0, ttl 64, id 17252, offset 0, flags [DF], proto TCP (6), length 60)
    archamd.32944 > 172.30.0.2.jamlink: Flags [S], cksum 0x586e (incorrect -> 0xeed3), seq 2620107720, win 43690, options [mss 65495,sackOK,TS val 2928133462 ecr 0,nop,wscale 7], length 0
16:14:33.896324 IP (tos 0x0, ttl 64, id 17252, offset 0, flags [DF], proto TCP (6), length 60)
    archamd.32944 > 172.30.0.2.jamlink: Flags [S], cksum 0x586e (incorrect -> 0xeed3), seq 2620107720, win 43690, options [mss 65495,sackOK,TS val 2928133462 ecr 0,nop,wscale 7], length 0
16:14:34.939687 IP (tos 0x0, ttl 64, id 17253, offset 0, flags [DF], proto TCP (6), length 60)
    archamd.32944 > 172.30.0.2.jamlink: Flags [S], cksum 0x586e (incorrect -> 0xeac0), seq 2620107720, win 43690, options [mss 65495,sackOK,TS val 2928134505 ecr 0,nop,wscale 7], length 0
16:14:34.939702 IP (tos 0x0, ttl 64, id 17253, offset 0, flags [DF], proto TCP (6), length 60)
    archamd.32944 > 172.30.0.2.jamlink: Flags [S], cksum 0x586e (incorrect -> 0xeac0), seq 2620107720, win 43690, options [mss 65495,sackOK,TS val 2928134505 ecr 0,nop,wscale 7], length 0
16:14:36.987820 IP (tos 0x0, ttl 64, id 17254, offset 0, flags [DF], proto TCP (6), length 60)
    archamd.32944 > 172.30.0.2.jamlink: Flags [S], cksum 0x586e (incorrect -> 0xe2bf), seq 2620107720, win 43690, options [mss 65495,sackOK,TS val 2928136554 ecr 0,nop,wscale 7], length 0
16:14:36.987829 IP (tos 0x0, ttl 64, id 17254, offset 0, flags [DF], proto TCP (6), length 60)
    archamd.32944 > 172.30.0.2.jamlink: Flags [S], cksum 0x586e (incorrect -> 0xe2bf), seq 2620107720, win 43690, options [mss 65495,sackOK,TS val 2928136554 ecr 0,nop,wscale 7], length 0

eishzar commented 6 years ago

I see the same issue.

chrissound commented 6 years ago

@eishzar could you point out any similarity in your config? What OS are you using? I'm on Linux.

davidk1977 commented 6 years ago

I also see the same issue. I am running an nginx container bound to port 80. Testing the connection, I get the following:

First connection:
time_namelookup: 0.068883
time_connect: 0.179350
time_appconnect: 0.000000
time_pretransfer: 0.179784
time_redirect: 0.000000
time_starttransfer: 0.399246
time_total: 0.399968

Second connection: Failed to connect to port 80: Network is unreachable
time_namelookup: 0.005579
time_connect: 0.000000
time_appconnect: 0.000000
time_pretransfer: 0.000000
time_redirect: 0.000000
time_starttransfer: 0.000000
time_total: 5.223892

This is a single-node swarm running nginx on the default overlay network. OS: Ubuntu 16.04.2 LTS

Docker Server Version: 18.02.0-ce

eishzar commented 6 years ago

@chrissound I am also using a single-node swarm on an Ubuntu server with an overlay network, like @davidk1977. The Docker version is 18.03.0-ce. The swarm stack has multiple services in the same overlay network, and every second request, regardless of which service is accessed, fails with a "no route to host" error.

chrissound commented 6 years ago

Good to know. I was also using a single node.

strawgate commented 6 years ago

I am having this same issue in Docker for Windows with swarm.

Every other request from Windows is failing with no response and I see the following errors in my docker logs:

[12:23:24.742][VpnKit ][Error ] vpnkit.exe: Hvsock.read: An established connection was aborted by the software in your host machine.
[12:23:40.415][VpnKit ][Error ] vpnkit.exe: Hvsock.read: An established connection was aborted by the software in your host machine.

Within the Docker swarm, if I try to curl against the box, I get a response every other time. When I do not get a response, I get: curl: (7) Failed to connect to remotehost port 80: No route to host

strawgate commented 6 years ago

I rolled back to 17.12 and no longer have this issue.

I will try to come up with a minimum viable reproduction case -- looks like the criteria to cause this bug are:

  1. Container with an exposed port
  2. The container is part of a docker stack deployed with swarm
  3. The container goes down at least once

I think there is something else here, but I have to try a couple more things. When this occurs, DNS resolution in the container does not point to the container directly; it points to a VIP, which correctly routes only every other request to the destination container. If you hit the container directly by its IP (and not the VIP), 100% of requests complete successfully.
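
The symptom described above (every other request through the VIP fails, every request to the container IP succeeds) is what a round-robin balancer would produce if its backend list held one live entry and one stale one. That cause is an assumption and is not confirmed anywhere in this thread. A minimal simulation of the idea, with hypothetical addresses and a hypothetical `simulate_vip` helper:

```python
from itertools import cycle
from typing import List, Set


def simulate_vip(backends: List[str], live: Set[str], requests: int) -> List[bool]:
    """Send `requests` connections round-robin over `backends`.

    A connection succeeds only if the chosen backend is in `live`.
    """
    rr = cycle(backends)
    return [next(rr) in live for _ in range(requests)]


# Hypothetical addresses: 10.0.0.5 is the running task, 10.0.0.4 is a
# leftover entry for a task that went down and was never removed.
backends = ["10.0.0.5", "10.0.0.4"]
live = {"10.0.0.5"}

print(simulate_vip(backends, live, 6))      # [True, False, True, False, True, False]
print(simulate_vip(["10.0.0.5"], live, 4))  # [True, True, True, True]
```

The first call models requests through the VIP and the second models requests sent straight to the container IP. This only shows that a stale backend entry would be consistent with the reports; it does not show that this is what Docker actually does here.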

BretFisher commented 6 years ago

Hey y'all, this is the wrong repo for SwarmKit. This repo is for Swarm "classic", which was replaced by Swarm Mode in 2016. Docker Engine issues should all be filed in moby/moby. Also, if it's a VPNKit error on Docker for Windows, it should go in the for-win repo.

Mobe91 commented 6 years ago

I think I found a corresponding existing issue that is already in the right place: https://github.com/moby/moby/issues/35671