docker / for-win

Bug reports for Docker Desktop for Windows
https://www.docker.com/products/docker#/windows

Hybrid swarm routing mesh not working #3052

Closed stevenmiller closed 5 years ago

stevenmiller commented 5 years ago

Howdy!

I have a two-node hybrid swarm, with one Ubuntu Linux 18.04 node acting as manager and one Windows 2019 node participating as a worker. The sample environment we are testing consists entirely of Windows containers. We are able to reach the published port for these services through the Windows node's IP (ex. 10.20.1.121:8888) but not through the Linux node's IP (ex. 10.20.1.122:8888). Our understanding of the ingress routing mesh is that this should work fairly seamlessly, as it routes traffic within the swarm to the correct node.

This issue also occurs with the swarmpit UI stack we have deployed to the swarm. The UI (10.20.1.122:888) cannot be reached through the Windows node IP (10.20.1.121:888).

Expected behavior

Services running on one node should be accessible from the published port on another node

Actual behavior

Published Windows container services are not accessible from the Linux node, and vice versa
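To make the expected versus actual behavior concrete, here is a minimal sketch of a reachability check that can be run from any client machine. The helper name check_port is made up for illustration; it relies only on bash's /dev/tcp redirection and the coreutils timeout command, and the addresses in the usage comment are the ones quoted in this report.

```shell
# check_port HOST PORT: try a TCP connection and report the result.
# Hypothetical helper, not part of Docker. The connection attempt runs in
# a child bash so that /dev/tcp is available even if this file is sourced
# from another shell.
check_port() {
  local host="$1" port="$2"
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} reachable"
  else
    echo "${host}:${port} unreachable"
  fi
}

# Example usage with the addresses from this report:
#   check_port 10.20.1.121 8888   # Windows worker
#   check_port 10.20.1.122 8888   # Linux manager
```

With a working routing mesh, both calls would be expected to report "reachable" regardless of which node hosts the task.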

Information

Published services and ports:

image

Both the environment we have pulled from a private repo and the swarmpit stack exhibit this behavior. I am unable to reach the swarmpit UI from port 888 on the Windows node IP.

Steps to reproduce the behavior

Here is the stack with our private repo removed:

teststack.zip

Let me know if I can provide any other info.

olljanat commented 5 years ago

@stevenmiller first off, I can see that this is your first issue on GitHub, so let me say welcome :)

Then some side comments (which you don't need to reply to, but it may be a good idea to check them out):

As for this issue: IMO your stack file matches what the documentation says should work, but in real life it looks like the Windows implementation of overlay networking and the components behind it (for example, hcsshim) have some undocumented weaknesses.

stevenmiller commented 5 years ago

@olljanat Thanks for the tips! We are revising the apps to run what we can on Linux containers. At least one service requires Windows, so we are stuck with it for the time being. Is there any further troubleshooting you can advise with this current setup, or is this likely just an issue we will have to wait out?

stevenmiller commented 5 years ago

Just as a note: with my org's Microsoft SA agreement we get support, so I have a technical support case open regarding this issue. I will update with the resolution (if one is found).

olljanat commented 5 years ago

> At least one service requires Windows so we are stuck with it for the time being. Is there any further troubleshooting you can advise with this current setup or is likely just an issue we will have to wait out?

Depending on your application architecture (which apps need to be able to talk to which apps), you can, for example, place the Windows and Linux containers on the same overlay network so they can talk to each other inside of it.
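The suggestion above could be sketched roughly as follows. This is an illustration under stated assumptions, not a verified fix: the network name app-net, the service names, and the nginx image are placeholders, and microsoft/iis is simply the image used elsewhere in this thread. The node.platform.os constraint pins each service to nodes of the matching OS.

```shell
# Set DOCKER=echo to print the commands instead of running them.
DOCKER="${DOCKER:-docker}"

# Hypothetical helper: one shared overlay network, one service per platform.
deploy_shared_overlay() {
  # Swarm-scoped overlay network that both services attach to.
  "$DOCKER" network create --driver overlay app-net

  # Linux service, scheduled only on Linux nodes.
  "$DOCKER" service create --name web-linux --network app-net \
    --constraint node.platform.os==linux nginx

  # Windows service, scheduled only on Windows nodes.
  "$DOCKER" service create --name iis-windows --network app-net \
    --constraint node.platform.os==windows microsoft/iis
}
```

Inside app-net the services would address each other by service name, which avoids the published-port path entirely; whether that is enough depends on which apps need to be reachable from outside the swarm.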

> Just as a note: with my org's Microsoft SA agreement we get support, so I have a technical support case open regarding this issue. I will update with the resolution (if one is found).

OK. It will be really interesting to hear whether they solve this issue, as my earlier experiences with the level of Microsoft support have not been too positive.

stevenmiller commented 5 years ago

Here's the final email from MS support, included below. Unfortunately there is no resolution at this time. However, we did find that if you deploy a container and publish its default port unchanged (say 80 for IIS), it is reachable through both nodes without a problem. It seems to be an issue with the port translation between hosts.
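For anyone wanting to try the same-port observation, a minimal sketch might look like this. The helper name and the service name test-same-port are made up for illustration; the image is the one used in the lab setup from MS support.

```shell
# Set DOCKER=echo to print the command instead of running it.
DOCKER="${DOCKER:-docker}"

# Hypothetical helper: publish PORT on the routing mesh with the same
# published and target port, so no port translation is needed.
create_same_port_service() {
  local port="$1"
  "$DOCKER" service create --name test-same-port \
    --publish "published=${port},target=${port}" \
    microsoft/iis
}

# Example: create_same_port_service 80
```

This only sidesteps the problem for services that can be published on their native port.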

lab setup

PS C:\Users\Administrator> docker version
Client:
 Version: 18.09.0
 API version: 1.39
 Go version: go1.10.3
 Git commit: 33a45cd0a2
 Built: unknown-buildtime
 OS/Arch: windows/amd64
 Experimental: false

// create overlay network
PS C:\Users\TEMP> docker network create --driver=overlay overlay-nw
vqixaimowmg9ch6byz089jmph

PS C:\Users\TEMP> docker network ls
NETWORK ID     NAME         DRIVER    SCOPE
08ick3kig81k   ingress      overlay   swarm
4e5485733a1d   nat          nat       local
8ab446910e24   none         null      local
vqixaimowmg9   overlay-nw   overlay   swarm

// Init Swarm on manager node
PS C:\Users\TEMP> docker swarm init --advertise-addr=10.168.179.56 --listen-addr 10.168.179.56:2377

// Join the swarm as a worker on the Windows worker node
PS C:\Users\Administrator.WUXI> docker swarm join --token SWMTKN-1-5lazzfmd1wyft4egtnzope30i1n744qlw99bj79qtd4ccdv2hg-b4j4fkki75ud34c9n5fvmnffw 10.168.179.56:2377

// Join the swarm as a worker on the Linux worker node
admin@ubuntu-6011:~$ docker swarm join --token SWMTKN-1-5lazzfmd1wyft4egtnzope30i1n744qlw99bj79qtd4ccdv2hg-b4j4fkki75ud34c9n5fvmnffw 10.168.179.56:2377

PS C:\Users\TEMP> docker node ls
ID                            HOSTNAME      STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
e99kxpcgwhkgtte7dwwko5pl0 *   WS2019        Ready    Active         Leader           18.09.0
whjthcivqkvpj5oz45w6xg3de     WS2019-B      Ready    Active                          18.09.0
7mpt4wn2ioi4atmtmkxdu7q6u     ubuntu-6011   Ready    Active                          18.09.0

// create service on master node
docker service create --name=test --endpoint-mode vip --network=overlay-nw microsoft/iis cmd
docker service update --publish-add published=8080,target=80 test
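Before testing from each node, it can help to confirm what the swarm has actually recorded for the service's port mapping. A minimal sketch, assuming the service name test from the lab setup above (the helper name is made up for illustration):

```shell
# Set DOCKER=echo to print the command instead of running it.
DOCKER="${DOCKER:-docker}"

# Hypothetical helper: print the port configuration of a service as JSON.
show_published_ports() {
  "$DOCKER" service inspect --format '{{json .Endpoint.Ports}}' "$1"
}

# Example: show_published_ports test
```

For the service above, the JSON would be expected to contain a PublishedPort of 8080 and a TargetPort of 80; if it does, the mapping is registered correctly and the fault lies in how the nodes apply it.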

Problem repro and analysis

The test reproduces a packet flow similar to the customer's. The problem is that the Linux node did not translate the published port 8080 to the target port 80.

Failed Packet flow: Client -> Linux 10.168.179.42:8080[10.255.0.4] -> Windows 10.168.179.56:8080 [10.255.0.3:8080]

Success Packet flow: Client -> Windows 10.168.179.55:8080[10.255.0.3] -> Windows 10.168.179.56:80 [10.255.0.3:80]

Summary

Based on our test results and network traffic analysis: in a Docker swarm mixing Windows and Linux nodes, we published the IIS service as 8080:80. When we visit the site through a Windows node on port 8080, the traffic is forwarded to IIS container port 80. However, when we visit the site through the Linux node on port 8080, the port is not translated to the real IIS container port 80.

Suggestion

After discussion, and according to the log analysis results, there seem to be some compatibility issues on the Linux node when translating the published port to the real container port. Since this may be related to the Linux system's compatibility with the Docker swarm ingress network, we recommend contacting the Linux vendor to look into the issue in more depth.
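As a starting point for looking at the Linux side, one could dump the NAT rules Docker installs for ingress traffic on the Linux node. This is only a sketch: it assumes the node uses iptables, that Docker has created its DOCKER-INGRESS chain in the nat table, and that the command is run as root. The helper name is made up for illustration.

```shell
# Set IPTABLES to another command to stub this out; defaults to iptables.
IPTABLES="${IPTABLES:-iptables}"

# Hypothetical helper: list the host's NAT rules for swarm ingress traffic,
# with numeric addresses and ports.
show_ingress_nat() {
  "$IPTABLES" -t nat -n -L DOCKER-INGRESS
}

# Example (as root on the Linux node): show_ingress_nat
```

A rule mentioning the published port 8080 would show where the Linux node sends ingress traffic first, which can then be compared with the failed packet flow above.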

olljanat commented 5 years ago

Btw is this related to moby/moby#38484 ?

docker-robott commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale comment. Stale issues will be closed after an additional 30d of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.

/lifecycle stale
