docker-archive / dockercloud-agent

Agent to manage docker in nodes controlled by Docker Cloud
https://cloud.docker.com/
Apache License 2.0

BYON fails to use reverse tunnel #18

Open crockpotveggies opened 8 years ago

crockpotveggies commented 8 years ago

Environment: Ubuntu Trusty 14.04 on agent 1.1.0.

Using Docker Cloud Bring Your Own Node (BYON). We have some private infrastructure set up behind a router, and none of our nodes has a public IP. The firewall is turned off on each node.

It would appear that the dockercloud-agent is unable to set up a reverse tunnel from our private infrastructure. I've browsed the documentation extensively to determine the settings for our nodes, and as far as I can see we are following everything properly. The only difference is that each node does not have a public IP. Can someone please help clarify what settings need to be in place for a setup like ours?

Below is the agent log from a typical BYON setup:

2016/06/22 11:13:48 UUID has been changed from  to blahblahabc123
2016/06/22 11:13:48 Updating configuration file...
2016/06/22 11:13:48 New TLS certificates generated
2016/06/22 11:13:48 Registering in Docker Cloud via PATCH: https://cloud.docker.com/api/agent/v1/node/blahblahabc123
2016/06/22 11:13:49 Downloading docker binary...
2016/06/22 11:13:49 Downloading docker definition from https://cloud.docker.com/api/tutum/v1/agent/docker/1.9.1-cs2/1.1.0.json
2016/06/22 11:13:49 Downloading docker from https://files.cloud.docker.com/packages/docker/docker-1.9.1-cs2.tgz
2016/06/22 11:13:50 Saving docker to /usr/bin/
2016/06/22 11:13:50 Uncompressing: /usr/bin/._docker
2016/06/22 11:13:50 Uncompressing: /usr/bin/docker
2016/06/22 11:13:51 Found docker: version 1.9.1-cs2
2016/06/22 11:13:51 Initializing docker daemon
2016/06/22 11:13:51 Loading NAT tunnel module
2016/06/22 11:13:51 Verifying the registration with Docker Cloud
2016/06/22 11:13:51 Docker server started. Entering maintenance loop
2016/06/22 11:13:51 Waiting for docker unix socket to be ready
2016/06/22 11:13:51 Starting docker daemon: [/usr/bin/docker daemon -H unix:///var/run/docker.sock -H tcp://0.0.0.0:2375 --userland-proxy=false --tlscert /etc/dockercloud/agent/cert.pem --tlskey /etc/dockercloud/agent/key.pem --tlscacert /etc/dockercloud/agent/ca.pem --tlsverify]
2016/06/22 11:13:51 Docker daemon (PID:2598) has been started
2016/06/22 11:13:53 Docker unix socket opened
2016/06/22 11:13:53 Node blahblahabc123.node.dockerapp.io is publicly reachable
2016/06/22 11:18:55 Node registration to https://cloud.docker.com/ succeeded

Note what we had to do above: we deployed the node, and the NAT tunnel appeared to start setting itself up for a moment. However, after waiting 5 minutes, we opened 2375/tcp on our router, and only then did the node become publicly reachable. Even after we close 2375/tcp again, Docker Cloud considers the node unreachable, but the node itself fails to detect this change in state. If we don't open port 2375, Docker Cloud instead auto-terminates the node.

Thanks for your help!

crockpotveggies commented 8 years ago

I've narrowed the problem to this specific function: https://github.com/docker/dockercloud-agent/blob/master/agent/tunnel.go#L28

The agent incorrectly assumes the node is publicly available. For whatever reason, isNodeReachable returns true in my case (likely because there is no firewall on the node, but the port is ultimately blocked by the router).

To double-check the problem, I purged and reinstalled the agent, and found that it again thinks the node is publicly available, as seen in the log (it never downloads ngrok):

2016/06/22 16:52:47 Found docker: version 1.9.1-cs2
2016/06/22 16:52:47 Initializing docker daemon
2016/06/22 16:52:47 Loading NAT tunnel module
2016/06/22 16:52:47 Verifying the registration with Docker Cloud
2016/06/22 16:52:47 Docker server started. Entering maintenance loop
2016/06/22 16:52:47 Waiting for docker unix socket to be ready
2016/06/22 16:52:47 Starting docker daemon: [/usr/bin/docker daemon -H unix:///var/run/docker.sock -H tcp://0.0.0.0:2375 --userland-proxy=false --tlscert /etc/dockercloud/agent/cert.pem --tlskey /etc/dockercloud/agent/key.pem --tlscacert /etc/dockercloud/agent/ca.pem --tlsverify]
2016/06/22 16:52:47 Docker daemon (PID:31841) has been started
2016/06/22 16:52:49 Docker unix socket opened
2016/06/22 16:52:49 Node abc123.node.dockerapp.io is publicly reachable
2016/06/22 16:57:48 Node registration to https://cloud.docker.com/ timed out
2016/06/22 16:57:48 Node state: Deploying

crockpotveggies commented 8 years ago

Created a PR, https://github.com/docker/dockercloud-agent/pull/20, and tested it on Ubuntu 14.04 in the offending environment. It works.

tifayuki commented 8 years ago

@crockpotveggies

When a node is registered with dockercloud, dockercloud tries to connect to public_ip_of_the_node:2375 to see if the node is publicly reachable. If the port is connectable, dockercloud tells the agent that the node is publicly reachable, and no tunnel is created.
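Below is a minimal Go sketch of that check. It assumes the check amounts to a plain TCP connect with a timeout. `isPortConnectable`, `needsTunnel` and `simulate` are hypothetical names, not functions from Docker Cloud or dockercloud-agent. Loopback listeners stand in for the router. The sketch shows why an open port suppresses the tunnel: any accepted connection counts as "reachable", whichever host actually answers.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// isPortConnectable (hypothetical) reports whether a TCP connection to addr
// can be opened within timeout. It says nothing about WHICH host answered.
func isPortConnectable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// needsTunnel (hypothetical) falls back to a reverse tunnel only when the
// node's public address does not accept TCP connections.
func needsTunnel(publicAddr string) bool {
	return !isPortConnectable(publicAddr, 2*time.Second)
}

// simulate returns needsTunnel's verdict for an "open" and a "closed" port.
func simulate() (openResult, closedResult bool) {
	// Open port: a listener that accepts and drops connections, like a router
	// forwarding 2375/tcp to some host.
	open, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer open.Close()
	go func() {
		for {
			c, err := open.Accept()
			if err != nil {
				return // listener closed
			}
			c.Close()
		}
	}()

	// Closed port: reserve a free port, then release it so nothing listens.
	tmp, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	closedAddr := tmp.Addr().String()
	tmp.Close()

	return needsTunnel(open.Addr().String()), needsTunnel(closedAddr)
}

func main() {
	openResult, closedResult := simulate()
	fmt.Println("port open   -> needs tunnel:", openResult)
	fmt.Println("port closed -> needs tunnel:", closedResult)
}
```

With the port open, no tunnel is requested. With it closed, the tunnel path is taken. This matches the behaviour reported in this thread.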

In your case, if you open the port on the router, no tunnel will be created, because public_ip:2375 is always reachable.

If you close port 2375 on the router, all the nodes should work as expected.

crockpotveggies commented 8 years ago

I can confirm you are right, permanently closing port 2375 on the router forced the agent to tunnel. Isn't this unexpected behaviour though? Shouldn't the node recognize its own UUID before it assumes that it is publicly available?
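One way the check could recognize the node's own identity is sketched below in Go. It rests on loud assumptions: the responder announces a node ID (standing in for the UUID) as one line of text, and the probe counts the node as reachable only if that ID matches its own. The real Docker daemon on 2375 speaks TLS-wrapped HTTP and does nothing of the sort. `serveIdentity`, `reachesSelf`, `simulate` and the line-based protocol are all hypothetical. The sketch also covers a port forwarded to a *different* host running the agent.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"time"
)

// serveIdentity (hypothetical) answers every connection on ln with nodeID
// followed by a newline, then closes the connection.
func serveIdentity(ln net.Listener, nodeID string) {
	for {
		c, err := ln.Accept()
		if err != nil {
			return // listener closed
		}
		fmt.Fprintf(c, "%s\n", nodeID)
		c.Close()
	}
}

// reachesSelf (hypothetical) dials addr and returns true only if the
// responder identifies itself as wantID. A connectable port forwarded to
// some other node therefore does not count as "publicly reachable".
func reachesSelf(addr, wantID string) bool {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return false
	}
	defer conn.Close()
	conn.SetReadDeadline(time.Now().Add(2 * time.Second))
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return false
	}
	return strings.TrimSpace(line) == wantID
}

// simulate starts a listener that claims to be "node-b" and probes it once
// as node-b (itself) and once as node-a (a different node).
func simulate() (selfMatch, otherMatch bool) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go serveIdentity(ln, "node-b")
	addr := ln.Addr().String()
	return reachesSelf(addr, "node-b"), reachesSelf(addr, "node-a")
}

func main() {
	selfMatch, otherMatch := simulate()
	fmt.Println("probe expecting node-b:", selfMatch)
	fmt.Println("probe expecting node-a:", otherMatch)
}
```

The design choice is to compare the responder's identity rather than trust connectivity alone. Under this sketch, node-a would still fall back to a tunnel even though the port it probes is open.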

billiegoose commented 7 years ago

Yes, this is an issue! So, if my BYON is behind a NAT, and I don't have access to the router to permanently close port 2375, how else can I force dockercloud-agent to setup a tunnel?

billiegoose commented 7 years ago

Would installing a firewall (say ufw) on the node do the trick?

crockpotveggies commented 7 years ago

@wmhilton any way you can block traffic to the node (and ensure Docker doesn't accidentally contact another node hosting the agent on that port) will solve your problem. UFW may solve this issue, as long as the port traffic isn't redirected to another node running the agent (which mistakenly responds as the "new" node).

billiegoose commented 7 years ago

Great, blocking port 2375 with ufw does trick dockercloud-agent into opening a tunnel. However, it appears the virtual LAN is not smart enough to also use the tunnel. The node itself is connected to Docker Cloud via the ngrok tunnel, but the applications inside the containers are unable to network. (DNS queries time out.) But that IS progress! Thank you very much @crockpotveggies

crockpotveggies commented 7 years ago

Sounds like you may need to open up some ports in ufw. The underlying network mechanism in Docker Cloud is Weave, so you may be able to get some help from the Weave docs. In my own BYON setup I typically left the local hosts wide open (no ufw), since they were in a protected, firewalled environment.