docker-archive / for-aws


Conflict between an existing VPC CIDR and the docker network subnets #117


mRoca commented 6 years ago

Expected behavior

Be able to communicate between an AWS Docker Swarm stack and a 172.17.0.0/16 VPC.

Actual behavior

Situation: a Docker Swarm stack is deployed in a vpc_d VPC (10.3.xxx.xxx addresses), connected (through VPC peering) to an existing vpc_a VPC that uses the 172.17.0.0/16 CIDR.

The problem: it is impossible to reach a vpc_a IP (172.17.xxx.xxx) from a swarm node.

By default, a new docker network gets a 172.xxx.0.0/16 subnet, where xxx is the first available value starting from 17. When a swarm node (or a manager) is created, the docker engine creates a default bridge network; by default its subnet is 172.17.0.0/16. During the swarm install, a docker_gwbridge bridge network is created by the docker4x/init-aws container; by default its subnet is 172.18.0.0/16. When we then create a new docker network, its subnet is therefore 172.19.0.0/16.
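You can check this allocation order on a node with docker network inspect (a quick sketch; the foo network is just an example):

docker network inspect bridge --format '{{(index .IPAM.Config 0).Subnet}}'            # 172.17.0.0/16
docker network inspect docker_gwbridge --format '{{(index .IPAM.Config 0).Subnet}}'   # 172.18.0.0/16
docker network create foo
docker network inspect foo --format '{{(index .IPAM.Config 0).Subnet}}'               # 172.19.0.0/16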

When a docker container tries to access a 172.17.xxx.xxx IP, the host has the 172.17.0.0/16 dev docker0 src 172.17.0.1 route, so the packets will never leave the docker network.
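The offending route is the stock one docker installs for the docker0 bridge; you can see it with ip route (output shown for illustration):

ip route show 172.17.0.0/16
# 172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1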

Some solutions

The first and ugly solution: add a new proxy instance

In order to avoid updating the CloudFormation template or the Docker Swarm AMI, it is possible to create a new "proxy" instance in the vpc_d VPC and to use its 10.3.xxx.xxx IP address instead of the real vpc_a one. The proxy can be built with a simple iptables rule, for example: iptables -t nat -A PREROUTING -p tcp -i eth0 --dport ${port} -j DNAT --to ${vpc_a_ip}:${port}.
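A minimal sketch of such a proxy (the port 5432 and the 172.17.12.34 target are hypothetical; a lone DNAT rule is not enough, return traffic must also be masqueraded so replies come back through the proxy):

# Allow the instance to forward packets:
sysctl -w net.ipv4.ip_forward=1
# Rewrite inbound connections to the vpc_a target:
iptables -t nat -A PREROUTING -p tcp -i eth0 --dport 5432 -j DNAT --to 172.17.12.34:5432
# Masquerade so the target answers through the proxy, not directly:
iptables -t nat -A POSTROUTING -p tcp -d 172.17.12.34 --dport 5432 -j MASQUERADE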

This solution is cumbersome and not easy to manage.

A better solution: update the CloudFormation template

It is easy to configure the bridge docker network by adding a bip value to the /etc/docker/daemon.json file, which changes its default 172.17.0.1/16 address. But the problem remains the same: the first network created afterwards (the docker_gwbridge one, in our case) will take the subnet that was freed.
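For reference, the bip change alone would look like this (a sketch; on the Docker for AWS AMI the file is actually rebuilt by the UserData script shown below):

cat > /etc/docker/daemon.json <<'EOF'
{ "bip": "172.18.0.1/16" }
EOF
rc-service docker restart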

The solution we have found is to "burn" one IP address per reserved range in the ip route table, in order to prevent docker from creating a network subnet on it. For example, by running ip route add 172.18.255.254 dev lo; before the swarm init, the docker_gwbridge network will get the 172.19.0.0/16 subnet, as 172.18.0.0/16 is considered already in use.
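Applied by hand, the trick looks like this (a sketch; run it before the swarm init):

# Occupy 172.18.0.0/16 so dockerd skips it:
ip route add 172.18.255.254 dev lo
# After the swarm init, docker_gwbridge lands on the next free range:
docker network inspect docker_gwbridge --format '{{(index .IPAM.Config 0).Subnet}}'   # 172.19.0.0/16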

This is a working version of the CloudFormation template's UserData script value:

@@ -1491,8 +1526,12 @@
  "echo \"localhost: $EXTERNAL_LB\" >> /var/lib/docker/editions/elb.config\n",
  "echo \"default: $EXTERNAL_LB\" >> /var/lib/docker/editions/elb.config\n",
  "\n",
+ "sh -c \"$PRE_INIT_SCRIPT\"\n",
+ "\n",
  "echo '{\"experimental\": '$DOCKER_EXPERIMENTAL', \"labels\":[\"os=linux\", \"region='$NODE_REGION'\", \"availability_zone='$NODE_AZ'\", \"instance_type='$INSTANCE_TYPE'\", \"node_type='$NODE_TYPE'\" ]' > /etc/docker/daemon.json\n",
  "\n",
+ "echo ', \"bip\": \"'$DOCKER_BRIDGE_BIP'\"' >> /etc/docker/daemon.json\n",
+ "\n",
  "if [ $ENABLE_CLOUDWATCH_LOGS == 'yes' ] ; then\n",
  "   echo ', \"log-driver\": \"awslogs\", \"log-opts\": {\"awslogs-group\": \"'$LOG_GROUP_NAME'\", \"tag\": \"{{.Name}}-{{.ID}}\" }}' >> /etc/docker/daemon.json\n",
  "else\n",
 "   echo ' }' >> /etc/docker/daemon.json\n",
 "fi\n",
 "\n",
 "chown -R docker /home/docker/\n",
 "chgrp -R docker /home/docker/\n",
 "rc-service docker restart\n",

With the following new CloudFormation template parameters:

PRE_INIT_SCRIPT="ip route add 172.17.255.254 dev lo;"
DOCKER_BRIDGE_BIP="172.18.0.1/16"

Here, we must both add the ip route AND specify the bip value, because by the time this script runs the default bridge network has already been created.

With this solution, as the 172.17.0.0/16 range is burned by the 172.17.255.254 dev lo route, it is no longer available for docker networks. The bip value then moves the default bridge subnet after the docker service restart.

The best solution: update the template and the docker4x images

It would be really useful to be able to choose a global addressing range for docker networks (such as 10.128.0.0/9), or to reserve some subnets in the CloudFormation template.
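For reference, more recent Docker engines (18.06 and later) expose exactly this kind of knob through default-address-pools in daemon.json; a sketch, assuming the AMI ships such an engine:

cat > /etc/docker/daemon.json <<'EOF'
{
  "default-address-pools": [
    { "base": "10.128.0.0/9", "size": 24 }
  ]
}
EOF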

Do you have another way to fix the problem?

ddebroy commented 6 years ago

Hi @mRoca, thanks for the detailed report and analysis. I think an alternative to "burning" the address may be to configure the docker_gwbridge to a certain subnet as mentioned in the docs: https://docs.docker.com/engine/swarm/networking/#customize-the-docker_gwbridge

Will that help with your configuration?

We can bubble up the bip and docker_gwbridge subnets as parameters in the template.

mRoca commented 6 years ago

If I create the docker_gwbridge before the swarm init by setting my PRE_INIT_SCRIPT parameter to

docker network create --subnet 172.19.0.0/16 --opt com.docker.network.bridge.name=docker_gwbridge --opt com.docker.network.bridge.enable_icc=false --opt com.docker.network.bridge.enable_ip_masquerade=true docker_gwbridge

it will do the job until we create another network. At the first docker network create foo, the docker daemon will pick the first available /16 range as the subnet; here, the foo network will get 172.17.0.0/16. When a network subnet is declared, all packets to the corresponding IP range are routed internally. So it is the same situation all over again (unless we specify ALL new network subnets) :(

ddebroy commented 6 years ago

For the subsequent docker network create commands, you can specify the subnet and various other options (if the defaults do not work for your environment/scenario) as documented here: https://docs.docker.com/engine/reference/commandline/network_create/#specify-advanced-options. Does that help with avoiding the conflict?
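Such a pinned network would look like this, for instance (the name and subnet are hypothetical):

docker network create --driver overlay --subnet 10.200.9.0/24 foo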

jderusse commented 6 years ago

That means the application/stack (the one describing the docker-compose.yaml configuration, for instance) has to know about the infrastructure (to avoid conflicting IP ranges), which IMHO should not be its responsibility. Moreover, two versions of the same stack can't be deployed in the same swarm without keeping a registry of which IPs are already taken.

The solution may help avoid the conflict, but it sounds like a big hack and will be really hard to maintain. The same goes for IPs in containers: most of the time we shouldn't have to define a fixed IP, we just let the infrastructure/docker pick one for us.