Local consul agent cannot reach consul server via RPC on a Swarm

lucj commented 7 years ago

I'm trying ContainerPilot on a Swarm and get an error from one service because its local Consul agent cannot connect to the Consul server.

The Compose file I'm using


version: '3.3'
services:
  consul:
    image: consul:0.9.2
    command: agent -server -client=0.0.0.0 -bootstrap -ui -bind '{{ GetInterfaceIP "eth0"  }}'
    dns:
      - 127.0.0.1
    networks:
      - appnet
    ports:
      - "8500:8500"
  api:
    image: myorg/api
    command: ["containerpilot"]
    networks:
      - appnet
  db:
    image: autopilotpattern/mongodb
    networks:
      - appnet
    volumes:
      - mongo-data:/data/db
volumes:
  mongo-data:
networks:
  appnet:

The db service, based on autopilotpattern/mongodb, is correctly registered in Consul, but the api service is not.

In a manage.sh file in the api service, I have added a sed command to set the bind_addr of the consul agent.


#!/bin/sh

event=$1
echo "Received event:[$event]"

if [ "$event" = "prestart" ];then

  # Update the Consul '-advertise' address to use the interface ContainerPilot was told to listen on
  echo "IP set for current API container: ${CONTAINERPILOT_API_IP}"
  sed -i "s/IP_ADDRESS/${CONTAINERPILOT_API_IP}/" /config/consul.json

  # Wait for the db to be available
  while [ "$(curl -s http://localhost:8500/v1/health/service/mongodb-replicaset | grep passing)" = "" ]
  do
    echo "db is not yet healthly..."
    sleep 5
  done
  echo "db is healthly, moving on..."
  exit 0
fi

# If db not accessible anymore, restart the api service
if [ "$event" = "db-change" ];then
  pkill -SIGHUP node
fi
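
The wait loop in the prestart branch never gives up, so when the local agent cannot reach the server the container keeps printing the same message indefinitely. A minimal sketch of a bounded wait, assuming a POSIX shell; `wait_until` and `db_is_passing` are hypothetical helper names, not part of ContainerPilot or of the original manage.sh:

```sh
#!/bin/sh

# wait_until ATTEMPTS DELAY COMMAND...
# Hypothetical helper: runs COMMAND until it succeeds, at most ATTEMPTS times,
# sleeping DELAY seconds between tries. Returns 0 on success, 1 on giving up.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Same check as the original loop: look for a passing entry in the local
# agent's health response for the mongodb-replicaset service.
db_is_passing() {
  curl -s http://localhost:8500/v1/health/service/mongodb-replicaset | grep -q passing
}

# Example use inside the prestart branch (60 tries, 5 seconds apart):
#   wait_until 60 5 db_is_passing || { echo "db never became healthy"; exit 1; }
```

Failing the prestart step after a fixed number of tries makes a networking problem visible as a failed task instead of an endless loop. The retry count and delay here are arbitrary.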

But I got the following error from its logs:


2017/09/01 17:05:26 [ERR] consul: RPC failed to server 10.0.0.7:8300: rpc error: failed to get conn: dial tcp 10.0.0.2:0->10.0.0.7:8300: i/o timeout
2017-09-01T17:05:26.992751673Z     2017/09/01 17:05:26 [ERR] http: Request GET /v1/health/service/mongodb-replicaset?passing=1, error: rpc error: failed to get conn: dial tcp 10.0.0.2:0->10.0.0.7:8300: i/o timeout from=127.0.0.1:52830
2017-09-01T17:05:26.993280146Z failed to query mongodb-replicaset: Unexpected response code: 500 (rpc error: failed to get conn: dial tcp 10.0.0.2:0->10.0.0.7:8300: i/o timeout) []
2017-09-01T17:05:26.993973624Z     2017/09/01 17:05:26 [ERR] consul: RPC failed to server 10.0.0.7:8300: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2017-09-01T17:05:26.994033652Z     2017/09/01 17:05:26 [ERR] agent: failed to sync changes: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2017-09-01T17:05:26.994062418Z     2017/09/01 17:05:26 [ERR] consul: RPC failed to server 10.0.0.7:8300: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2017-09-01T17:05:26.994080896Z     2017/09/01 17:05:26 [ERR] consul: RPC failed to server 10.0.0.7:8300: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2017-09-01T17:05:26.99410307Z     2017/09/01 17:05:26 [ERR] agent: Coordinate update error: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2017-09-01T17:05:26.994125344Z     2017/09/01 17:05:26 [ERR] http: Request GET /v1/health/service/mongodb-replicaset, error: rpc error: failed to get conn: rpc error: lead thread didn't get connection from=127.0.0.1:52942
2017-09-01T17:05:26.996276493Z db is not yet healthly...

The value set for the CONTAINERPILOT_API_IP env var is 10.0.0.2.


/app # cat /proc/15/environ | tr \\0 "\n"
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=75448ab9cf09
VERSION=v6.9.4
NPM_VERSION=3
LAST_UPDATED=20170515T152500
CONTAINERPILOT_VER=3.3.0
CONTAINERPILOT=/etc/containerpilot.json5
HOME=/root
CONTAINERPILOT_PID=10
CONTAINERPILOT_API_IP=10.0.0.2
CONTAINERPILOT_CONTAINERPILOT_IP=10.0.0.2

If I check the interfaces of the api service I get 2 IPs, one for the container and one for the service. Could the wrong IP be used in this case?


259: eth0@if260:  mtu 1450 qdisc noqueue state UP
    link/ether 02:42:0a:00:00:03 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.3/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.0.0.2/32 scope global eth0
       valid_lft forever preferred_lft forever

lucj commented 7 years ago

It seems to work better when the IP used is the /24 one. For test purposes I've changed manage.sh so that the preStart step sets the IP as follows:


IP=$(ip a | grep '/24' | cut -d'/' -f1 | awk '{print $2}')
sed -i "s/IP_ADDRESS/$IP/" /config/consul.json

Ugly, I know, but it's just for the test :) I no longer get the connection error when this IP is used. What would be your recommendation?
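
A slightly less fragile variant of the same workaround, offered only as a sketch: it parses the one-line output of `ip -o -4 addr show` and compares the prefix length field, instead of grepping for the string `/24` anywhere in the output. `pick_ip_by_prefix` is a hypothetical helper name, and the use of eth0 for the overlay network is an assumption taken from the interface listing earlier in this thread.

```sh
#!/bin/sh

# pick_ip_by_prefix PREFIXLEN
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and prints
# the first IPv4 address whose prefix length equals PREFIXLEN.
pick_ip_by_prefix() {
  awk -v want="$1" '
    $3 == "inet" {
      split($4, parts, "/")   # parts[1] = address, parts[2] = prefix length
      if (parts[2] == want) { print parts[1]; exit }
    }'
}

# Example use in manage.sh (assumes the overlay network is on eth0):
#   IP=$(ip -o -4 addr show dev eth0 | pick_ip_by_prefix 24)
#   sed -i "s/IP_ADDRESS/$IP/" /config/consul.json
```

This still depends on the overlay subnet being a /24, so it is a workaround for testing rather than a general answer.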

Without the error I can go one step further, but then the api cannot connect to mongo. It seems the mongo instance is not elected PRIMARY (it remains in the OTHER state). Do you think this is also linked to a network / IP problem?

lucj commented 7 years ago

In the containerpilot.json5, I have added the interfaces directive in the API job so the correct IP is retrieved:


      interfaces: [
        "192.168.0.0/16",
        "10.0.0.0/24",
        "eth0",
        "eth1",
      ]

I was hoping the IP on subnet 10.0.0.0/24 would be taken instead of the /32 one, but it's not working as expected: the IP 10.0.0.4/32 is still the one advertised.

lucj commented 7 years ago

@tgross any idea why the IP on 10.0.0.0/24 is not the one advertised (and thus set in CONTAINERPILOT_API_IP)?

tgross commented 7 years ago

> @tgross any idea why the IP on 10.0.0.0/24 is not the one advertised (and thus set in CONTAINERPILOT_API_IP)?

The interfaces spec searches for matches in the order listed, but for each entry it walks the addresses in the order the interfaces are returned by the stdlib, sorted alphabetically by interface name and then by IP address (lexicographically by bytes). In other words, each spec entry is matched against every interface address before moving on to the next entry.
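
To make the ordering concrete, here is a rough shell sketch of that rule under two assumptions the thread does not confirm: that a CIDR entry matches any address contained in that network (regardless of the prefix length configured on the interface), and that the candidate addresses arrive already sorted. `ip_to_int`, `ip_in_cidr` and `first_match` are hypothetical names for illustration; this is not ContainerPilot's code.

```sh
#!/bin/sh

# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# ip_in_cidr ADDRESS NETWORK/BITS: succeed if ADDRESS lies inside the network.
ip_in_cidr() {
  bits=${2#*/}
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "${2%/*}") & mask )) ]
}

# first_match "SPEC..." "ADDRESS...": outer loop over specs, inner loop over
# the (pre-sorted) addresses; print the first address that matches.
first_match() {
  for spec in $1; do
    for addr in $2; do
      if ip_in_cidr "$addr" "$spec"; then
        echo "$addr"
        return 0
      fi
    done
  done
  return 1
}

# Two CIDR specs from the reported config, two eth0 addresses in byte order.
first_match "192.168.0.0/16 10.0.0.0/24" "10.0.0.2 10.0.0.3"
```

Under those assumptions the call prints 10.0.0.2: both eth0 addresses fall inside 10.0.0.0/24, and the lower one (the /32 service VIP) sorts first. That would be consistent with the /32 address being the one advertised, but it is a reading of the behavior, not a confirmed diagnosis.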

Your different examples are mismatched and have the same interface ID, so maybe there's a typo when you added them on GitHub that's making it hard for me to tell the exact behavior you're seeing. But I suspect it has to do with the order the interfaces are being returned.

Also, just FYI, I'm no longer the lead on this project, having moved on to a new gig at density.io. I'm still contributing, but pinging me will not get you the fastest results anymore!

lucj commented 7 years ago

First of all, congrats on your new gig. I'm not sure what you mean by the mismatched examples, though. I observed the expected behavior (the service is able to connect to the Consul server) when the IP 10.0.0.3/24 is selected, but not when the IP 10.0.0.2/32 is selected. This is probably due to the service VIP (in Swarm), but I'm not 100% sure.

jwreagor commented 7 years ago

Hello there @lucj, I'm not positive what the ContainerPilot problem is here besides selecting the appropriate IP address. Do you think there's a bug in CP or is this deriving out of the networking setup that Swarm is providing you?

lucj commented 7 years ago

Hi, well, I'm not sure this is a problem with CP, but I cannot get the correct IP by specifying the subnet I need in the interfaces key. My understanding is that by specifying 10.0.0.0/24 as the first entry in interfaces, the IP in this subnet is the one that should be selected. On the other hand, the networking specifics of Swarm might make the selection harder, as 2 IPs are available in the same subnet (but with different prefix lengths). I can use some fairly ugly tricks (grep, sed, ...) to make sure I get the /24 IP, but I'm looking for a better way to do this using CP. Any suggestions?

jwreagor commented 7 years ago

I definitely understand the context of the problem, the intersection of how ContainerPilot chooses an interface and advertises its IP address, but I'm uncertain of the exact cause or solution. It appears that the interface (or IP address?) is chosen by the first matching CIDR block in lexical ordering (as @tgross has mentioned).

Shot in the dark but it might help to separate your control plane from your data plane by using separate interfaces (if that's possible for you).

$ docker swarm init --advertise-addr 10.0.0.1 --datapath-addr 11.0.0.1

lucj commented 7 years ago

@cheapRoc I have not tried separating the control and data planes yet, but I figured out that using "endpoint_mode: dnsrr" fixes the problem, since no VIP is associated with the service anymore. For each service defined with this mode, the correct IP (the /24 one) is retrieved 👍

The remaining problem is on the Consul server service. As I publish port 8500 for the Consul server, another IP is added for it (on 10.0.255.0/24 by default). This IP is sometimes on eth0 and sometimes on eth2, which makes it quite hard to know which interface to pass to GetInterfaceIP when running it.

version: '3.3'
services:
  consul:
    image: consul:0.9.2
    command: agent -server -client=0.0.0.0 -bootstrap -ui -bind '{{ GetInterfaceIP "eth0"  }}'
    dns:
      - 127.0.0.1
    networks:
      - appnet
    ports:
      - "8500:8500"
  api:
    image: myorg/api
    command: ["containerpilot"]
    networks:
      - appnet
    deploy:
      endpoint_mode: dnsrr
   ...
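
Since the interface name on the Consul server is not stable, one option is to select the bind address by network rather than by name. The `-bind` value above is already a go-sockaddr template (`GetInterfaceIP`), and the same template language provides `GetPrivateInterfaces`, `include "network"` and `attr "address"`. The sketch below only builds the template string. `bind_template` is a hypothetical helper, the 10.0.0.0/24 subnet is taken from the addresses shown earlier in this thread, and the approach assumes a single address of the container falls in that network (with a /32 VIP on the same subnet, which address is returned would depend on go-sockaddr's ordering).

```sh
#!/bin/sh

# bind_template NETWORK
# Hypothetical helper: prints a go-sockaddr template that selects the address
# of a private interface inside NETWORK, whatever the interface is called.
bind_template() {
  printf '{{ GetPrivateInterfaces | include "network" "%s" | attr "address" }}' "$1"
}

# Example (not run here): pass the template to the agent instead of a fixed
# interface name.
#   consul agent -server -client=0.0.0.0 -bootstrap -ui \
#     -bind "$(bind_template 10.0.0.0/24)"
```

In a Compose file the template string could be written inline in the `command`, in place of `{{ GetInterfaceIP "eth0" }}`.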

jwreagor commented 7 years ago

@lucj Sounds like you're on the right path. How many interfaces/IPs are presented to your Consul service? If it's just one, remove the -bind argument altogether. Would that help?

lucj commented 7 years ago

@cheapRoc Seems like it's more complicated with the consul service though.

With the above configuration, there are several interfaces:

one linked to "appnet" with 2 IPs (same subnet but 2 CIDR, /24 /32)
one linked to docker_gwbridge (with IP in subnet 172.18.0.0/16)
one linked to ingress network (due to the publication of port 8500).

As I could not get the appnet /24 IP, I've changed the configuration so it:

uses endpoint_mode: dnsrr => no more VIP address in appnet (the one in /32)
does not publish the port 8500 to the external world => no more interface on the ingress network

That leaves 2 interfaces, each with one private IP: one on appnet and one on docker_gwbridge.

I then need to use the one on appnet as the Consul agent's advertise address. I'm making progress with this configuration:

    image: consul:0.9.2
    command: agent -server -client=0.0.0.0 -bootstrap -ui -bind '{{ GetInterfaceIP "eth0"  }}'
    dns:
      - 127.0.0.1
    networks:
      - xsole
    deploy:
      endpoint_mode: dnsrr

I'll go in this direction and see how this goes with the whole application 👍 What do you think?

jwreagor commented 7 years ago

@lucj Apologies again for not getting back. What was your experience like with Swarm and ContainerPilot after your last update? Were you able to bind to the correct interface?

jwreagor commented 7 years ago

Going to close this since there were no other issues that came out of it. Let us know if you find anything else, thanks!

lucj commented 7 years ago

@cheapRoc I was not able to get it working on a Swarm. I will give it another try and come back to you soon. I still have some problems with the selection of the correct interface.

TritonDataCenter / containerpilot

Local consul agent cannot reach consul server via RPC on a Swarm #503