allegro / marathon-consul

Integrates Marathon apps with Consul service discovery.
Apache License 2.0

Unexpected response code: 500 (rpc error: No path to datacenter): Sync not working, Services not being de-registered #285

Closed tomwganem closed 6 years ago

tomwganem commented 6 years ago

I am running

marathon-consul: 1.4.2 (3 instances on 3 separate servers)
consul: 0.9.3 (3 instances running in server mode on 3 separate servers, plus 24 client instances, one on each mesos-slave)
marathon: 1.3.12 (3 instances on 3 separate servers)

I am running a federated consul cluster, with two datacenters:

root@files-qa-tor01-master-08dc3739-ede8-0207-3d55-325d4ab5754a:~# /opt/consul/consul-0.9.3/bin/consul members -wan
Node                                                              Address              Status  Type    Build  Protocol  DC     Segment
files-qa-dal13-master-2af5a24e-dda1-59be-1172-1dcc769de9b5.dal13  10.187.109.102:8302  alive   server  0.9.3  2         dal13  <all>
files-qa-dal13-master-b812d0d0-a4ac-b576-e354-2f4c3b2820da.dal13  10.187.109.120:8302  alive   server  0.9.3  2         dal13  <all>
files-qa-dal13-master-d5c6bf5b-14da-c9de-5c46-5d7c69d0c358.dal13  10.187.109.106:8302  alive   server  0.9.3  2         dal13  <all>
files-qa-tor01-master-08dc3739-ede8-0207-3d55-325d4ab5754a.dc1    10.115.173.188:8302  alive   server  0.9.3  2         dc1    <all>
files-qa-tor01-master-50cd0a6e-4e1e-c825-abd6-25fc1bf2f357.dc1    10.115.173.176:8302  alive   server  0.9.3  2         dc1    <all>
files-qa-tor01-master-a909f5e9-f41e-9254-f607-136f574834a6.dc1    10.115.173.171:8302  alive   server  0.9.3  2         dc1    <all>

root@files-qa-tor01-master-08dc3739-ede8-0207-3d55-325d4ab5754a:~# /opt/consul/consul-0.9.3/bin/consul members
Node                                                        Address              Status  Type    Build  Protocol  DC   Segment
files-qa-tor01-master-08dc3739-ede8-0207-3d55-325d4ab5754a  10.115.173.188:8301  alive   server  0.9.3  2         dc1  <all>
files-qa-tor01-master-50cd0a6e-4e1e-c825-abd6-25fc1bf2f357  10.115.173.176:8301  alive   server  0.9.3  2         dc1  <all>
files-qa-tor01-master-a909f5e9-f41e-9254-f607-136f574834a6  10.115.173.171:8301  alive   server  0.9.3  2         dc1  <all>
files-qa-tor01-agent-19a699d5-f05f-0986-08b8-72fda1e1e8bf   10.115.173.181:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-1a6f4d43-0974-c638-1318-2fdec6dd4028   10.166.141.136:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-2088e381-4961-6d00-44f2-99306cb5926e   10.115.173.190:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-39e2601b-5bbe-dfeb-c039-5679d0123826   10.115.173.172:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-4601732a-3517-1328-5bf5-9684b6c8d400   10.115.12.70:8301    alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-51f7b81f-49eb-6de7-8fc1-6634d7026cbe   10.166.141.141:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-5d3bede3-b622-f565-2a11-25fb6530374e   10.115.173.168:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-5d54ba85-ffe5-3534-700f-84f2e8694ef2   10.115.12.105:8301   alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-60ea501f-0b9e-25fb-4149-229121684229   10.115.173.175:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-62a42fb6-3684-ecb0-42de-2c01c5bdb18e   10.166.141.132:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-67b76cad-8d2c-d1e6-0409-d3dfea79f934   10.166.141.139:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-67b9fbb5-e24c-aab3-5d83-74a667572e3b   10.115.173.180:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-6cda494b-46b8-906f-adcf-8627bbf95238   10.115.173.170:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-6e8b875d-e992-20c1-9998-4b626a12369a   10.115.12.109:8301   alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-8f1b0cbd-a3e4-712c-ef28-aec8861136d5   10.115.173.184:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-9df96c59-6cb6-f75a-5a96-acac1a5cf0eb   10.115.173.185:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-a8e4bfe5-cc11-535d-e477-1d359fcbb658   10.115.173.183:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-b29edfb3-39e5-bf18-bd95-b09fe6d8eac1   10.115.173.174:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-b7eeba0f-b244-ba57-bc51-f7d566e6edad   10.115.12.115:8301   alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-c33dbb49-9ddc-58c4-7871-23c1d16805e6   10.115.173.167:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-c349a018-b464-3152-f633-3fa9c4627d22   10.115.173.182:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-c812940c-7f53-4a8f-6042-01af7c0eb0ad   10.115.173.179:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-d193bf37-eb9b-5e5e-aa4d-98d249cc0b09   10.115.173.178:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-agent-eed07ba3-09f7-03a4-f4e2-72cd964761fa   10.115.173.189:8301  alive   client  0.9.3  2         dc1  <default>
files-qa-tor01-infra-f2c6eaa5-e314-c5ac-c869-d62861c31c41   10.115.173.187:8301  alive   client  0.9.3  2         dc1  <default>

This is the error I'm getting when starting marathon-consul:

time="2018-05-25T05:51:20Z" level=info msg="Starting marathon-consul" Version=1.4.2
{"level":"info","msg":"Sentry DSN is not configured - Sentry will be disabled","time":"2018-05-25T05:51:20Z"}
{"level":"info","msg":"Sending metrics to stdout","time":"2018-05-25T05:51:20Z"}
{"Force":false,"Interval":"15m0s","Leader":"marathon.service.internal.asperafiles.com:8080","level":"info","msg":"Marathon-consul sync job started","time":"2018-05-25T05:51:20Z"}
{"Port":":4000","level":"info","msg":"Listening","time":"2018-05-25T05:51:20Z"}
{"Id":0,"level":"info","msg":"Starting worker","time":"2018-05-25T05:51:20Z"}
{"Id":1,"level":"info","msg":"Starting worker","time":"2018-05-25T05:51:20Z"}
{"Id":2,"level":"info","msg":"Starting worker","time":"2018-05-25T05:51:20Z"}
{"Id":3,"level":"info","msg":"Starting worker","time":"2018-05-25T05:51:20Z"}
{"Id":4,"level":"info","msg":"Starting worker","time":"2018-05-25T05:51:20Z"}
{"Id":5,"level":"info","msg":"Starting worker","time":"2018-05-25T05:51:20Z"}
{"Id":6,"level":"info","msg":"Starting worker","time":"2018-05-25T05:51:20Z"}
{"Id":7,"level":"info","msg":"Starting worker","time":"2018-05-25T05:51:20Z"}
{"Id":8,"level":"info","msg":"Starting worker","time":"2018-05-25T05:51:20Z"}
{"Id":9,"level":"info","msg":"Starting worker","time":"2018-05-25T05:51:20Z"}
{"Location":"localhost:8080","level":"debug","msg":"Asking Marathon for leader","time":"2018-05-25T05:51:20Z"}
{"Location":"localhost:8080","Protocol":"http","Uri":"/v2/leader","level":"debug","msg":"Sending GET request to marathon","time":"2018-05-25T05:51:20Z"}
{"level":"debug","msg":"Node has leadership","time":"2018-05-25T05:51:20Z"}
{"level":"info","msg":"Syncing services started","time":"2018-05-25T05:51:20Z"}
{"Location":"localhost:8080","level":"debug","msg":"Asking Marathon for apps","time":"2018-05-25T05:51:20Z"}
{"Location":"localhost:8080","Protocol":"http","Uri":"/v2/apps?embed=apps.tasks\u0026label=consul","level":"debug","msg":"Sending GET request to marathon","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.185:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.181:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.172:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.182:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.166.141.136:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.12.109:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.12.115:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.178:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.180:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.166.141.141:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.168:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.167:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.170:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.183:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.175:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.184:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.179:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.174:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.190:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.166.141.139:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.12.70:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.189:8500","BasicAuthEnabled":false,"Scheme":"http","SslVerificationEnabled":true,"Timeout":3000000000,"TokenEnabled":false,"level":"debug","msg":"Creating Consul client","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.189","error":"Unexpected response code: 500 (rpc error: No path to datacenter)","level":"error","msg":"An error occurred getting services from Consul, retrying locally or with another agent","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.181","error":"Unexpected response code: 500 (rpc error: No path to datacenter)","level":"error","msg":"An error occurred getting services from Consul, retrying locally or with another agent","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.184","error":"Unexpected response code: 500 (rpc error: No path to datacenter)","level":"error","msg":"An error occurred getting services from Consul, retrying locally or with another agent","time":"2018-05-25T05:51:20Z"}
{"Address":"10.115.173.184","error":"Unexpected response code: 500 (rpc error: No path to datacenter)","level":"error","msg":"An error occurred getting services from Consul, retrying locally or with another agent","time":"2018-05-25T05:51:21Z"}
{"Address":"10.115.173.178","error":"Unexpected response code: 500 (rpc error: No path to datacenter)","level":"error","msg":"An error occurred getting services from Consul, retrying locally or with another agent","time":"2018-05-25T05:51:21Z"}
{"Address":"10.115.12.109","error":"Unexpected response code: 500 (rpc error: No path to datacenter)","level":"error","msg":"An error occurred getting services from Consul, retrying locally or with another agent","time":"2018-05-25T05:51:21Z"}
{"error":"Can't get Consul services: An error occurred getting services from Consul. Giving up","level":"error","msg":"An error occured while performing sync","time":"2018-05-25T05:51:21Z"}
{"Location":"localhost:8080","level":"debug","msg":"Asking Marathon for leader","time":"2018-05-25T05:51:21Z"}
{"Location":"localhost:8080","Protocol":"http","Uri":"/v2/leader","level":"debug","msg":"Sending GET request to marathon","time":"2018-05-25T05:51:21Z"}
{"Host":"localhost:8080","Method":"GET","URI":"/v2/events?event_type=status_update_event\u0026event_type=health_status_changed_event","level":"debug","msg":"Subsciption success","time":"2018-05-25T05:51:21Z"}
{"Location":"localhost:8080","level":"debug","msg":"Asking Marathon for leader","time":"2018-05-25T05:51:26Z"}
{"Location":"localhost:8080","Protocol":"http","Uri":"/v2/leader","level":"debug","msg":"Sending GET request to marathon","time":"2018-05-25T05:51:26Z"}
{"Location":"localhost:8080","level":"debug","msg":"Asking Marathon for leader","time":"2018-05-25T05:51:31Z"}
{"Location":"localhost:8080","Protocol":"http","Uri":"/v2/leader","level":"debug","msg":"Sending GET request to marathon","time":"2018-05-25T05:51:31Z"}
{"Location":"localhost:8080","level":"debug","msg":"Asking Marathon for leader","time":"2018-05-25T05:51:36Z"}
{"Location":"localhost:8080","Protocol":"http","Uri":"/v2/leader","level":"debug","msg":"Sending GET request to marathon","time":"2018-05-25T05:51:36Z"}
{"Location":"localhost:8080","level":"debug","msg":"Asking Marathon for leader","time":"2018-05-25T05:51:41Z"}
{"Location":"localhost:8080","Protocol":"http","Uri":"/v2/leader","level":"debug","msg":"Sending GET request to marathon","time":"2018-05-25T05:51:41Z"}
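To see at a glance which agents rejected the request, the JSON log lines above can be tallied by Address and error. This is a minimal sketch, assuming the log was saved with one entry per line; `summarize_errors` is a hypothetical helper name, not part of marathon-consul, and lines that are not JSON (such as the first startup line) are skipped.

```python
import json
from collections import Counter


def summarize_errors(lines):
    """Count error-level log entries per (Address, error) pair."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON, e.g. the key=value style startup line
        if not isinstance(entry, dict) or entry.get("level") != "error":
            continue
        # Entries without an Address field are grouped under "<none>".
        key = (entry.get("Address", "<none>"), entry.get("error", ""))
        counts[key] += 1
    return counts
```

Applied to the excerpt above, the six per-agent error lines fall under five addresses (10.115.173.184 appears twice), and the final "Giving up" entry, which has no Address field, is counted under `<none>`.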

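One symptom I'm seeing is that services are not being de-registered. Marathon reports 3 running tasks for the app, but the Consul catalog holds 41 entries for the api service: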

root@files-qa-tor01-master-08dc3739-ede8-0207-3d55-325d4ab5754a:~# curl -s "http://marathon.service.internal.asperafiles.com:8080/v2/apps//files-qa/api/puma?embed=app.taskStats&embed=app.readiness" --netrc-file ~/.netrc | jq '.app.tasksRunning'
3

root@files-qa-tor01-master-08dc3739-ede8-0207-3d55-325d4ab5754a:~# curl -s http://localhost:8500/v1/catalog/service/api | jq '.|length'
41
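The mismatch between the two counts above (3 running tasks in Marathon, 41 entries in the Consul catalog) can be quantified, and the catalog entries grouped by node address to see where the surplus sits. This is a rough sketch, assuming the parsed JSON array from /v1/catalog/service/api is passed in as a list of dicts and that each catalog entry corresponds to one task registration; both function names are hypothetical.

```python
from collections import Counter


def registrations_per_address(catalog_entries):
    """Count catalog entries per node address (the "Address" field)."""
    return Counter(entry["Address"] for entry in catalog_entries)


def surplus_registrations(catalog_entries, tasks_running):
    """Number of catalog entries beyond what Marathon reports as running."""
    return max(0, len(catalog_entries) - tasks_running)
```

With the figures reported here, 41 catalog entries against 3 running tasks leaves 38 registrations that Marathon no longer accounts for.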

root@files-qa-tor01-master-08dc3739-ede8-0207-3d55-325d4ab5754a:~# curl -s http://localhost:8500/v1/catalog/service/api | jq '.[]|.Address'
"10.115.173.181"
"10.115.173.181"
"10.115.173.190"
"10.115.173.172"
"10.115.173.172"
"10.115.173.172"
"10.115.173.172"
"10.115.173.168"
"10.115.173.168"
"10.115.173.168"
"10.115.173.168"
"10.115.173.175"
"10.115.173.175"
"10.115.173.175"
"10.115.173.180"
"10.115.173.170"
"10.115.173.170"
"10.115.173.170"
"10.115.12.109"
"10.115.12.109"
"10.115.12.109"
"10.115.12.109"
"10.115.173.185"
"10.115.173.185"
"10.115.173.185"
"10.115.173.183"
"10.115.173.183"
"10.115.173.183"
"10.115.173.183"
"10.115.173.183"
"10.115.173.174"
"10.115.173.174"
"10.115.173.174"
"10.115.173.179"
"10.115.173.179"
"10.115.173.179"
"10.115.173.179"
"10.115.173.189"
"10.115.173.189"
"10.115.173.189"
"10.115.173.189"

This is my marathon-consul config:

{
  "Consul": {
    "Auth": {
      "Enabled": false,
      "Username": "",
      "Password": ""
    },
    "ConsulNameSeparator": ".",
    "Port": "8500",
    "SslEnabled": false,
    "SslVerify": true,
    "SslCert": "",
    "SslCaCert": "",
    "Token": "",
    "Tag": "marathon",
    "Timeout": "3s",
    "AgentFailuresTolerance": 3,
    "RequestRetries": 5,
    "IgnoredHealthChecks": "",
    "EnableTagOverride": false,
    "LocalAgentHost": ""
  },
  "Web": {
    "Listen": ":4000",
    "QueueSize": 1000,
    "WorkersCount": 10,
    "MaxEventSize": 4096
  },
  "SSE": {
    "Retries": 0,
    "RetryBackoff": "0s"
  },
  "Sync": {
    "Enabled": true,
    "Interval": "15m0s",
    "Leader": "marathon.service.internal.asperafiles.com:8080",
    "Force": false
  },
  "Marathon": {
    "Leader": "02.marathon.service.internal.asperafiles.com:8080",
    "Location": "localhost:8080",
    "Protocol": "http",
    "Username": "",
    "Password": "",
    "VerifySsl": false,
    "Timeout": "30s"
  },
  "Metrics": {
    "Target": "stdout",
    "Prefix": "default",
    "Interval": "30s",
    "Addr": ""
  },
  "Log": {
    "Level": "debug",
    "Format": "json",
    "File": "",
    "Sentry": {
      "DSN": "",
      "Env": "",
      "Timeout": "1s",
      "Level": "error"
    }
  }
}

Here are some of the labels that I'm using on my service:

root@files-qa-tor01-master-08dc3739-ede8-0207-3d55-325d4ab5754a:~# curl -s "http://marathon.service.internal.asperafiles.com:8080/v2/apps//files-qa/api/puma?embed=app.taskStats&embed=app.readiness" --netrc-file ~/.netrc | jq '.app.labels'
{
  "urlprefix-qa.asperafiles.com/api": "tag",
  "urlprefix-api.fabio.service.internal.asperafiles.com:31002/": "tag",
  "urlprefix-api.qa.asperafiles.com/": "tag",
  "urlprefix-qa.ibmaspera.com/assets/": "tag",
  "urlprefix-api.qa.ibmaspera.com/": "tag",
  "consul": "api",
  "urlprefix-qa.ibmaspera.com/api/": "tag"
}

dankraw commented 6 years ago

Hi @tomwganem

Unexpected response code: 500 (rpc error: No path to datacenter)

This looks like a Consul problem. The agent (used by marathon-consul to communicate with Consul) is not able to send any RPC messages to the Consul servers, as it has not yet joined the cluster correctly. Please check the agents' health by calling their HTTP API (for example, check whether you can fetch services from the Consul catalog) or through the Consul UI.
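A check along these lines could be scripted. The sketch below is only an illustration, not something marathon-consul provides: it asks each agent for the service list through Consul's /v1/catalog/services HTTP endpoint and collects the agents that fail. The names `http_get` and `check_agents` are hypothetical, port 8500 and the 3-second timeout are taken from the config in this issue, and the `fetch` parameter exists only so the logic can be exercised without a network.

```python
import urllib.error
import urllib.request


def http_get(url, timeout=3.0):
    """Return (status, body) for a GET request; HTTP errors are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


def check_agents(addresses, port=8500, fetch=http_get):
    """Ask each agent for the service catalog and collect the failures."""
    failures = {}
    for address in addresses:
        url = f"http://{address}:{port}/v1/catalog/services"
        try:
            status, body = fetch(url)
        except OSError as err:  # connection refused, timeout, DNS failure
            failures[address] = f"unreachable: {err}"
            continue
        if status != 200:
            failures[address] = f"HTTP {status}: {body.strip()}"
    return failures
```

Calling `check_agents(["10.115.173.189", "10.115.173.181"])` against the agents from the log would return an empty dict once every agent can reach the servers; an agent still answering with a 500 would show up with the response body as its value.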

tomwganem commented 6 years ago

It was a Consul problem. The Consul ports were inadvertently closed.

However, there is still an issue with running in a federated Consul setup. We run federated Consul to stitch together two Mesos cluster installations running in an active-passive configuration. When we run marathon-consul in our passive site, against a Consul cluster that is configured not to be in "dc1", it de-registers all services in our active site ("dc1") whenever it does a sync.

tomwganem commented 6 years ago

Should I close this ticket and create a new ticket for this federated consul issue?

tomwganem commented 6 years ago

Closing. The new issue is #286.

dankraw commented 6 years ago

@tomwganem thanks for the explanation :)

jessp01 commented 11 months ago

In my particular case (Consul used with Prometheus), the datacenter field in /etc/consul.d/consul.json contained upper-case letters; lower-casing everything fixed the issue. I thought I'd mention it in case anyone else encounters the same problem.
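A quick way to spot this condition is to load the config and compare the datacenter value with its lower-cased form. This is a small sketch, assuming the file is JSON with a top-level `datacenter` key as described above; `datacenter_has_uppercase` is a hypothetical helper name.

```python
import json


def datacenter_has_uppercase(config_text):
    """Return True if the `datacenter` value contains upper-case letters."""
    config = json.loads(config_text)
    datacenter = config.get("datacenter", "")
    return datacenter != datacenter.lower()
```

For example, passing the contents of /etc/consul.d/consul.json to this function would return True for a value such as "DC1" and False for "dc1" or when the key is absent.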