dcos / examples

DC/OS examples
Apache License 2.0

What happens when Traefik instances are restarted on DC/OS? #363

Open hbceylan opened 6 years ago

hbceylan commented 6 years ago

What happens when Traefik instances are restarted on DC/OS? Our microservices become unreachable. How can I handle this?

[Two screenshots attached: "screen shot 2018-06-06 at 21 52 09" and "screen shot 2018-06-06 at 21 52 50"]

judithpatudith commented 6 years ago

Hi! To get community help with this, would you mind posting to either the users mailing list (users@dcos.io) or Slack at chat.dcos.io? I don't know much about Traefik, but you might find someone there who does 🙂

ryadav88 commented 6 years ago

@deric ^

deric commented 6 years ago

@hbceylan Which Traefik package version are you using?

In the latest version there's a healthcheck configured on $PORT0:

  "healthChecks": [
    {
      "gracePeriodSeconds": 20,
      "intervalSeconds": 5,
      "maxConsecutiveFailures": 2,
      "portIndex": 0,
      "timeoutSeconds": 2,
      "delaySeconds": 15,
      "protocol": "MESOS_HTTP",
      "path": "/ping"
    }
  ],

In your case it appears that port 80 ("portIndex": 0) serves public connections and does not respond to /ping (the healthcheck request). Port 8080 is probably the "admin" interface entrypoint, which is configured to respond to healthchecks. Judging from the screenshot, you should probably use:

      "portIndex": 1,

or reorder the ports so that the healthchecks pass (check the error log). Also, when you use:

  "upgradeStrategy": {
    "minimumHealthCapacity": 0.5
  },

it means that you'll need at least 2 public nodes, because you're allocating the fixed ports 80, 443 and 8080, which can't be assigned to more than one instance on the same node at the same time. When restarting the app, Marathon will kill one instance, stage its replacement and wait until its healthcheck passes, then restart the remaining instance(s).
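
The restart arithmetic behind "minimumHealthCapacity" can be sketched roughly as follows. This is only an illustration, not Marathon's actual code: the function names are hypothetical, and it assumes that Marathon rounds the product of instance count and capacity up to a whole instance.

```python
import math


def min_healthy_instances(instances: int, minimum_health_capacity: float) -> int:
    """Instances that must stay healthy during a restart.

    Assumption: the fractional result is rounded up to a whole instance.
    """
    return math.ceil(instances * minimum_health_capacity)


def killable_immediately(running: int, minimum_health_capacity: float) -> int:
    """Instances that may be killed right away without dropping below the minimum."""
    return max(0, running - min_healthy_instances(running, minimum_health_capacity))


if __name__ == "__main__":
    # Compare a two-instance and a single-instance deployment at capacity 0.5.
    for count in (2, 1):
        print(count, min_healthy_instances(count, 0.5), killable_immediately(count, 0.5))
```

Under these assumptions, two instances with a capacity of 0.5 leave one instance that can be killed first, which frees its fixed ports on that node for the replacement. With a single instance nothing can be killed up front, so the replacement would have to start on a second public node where ports 80, 443 and 8080 are still free.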