QubitProducts / bamboo

HAProxy auto configuration and auto service discovery for Mesos Marathon
Apache License 2.0
794 stars 214 forks source link

bamboo stalls during deployment once haproxy config is invalid. #136

Closed rasputnik closed 9 years ago

rasputnik commented 9 years ago

our ops do something a bit - odd - during deploys;

when the old app is destroyed, bamboo thinks the app has a port of '0' and so generates an invalid haproxy config. at this point bamboo stops responding to webhooks so the config is never regenerated and we lose service.

Arguably this is pilot error but I'm having a hell of a job figuring out how bamboo becomes unresponsive.

Wondering if anyone's seen anything similar? I'm about to break out the sysdig on a test instance, but some pointers would help.

j1n6 commented 9 years ago

Thanks for reporting this. Are you using the multi-port support configuration?

rasputnik commented 9 years ago

No, this is just a single port (the commented out 'service port' example is mine). Seems like at some point in the dance, the serviceports start coming back from bamboo as '0' (which I'm guessing is the Go 'nil value' for integers). Marathon is available during this deployment as far as I can make out.

I've disabled one of the gateway servers on the environment in question today so I'll run bamboo on there and wait for it to break, will report back if that turns up any more information.

j1n6 commented 9 years ago

Thanks for the update. It sounds like an edge case we need to filter out. If you can reproduce the problem with specific steps, the fix might be easy to find.

rasputnik commented 9 years ago

The isolated gateway broke yesterday afternoon. I'm still not 100% sure how it got into this state (was offsite at the time annoyingly) but here's how it looks now.

the haproxy config is certainly invalid e.g. we have :

    listen ::appname-2.2.3-rc1_18005
  bind *:18005
  mode http

  balance leastconn
  capture request header X-Forwarded-For len 15

    server ::appname-2.2.3-rc1-slavename-31911 slavename:31911

listen ::appname-f48cea4_0
  bind *:0
  mode http

  balance leastconn
  capture request header X-Forwarded-For len 15

    server ::appname-f48cea4-slavename-31687 slavename:31687

(where 2.2.3-rc1 is the version they've just deployed, and 'f48cea4' is the old scaled down one).

It's not being 'healed' because bamboo state is invalid (from a curl to 0:8000/api/state ):

  {
    "Env": {},
    "ServicePorts": [
      18007
    ],
    "ServicePort": 18007,
    "Tasks": [
      {
        "Ports": [
          31340
        ],
        "Port": 31340,
        "Host": "slavename"
      }
    ],
    "HealthCheckPath": "",
    "EscapedId": "::appname-2.2.3-rc1",
    "Id": "/appname-2.2.3-rc1"
  },
  {
    "Env": null,
    "ServicePorts": null,
    "ServicePort": 0,
    "Tasks": [
      {
        "Ports": [
          31710
        ],
        "Port": 31710,
        "Host": "slavename"
      }
    ],
    "HealthCheckPath": "",
    "EscapedId": "::appname-f48cea4",
    "Id": "/appname-f48cea4"
  },

The logs show webhooks arriving every 30 seconds as normal, but the state remains invalid.

Now it's in a broken state I've traced it (using sysdig with the echo_fds chisel) and can see this happening:

rasputnik commented 9 years ago

Just doublechecked marathons state - although there is no longer an entry for the old app under /v2/apps, the old task is still showing up at /v2/tasks:

...
  {
    "version": "2015-04-09T07:59:25.530Z",
    "stagedAt": "2015-04-13T14:40:40.584Z",
    "startedAt": "2015-04-13T14:52:32.956Z",
    "ports": [
      31687
    ],
    "host": "slavename",
    "id": "appname-f48cea4.0e8f164d-e1eb-11e4-bf47-001a4aa850cf",
    "appId": "/appname-f48cea4"
  },
...
    {
      "servicePorts": [
        18005
      ],
      "version": "2015-06-15T13:58:54.000Z",
      "stagedAt": "2015-06-15T13:59:21.431Z",
      "startedAt": "2015-06-18T11:20:51.431Z",
      "ports": [
        31911
      ],
      "host": "slavename",
      "id": "appname-2.2.3-rc1.b8e586c3-1366-11e5-bf47-001a4aa850cf",
      "appId": "/appname-2.2.3-rc1"
    },

(Notice that there's no servicePort on that first 'f48c...' task, and the appId is no longer present in /v2/apps)

I don't know if this is a Marathon bug (I'm running 0.7.5)? Either way, would it be hard to only add tasks into bamboo if the associated app exists?

j1n6 commented 9 years ago

First thing I suggest is upgrading to marathon 0.8.1 (This is currently what we are running). We upgraded Marathon from the 0.7.x releases to 0.8.0 then 0.8.1 smoothly.

I also agree that it might be better to check associated apps exists.

rasputnik commented 9 years ago

It's on my todo list, don't worry :)

If I was to send a patch for this I can see a few ways to tackle this, which would you prefer?

in reverse order of effort :

I'd prefer the last option, with a caveat that I'm not 100% sure when that query param was added - so it may not work on older marathons.

j1n6 commented 9 years ago

The second option might be a better alternative. Your proposed fix doesn't affect future roadmap.

We are internally re-designing a new way of how bamboo works not just with haproxy, but any number of exports. Essentially, it's possible to say MyApp can be loadbalacned via HAProxy, DNS services (for udp, long TCP connection), or configuring specialised proxy for Redis clusters.

On 18 Jun 2015, at 15:01, Dick Davies notifications@github.com wrote:

It's on my todo list, don't worry :)

If I was to send a patch for this I can see a few ways to tackle this, which would you prefer?

in reverse order of effort :

a check in the template for service port == 0 some sanity checking in services/marathon/marathon.go to only return tasks with associated apps a rework to get apps and tasks in one request using a GET to /v2/apps/?embed=apps.tasks I'd prefer the last option, with a caveat that I'm not 100% sure when that query param was added - so it may not work on older marathons.

— Reply to this email directly or view it on GitHub.

rasputnik commented 9 years ago

ok, i'll go with the second option then - the redesign sounds interesting too - expect a PR tomorrow sometime, it's probably quicker for me to develop against a known-bad marathon deployment than try to reproduce it here.

rasputnik commented 9 years ago

Should have added - managed to reproduce my original problem on a set of test VMs, and can confirm this resolves the issue when there are 'floating' tasks in the marathon REST responses.

rasputnik commented 9 years ago

Sorry to nag but any objections to merging this in? This version of the code has been running well for us all week whereas previous release crashes each time we deploy.

j1n6 commented 9 years ago

I had a quickly look over the code. One minor issue with the m_app variable name should be camalCase.

We are moving office at the moment, please expect my slow response..

On 26 Jun 2015, at 10:45, Dick Davies notifications@github.com wrote:

Sorry to nag but any objections to merging this in? This version of the code has been running well for us all week whereas previous release crashes each time we deploy.

— Reply to this email directly or view it on GitHub.

rasputnik commented 9 years ago

Thanks, I'll have another crack at it then.

rasputnik commented 9 years ago

Tweaked that commit to camelCase the new variables. Thanks!