CiscoCloud / marathon-consul

bridge Marathon information to Consul KV
Apache License 2.0
85 stars 18 forks source link

support health check information? #24

Open ogrodnek opened 7 years ago

ogrodnek commented 7 years ago

First, thanks for the great mesos and marathon tooling -- it's all been extremely solid and reliable!

I am using marathon-consul along with haproxy-consul for load balancing.

As soon as a task is running, it's added to consul and thus the haproxy config. If an application has a non-trivial start up time, I will start to see 503s. That is, tasks are added to the LB before they are reporting healthy.

I've been trying to figure out the best way to solve this. The app definition in consul does have health check info, so, theoretically I could translate the marathon checks to haproxy checks in my haproxy-consul template. Practically though the marathon checks could be more complex than haproxy can express (e.g. some of the apps are using "COMMAND" marathon health checks). It also seems non-trivial to be able to do the translation in the template, and, there might be a chance haproxy and marathon don't agree.

My current thinking is that it makes sense to leave the health checks evaluated by marathon, and include this information in the KV store (perhaps /<app>/<task>/healthCheck ?)

Then in the template I can check that tasks where their app has a health check is defined is only rendered if the task's health check is passing (i.e. not unknown or failing).

I believe this can be done by adding support for the following event types from the event bus:

and including this new information in the KV.

I am happy to submit a PR for this, and would love to get any feedback on this approach, if there's a better way, etc.

thanks!

ogrodnek commented 7 years ago

I've simplified this a bit, by including a HealthCheckResults field directly into the task (already reported by marathon for GET /v2/apps/<app>/tasks), and updating this in the KV when the health_status_changed_event event is fired.

PR is here: https://github.com/CiscoCloud/marathon-consul/pull/25

would love any feedback, thanks!

bwhaley commented 7 years ago

We are using this in production now and it's working great. Really helps out with the lag time between new tasks being added and actually being alive and healthy. 👍

yogeshnath commented 7 years ago

I recently updated to this version to get rid of 503's, It works great if the healthcheck is http or tcp.

Though if healthcheck is based on command ( say curl), Marathon actually shows it as healthy but consul json has "healthCheckResults" set to null. So now Haproxy fails to pick it up based on healthy value ( "healthy":false).

any reason why?