TritonDataCenter / containerpilot

A service for autodiscovery and configuration of applications running in containers
Mozilla Public License 2.0
1.13k stars 136 forks source link

Register critical services in consul. #552

Closed eliasbrange closed 6 years ago

eliasbrange commented 6 years ago

Howdy, we currently have about 200 services in consul, and filtering them on only failed ones is a quick way for us to see that everything is up and running. However, containerpilot only registers a service after it has become healthy for the first time due to job.sendHeartbeat() being called in onHealthCheckPassed.

If for some reason the service never gets healthy, the failed filter in consul does not give us that information. Currently, we have to manually search for the service that usually misbehave on startup.

It would be great to register a service as critical in consul in onHealthCheckPassed, but only if the services hasn't been seen before to not mess with the TTL.

jwreagor commented 6 years ago

Initially I saw how this aligns with some of the language in the Consul docs on Initial Health Check Status.

By default, when checks are registered against a Consul agent, the state is set immediately to "critical". This is useful to prevent services from being registered as "passing" and entering the service pool before they are confirmed to be healthy.

I'm not 100% positive if that is compatible with consumers of ContainerPilot today. I'll have to double check but there might be Autopilot Pattern applications that draw a distinction between a service that is unregistered (and now would be marked as initially critical) and a service that is registered, healthy, then made critical. Things like fail over could be signaled from a service being marked as critical. Under this change any service that has not been brought up healthy at all would signal an improper fail over.

This could be left up to the consumer of ContainerPilot so it's possible we could stick this behind an optional setting.

Thoughts?

eliasbrange commented 6 years ago

Yeah this might have side effects for applications that differ between unregistered and critical services as you said. One way could be to either use another status such as warning. Or an optional setting that could either be made "global" or job-specific. Job configuration could for example have

{
    name: 'job',
    health: { ... },
    register_critical: True // Defaults to False
}

This way it should not break any existing applications.

misterbisson commented 6 years ago

I think you're on the right track to create a flag for this behavior, but I have to admit I'm concerned about register_critical. I'm concerned about how users interpret the critical bit.

Whatever the name, the actual intent is that the service be registered as soon as the container starts, before the service is tested as healthy. I don't have any suggestions now, but I'm trying to think of other ways to describe that.

eliasbrange commented 6 years ago

register_unhealthy or register_failed a better name?

That's true. The registration could happen elsewhere. It must happen after the consul agent is up and running though, so it probably has to happen in some of the processEvent methods.

jwreagor commented 6 years ago

Both Consul and Nomad call a similar (if not identical) feature initial_status. I don't think that naming would hurt here but interested in how that sounds.

On where to configure, I would agree that since it's tied to a job's registration it should be on the job itself. Any job can be a service and have an initial status value, or not. I think that plays nicely into this.

misterbisson commented 6 years ago

I think the initial_status suggestion makes sense. I guess the docs would read something like this?

  • initial_status (optional): if provided, ContainerPilot will immediately register the application before the first successful health check with this status.

I'm using both "immediately" and "before the first..." there for emphasis about the difference against the current behavior (and the behavior if the option is unset).

eliasbrange commented 6 years ago

I like the initial_status idea. However, I do not know the best place to call register. If consul isn't started yet the registration will fail, and a retry will need to be done somewhere.

eliasbrange commented 6 years ago

Also, initialStatus instead of initial_status to keep it in line with other config names?

jwreagor commented 6 years ago

Correct, to keep parity across systems and configuration.

My thoughts for the usage of initial_status would be to allow the manual configuration of the initial status value itself along with making this optional in ContainerPilot.

All others options would pass directly through the CP configuration into the Consul registration under health check. These values could include any of the values that Consul accepts including critical, warning, or even passing (which would be api.HealthPassing that we perform today).

The configuration could live under the job's health check. Here's an example...

{
  consul: "localhost:8500",
  jobs: [
    {
      name: "nginx",
      port: 80,
      interfaces: ["eth0"],
      exec: "nginx",
      health: {
        exec: "health-check http",
        interval: 10,
        ttl: 25,
        initial_status: "critical",
      }
  ],
}

Am I conflating things or does that align with your use case? @eliasbrange

jwreagor commented 6 years ago

@misterbisson I think your documentation suggestion stands and we would just note the values that we can set through the Consul API.

  • initial_status (optional): if provided, ContainerPilot will immediately register the application before the first successful health check with this status. Acceptable statuses include critical, warning, or passing.
eliasbrange commented 6 years ago

I've just pushed some updates that you could look at. I added the config under jobs but it could just as well be moved to jobs.health. That might actually make more sense.

jwreagor commented 6 years ago

Haven't had time yet to review latest changes but I'll try to fit it in next week.

eliasbrange commented 6 years ago

Let me know if you'd rather have the configuration under jobs.health.

jwreagor commented 6 years ago

I believe we were all set on that PR so I went ahead and merged. I'll wait till next week to get the minor release out.