freeipa / freeipa-container

FreeIPA server in containers — images at https://quay.io/repository/freeipa/freeipa-server?tab=tags

Feature Request: Health Check #516

Closed: ashleyprimo closed this issue 1 year ago

ashleyprimo commented 1 year ago

I think it would be advantageous to add a "simple" health check to the container that maybe looks at ipactl status; I need to take a detailed look, but something along the lines of:

HEALTHCHECK --interval=5m --timeout=1m \
  CMD ipactl status || exit 1
adelton commented 1 year ago

What would be the value of having this "simple" health check? What is the value of having (healthy) printed out? Have you been running FreeIPA containers with --health-cmd for months and did the health indication ever alert you to a problem in the setup before it affected the clients or other parts of the FreeIPA cluster?

I'm not fond of adding this check.

Running ipactl status, which just checks a couple of the systemd services, can easily signal success even if (some functionality of) the FreeIPA server does not operate as expected. People would then rightfully start complaining that docker ps says (healthy) yet something is broken, and that docker ps lies.

From this point of view, adding freeipa-healthcheck to the image and running /usr/bin/ipa-healthcheck || exit 1 might add more value ... but I'm afraid that an attempt to narrow the fairly broad set of checks that it provides to a single boolean result might not be that useful, the user / admin would need to know how to dig into the details of potential unhealthy state anyway, at least using docker inspect.
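For example, the details could be dug out with something like the following (the container name is just an example, and jq is only for readability):

    # show the recent health check results recorded by the Docker engine
    docker inspect --format '{{json .State.Health}}' freeipa-server | jq .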

Looking at the package dependencies, freeipa-healthcheck is not being pulled in by freeipa-server, so it is not a standard part of a typical FreeIPA installation (in general, on hosts / VMs). So it might be risky to add it to the container images because we'd be on the hook for supporting its correct operation in the containerized environment, which is likely different in many subtle ways, when the tool does not seem to have gotten the same level of exposure in various scenarios as the FreeIPA components themselves did.

ashleyprimo commented 1 year ago

Have you been running FreeIPA containers with --health-cmd for months and did the health indication ever alert you to a problem in the setup before it affected the clients or other parts of the FreeIPA cluster?

I do not think it's good practice to wait until "clients" complain to act, personally. In my personal setup, I use the health status for two things:

Supporting in controlling a virtual IP (VIP) (this check of the container, amongst others, can help detect the status)

Monitoring/Alerting (again, one of many metrics that could be used)

Though I agree a more comprehensive health check would be great - thus far ipactl status has served me well (but I fully agree with, and am aware of, its flaws) - my proposal was more of an initial effort to get the ball rolling (and share some modifications I have made to improve my own deployments)

adelton commented 1 year ago

I do not think it's good practice to wait until "clients" complain to act, personally.

I did not claim it was. I asked in what cases the ipactl status-based healthcheck prevented that, in your setup.

Supporting in controlling a virtual IP (VIP) (this check of the container, amongst others, can help detect the status)

Can you please elaborate on what you have in mind exactly and how you do that control? Are we talking about plain docker-based deployment or some orchestration setup like Kubernetes? How does it differ from just using docker run's --ip option?

Monitoring/Alerting (again, one of many metrics that could be used)

This is too generic. What situations did you see your setup with the ipactl status healthcheck alert you to?

my proposal was more of an initial effort to get the ball rolling (and share some modifications I have made to improve my own deployments)

I appreciate that and I'm happy to discuss this to potentially come up with something that would be useful as a general mechanism. But to do that, we really need to talk about the specific use-cases.

The problem with adding the HEALTHCHECK to the container image is -- it needs to work well (or at least work not badly) in the many different ways people tend to deploy these containers. If you check the issues in this repo, the variety is really great.

An admin adding a specific check in their own setup when they know how they'd be monitoring it and what actions they'd take is very different from the image providing some "simple" health check, meaning of which would be ... what exactly? I want to make sure we don't add something that would become a support nightmare in the long run, just to have warm fuzzies that we have "some" health check like all the other kids have.

ashleyprimo commented 1 year ago

I did not claim it was. I asked in what cases the ipactl status-based healthcheck prevented that, in your setup.

Yep, sorry, that's my bad - I misread that. To answer your question, yes, it has indeed been one of the factors for successfully invoking a failover of a VIP.

Can you please elaborate on what you have in mind exactly and how you do that control? Are we talking about plain docker-based deployment or some orchestration setup like Kubernetes? How does it differ from just using docker run's --ip option?

Yes, sure. So, my current setup is two Debian nodes running Docker and keepalived (VRRP); on the host machine there is a script executed by keepalived that uses docker inspect to analyse the container's state and reported health (among other factors).

I would add I am effectively using the "HEALTHCHECK" as a mechanism to determine whether traffic should be sent to a certain FreeIPA node, in a similar fashion to how pod "ready" states work on Kubernetes.
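Roughly speaking (simplified, and the container name here is just an example), the check script keepalived invokes boils down to something like:

    #!/bin/bash
    # succeed only while the Docker engine reports the FreeIPA container as healthy;
    # keepalived lowers the VRRP priority and fails the VIP over when this starts failing
    status=$(docker inspect --format '{{.State.Health.Status}}' freeipa-server 2>/dev/null)
    [ "$status" = "healthy" ]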

This is too generic. What situations did you see your setup with the ipactl status healthcheck alert you to?

A specific scenario was when 389 Directory Server was killed by the OOM daemon: the health check triggered an automatic failover of the VIP, and moreover, my monitoring systems alerted me to a failing container, which I was then able to resolve.

To provide context on my collection of the metric - I have a custom Prometheus exporter that exports the health status of all running containers. This is used as an additional metric for tracking faulty systems.

The problem with adding the HEALTHCHECK to the container image is -- it needs to work well (or at least work not badly) in the many different ways people tend to deploy these containers. If you check the issues in this repo, the variety is really great.

Yes, indeed.

An admin adding a specific check in their own setup when they know how they'd be monitoring it and what actions they'd take is very different from the image providing some "simple" health check, meaning of which would be ... what exactly? I want to make sure we don't add something that would become a support nightmare in the long run, just to have warm fuzzies that we have "some" health check like all the other kids have.

Personally, I think the HEALTHCHECK provided by Docker is fairly flexible in how you integrate with it. It can be useful for anything from a quick glanceable status in docker ps -a to a part of a more advanced setup (such as VIPs, monitoring, etc.); for example, one easy way to read it is docker inspect --format='{{json .State.Health}}' container.

I do think it adds value, and for those who want to go a step further, if the check is kept within a script it could easily be overridden without a need to rebuild the entire image (as is currently required to add a healthcheck in this manner; the alternative is docker exec etc., which is less than ideal for tests run on the container itself).

adelton commented 1 year ago

Yes, sure. So, my current setup is two Debian nodes running Docker and keepalived (VRRP); on the host machine there is a script executed by keepalived that uses docker inspect to analyse the container's state and reported health (among other factors).

Wouldn't you actually get more precise control over the situation by just running docker exec into those containers from that keepalived, rather than reading docker inspect? With docker inspect you get delayed information because it is the status that was gathered up to five minutes ago, not the current information.

I also wonder, how did you manage for the keepalived-based setup to terminate the TLS connections (that care about the server name) on different FreeIPA replicas correctly?

I would add I am effectively using the "HEALTHCHECK" as a mechanism to determine whether traffic should be sent to a certain FreeIPA node, in a similar fashion to how pod "ready" states work on Kubernetes.

Got it. But for a typical containerized application, the deployment, creation, and teardown of replicas are typically more "random" than for FreeIPA clusters, where creating a replica is a fairly heavyweight process that you likely don't do on-demand. There might be a valid reason for having something like "read-only" FreeIPA replicas for backup or data-transfer purposes where active operations are not expected, but that would typically be handled on the DNS level, not by exposing those nodes to clients.

A specific scenario was when 389 Directory Server was killed by the OOM daemon: the health check triggered an automatic failover of the VIP, and moreover, my monitoring systems alerted me to a failing container, which I was then able to resolve.

The question is then -- shouldn't we be rather looking for a mechanism where the whole container would exit if its services got killed, so that a restart policy would kick in and run the container anew?
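(The restart-policy half of that already exists in Docker; the open question is how to make the containerized systemd exit when a crucial service dies. For illustration only, with the tag and omitted options being examples:)

    # restart the container automatically whenever it exits with a non-zero status
    docker run -d --restart on-failure quay.io/freeipa/freeipa-server:fedora-39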

I do think it adds value, and for those who want to go a step further, if the check is kept within a script it could easily be overridden without a need to rebuild the entire image (as is currently required to add a healthcheck in this manner; the alternative is docker exec etc., which is less than ideal for tests run on the container itself).

I'm not sure I follow the "if the check is within a script" rationale. You don't need to rebuild the entire image -- you can use docker run --health-cmd on the existing images, or you can just add that one layer of HEALTHCHECK definition on top of the existing published images on Quay and Docker Hub.
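For example (the tag here is just an example, and the run command is abbreviated):

    # Option 1: define the health check at run time on the published image
    # (other freeipa-server options like hostname and data volume omitted)
    docker run -d --name freeipa-server \
      --health-cmd 'ipactl status || exit 1' \
      --health-interval 5m --health-timeout 1m \
      quay.io/freeipa/freeipa-server:fedora-39

    # Option 2: add a single HEALTHCHECK layer on top of the published image
    docker build -t localhost/freeipa-server-healthcheck - <<'EOF'
    FROM quay.io/freeipa/freeipa-server:fedora-39
    HEALTHCHECK --interval=5m --timeout=1m CMD ipactl status || exit 1
    EOF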

As for the docker exec approach -- I consider it a perfectly valid way of getting the information from the container, as noted above.

ashleyprimo commented 1 year ago

Wouldn't you actually get more precise control over the situation by just running docker exec into those containers from that keepalived, rather than reading docker inspect? With docker inspect you get delayed information because it is the status that was gathered up to five minutes ago, not the current information.

So, the interval given was just an arbitrary number on my part; you could ship a lower interval - ultimately, the interval just needs to be less than the execution time; the same is true for keepalived.

In respect of docker exec, it's certainly an option that would work; though it feels like a "bodge" compared to having a real healthcheck.

I also wonder, how did you manage for the keepalived-based setup to terminate the TLS connections (that care about the server name) on different FreeIPA replicas correctly?

I do not use FreeIPA's built-in CA functionality; I issue certificates independently (amongst the specified server names is one shared name for the VIP address).

Got it. But for a typical containerized application, the deployment, creation, and teardown of replicas are typically more "random" than for FreeIPA clusters, where creating a replica is a fairly heavyweight process that you likely don't do on-demand. There might be a valid reason for having something like "read-only" FreeIPA replicas for backup or data-transfer purposes where active operations are not expected, but that would typically be handled on the DNS level, not by exposing those nodes to clients.

That's a good point, but then again nothing about the container applies to everyone; there will always be multiple use cases - for those who do not use a healthcheck, what they do will ultimately be unaffected by the presence of one (or the option to have one).

The question is then -- shouldn't we be rather looking for a mechanism where the whole container would exit if its services got killed, so that a restart policy would kick in and run the container anew?

That would be a great feature idea (I would happily use it), but only covers a single possible issue; when I create/use health checks, I look to cover as many bases as possible.

I'm not sure I follow the "if the check is within a script" rationale. You don't need to rebuild the entire image -- you can use docker run --health-cmd on the existing images, or you can just add that one layer of HEALTHCHECK definition on top of the existing published images on Quay and Docker Hub.

That was not something I was aware of, to be honest. Useful to know! :)

adelton commented 1 year ago

So, the interval given was just an arbitrary number on my part; you could ship a lower interval - ultimately, the interval just needs to be less than the execution time; the same is true for keepalived. That's a good point, but then again nothing about the container applies to everyone; there will always be multiple use cases - for those who do not use a healthcheck, what they do will ultimately be unaffected by the presence of one (or the option to have one).

Well, this is not true -- the healthcheck consumes resources (especially if we ran ipa-healthcheck), and the shorter the interval, the more CPU you'd be burning with no benefit for the use-cases that will not consume the results. This is why I'm very cautious about enabling something in the image ... because it will affect everyone, so the defaults and the default behaviour matter.

x3nb63 commented 1 year ago

I have had this in use for quite some time now (Kubernetes YAML):

           readinessProbe:
             exec:
               command: ["/usr/bin/systemctl", "is-active", "--quiet", "ipa"]
             initialDelaySeconds: 60
             timeoutSeconds: 10
             periodSeconds: 30
             successThreshold: 1
             failureThreshold: 3
           livenessProbe:
             initialDelaySeconds: 600
             periodSeconds: 30
             httpGet:
               path: /
               port: 80
x3nb63 commented 1 year ago

This command

           readinessProbe:
             exec:
               command: ["/usr/bin/systemctl", "is-active", "--quiet", "ipa"]
...

is actually a serious problem as a healthcheck because it will return false while a replica is being enrolled and/or decides to upgrade itself, in which case

a) Kubernetes may kill the pod right in the middle

and/or

b) (initial) LDAP replication fails because the Service object matching this instance won't pass traffic to the container

Case b) is a really tough one because the log of the about-to-get-installed replica will say something about an authentication timeout with its name:port 389 ... but what it's actually telling you is that the server from which replication takes place is unable to connect and send data to the new replica on port 389. That is so because the Service object does not pass traffic, because the probe tells K8s the pod is not ready ...

To get around this I had to use command: ["/bin/true"] for the probe while IPA performed its enrollment and/or migrations.
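i.e. temporarily switching the probe to something like:

           readinessProbe:
             exec:
               command: ["/bin/true"]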

A healthcheck feature would need to have this smartness, somehow.

adelton commented 1 year ago

I haven't seen the problem you report with readinessProbe. Until the pod is ready for the first time, no killing of the pod happens even if I change the settings as low as initialDelaySeconds: 10 and failureThreshold: 1.

As for the Service not having the endpoint IPs populated until the pod is ready, that's a chicken'n'egg problem which prevents the connection check and the initial replication and thus getting the replica ready. Using publishNotReadyAddresses: true worked for me and I've now pushed https://github.com/freeipa/freeipa-container/commit/0b245157f86ab315fd15c2f61fa57def1f0b9847 to master to have this setup with master and replica in Kubernetes (https://github.com/freeipa/freeipa-container/blob/master/tests/freeipa-replica-k3s.yaml) tested in the GitHub Actions CI runs.
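The relevant part of the Service manifest is just that one field; a trimmed-down illustration (the name, selector, and ports here are only placeholders, the tested manifest is the one linked above):

    apiVersion: v1
    kind: Service
    metadata:
      name: freeipa-replica
    spec:
      publishNotReadyAddresses: true   # populate endpoints even before the pod reports Ready
      selector:
        app: freeipa-replica
      ports:
        - name: ldap
          port: 389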

adelton commented 1 year ago

I've now added (free)ipa-healthcheck packages to the images based on latest versions of OSes to master (https://github.com/freeipa/freeipa-container/commit/3502643b8afde2e2d2e3b210645435ff2829b5bc). So it should be easy to use that in --health-cmd or readinessProbe/livenessProbe/startupProbe configurations without rebuilding the images.
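For example, a run-time check using it could look like this (the tag is just an example, other freeipa-server options omitted):

    docker run -d --name freeipa-server \
      --health-cmd '/usr/bin/ipa-healthcheck || exit 1' \
      --health-interval 5m --health-timeout 1m \
      quay.io/freeipa/freeipa-server:fedora-39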

I currently don't plan to "bake" the HEALTHCHECK into the image itself for the reasons outlined above -- I believe this should be part of the decisions that the admin has to make, depending on their specific environment and deployment setup, pipeline, and monitoring.