Possibility to add a healtcheck

amir20 / dozzle

Realtime log viewer for docker containers.

https://dozzle.dev/

MIT License

Possibility to add a healtcheck #1810

Closed MetalArend closed 2 years ago

MetalArend commented 2 years ago

Is your feature request related to a problem? Please describe.

I'm trying to reliably start up a Dozzle container, and check if the service is running as expected. If I do that with for example a DOZZLE_FILTER environment variable with a filter that is not acceptable, the container will be starting, will become running, but will then fail a second (or so) later. It makes checking for the container unreliable right after it starts. It quickly reports to be running, only to still fail right after. But adding sleep 5 in a script feels like an icky way to go about this.

As has been mentioned in the Dozzle issue tracker before, Docker has a solution to this: adding a healthcheck. I have seen the reasoning for not adding the HEALTHCHECK directly in the Dockerfile, and the choice to stay away from things that have to be maintained without being able to test them. So I wanted to check if I can add something myself. It turns out you are doing the great thing of using multi-stage builds (awesome!), which sadly also means there is no way for me to add my own basic healthcheck to the default image, because there is no other possible entrypoint in there than the /dozzle binary. I could create my own image, copy the /dozzle binary and the few other files into it, and add curl, but I'm never happy to extend good containers, because that relies on internal implementation details that might change (for good reasons) at any time. So before I go that route, could I pick your brain to resolve it in the Dozzle codebase itself?

Describe the solution you'd like AND alternatives you've considered

Would it be possible to have the /dozzle binary have some other endpoints or flags?

/dozzle --healthcheck "green": check if the Dozzle health is green
/dozzle --status "green": check if the Dozzle status reports as green
/dozzle --stats "containers,filter" --stats-status "containers<10": check if the Dozzle reports less than 10 containers to be running

WDYT? Is there maybe a very easy solution to this in the Dozzle codebase? Some entrypoint that is already being used in the tests, that could be easily promoted to production ready? Or some entrypoint that would make the tests easier as well? And would you consider spending time on it? ;)

amir20 commented 2 years ago

Hi @MetalArend,

Overall, I am +1 for adding healthcheck. So all good there. Let's figure out a use case first.

So what is your use case? As far as I know, health checks only work in swarm mode. But I don't support swarm mode completely. How are you using Dozzle and what is your setup?

I'm trying to reliably start up a Dozzle container, and check if the service is running as expected. If I do that with for example a DOZZLE_FILTER environment variable with a filter that is not acceptable, the container will be starting, will become running, but will then fail a second (or so) later

I would argue that checking for filter validity is not the right use case for health checks. Because let's say at a later point, some of the other containers fail and now Dozzle is in a bad state.

It sounds like the right approach is to check the format of filters and reject unacceptable ones.

re: using scratch with healthcheck

You are right. I did a lot of research and scratch with a curl just didn't make sense. But there could be other ideas.

dozzle --healthcheck "green"

What would this do? I believe when checking with HEALTHCHECK Docker runs a separate process. So in this case, it would have to be an http GET call to check the main process. Is this what you were thinking?

dozzle --status "green" check if the Dozzle status reports as green

I am not sure what this is and how is it any different than above? I try to keep Dozzle as simple as possible. In this case the caller can just check the output of --healthcheck right?

dozzle --stats "containers,filter" --stats-status "containers<10"

As mentioned earlier, I don't think Dozzle should fail if less containers are visible. What if containers are in the process of failing and you need to debug with Dozzle?

Based on some research, it sounds like if you really do want a HEALTHCHECK then the best solution would be to create a simple subcommand like dozzle healthcheck which would make an http call to localhost:port/healthcheck and check the status of the main process. But that would only check that the server is running, that docker is running, and so on.

PS. Thank you for the excellent write up on your issue. Makes collaborating easier!

MetalArend commented 2 years ago

So what is your use case? As far as I know, health checks only work in swarm mode. But I don't support swarm mode completely. How are you using Dozzle and what is your setup?

Currently we are happily using Dozzle as a standalone Docker container, so that people without much cli experience can quickly see what errors are happening in the container.

The tooling around it is basically docker commands (very simplified). We start dozzle, and then wait till the container is running.

docker container run --name="dozzle" ...
until test -n "$(docker container ls --quiet --filter "status=running" --filter "name=^dozzle$")"; do sleep 1; done

The reason a healthcheck would help is that there is another filter for the list command too: --filter "health=healthy", which can be checked when one provides the docker container run --health* parameters. The healthcheck can wait for a certain time, fail x times before really failing, and so on. So the wait would only take as long as Docker needs to figure out whether the container is ready.

Because let's say at a later point, some of the other containers fail and now Dozzle is in a bad state.

I'm having trouble keeping up here. Are there some Dozzle internals that would fail if for example there were not enough running containers? The problem with the filter is that it only fails once it starts talking to Docker to retrieve the list, and Docker basically says "no can do". Basically Dozzle starts up, reports running, then still crashes a few seconds later on the Docker error. I have been looking for a clean solution, and gotta be honest, I'm not even sure if I'm on the clean track right now. I could also see it as perfectly acceptable to have dozzle not crash on a bad filter, but simply start, and show an error on the dashboard that docker did not accept the provided filter. But then what? Show no containers? Show all containers? The last one could be a privacy issue.

I am not sure what this is and how is it any different than above.

You're right, the three variants (healthcheck, status, stats) are all different ways of doing the same basic thing: providing that extra entrypoint, and giving feedback on the health. I was probably just trying to provide some different setups to try to convince you somehow :p

As mentioned earlier, I don't think Dozzle should fail if less containers are visible. What if containers are in the process of failing and you need to debug with Dozzle?

You're completely right :)

the best solution would be to create a simple subcommand like dozzle healthcheck which would make an http call to localhost:port/healthcheck and check the status of the main process.

Yes! A /dozzle healthcheck would indeed be awesome, that internal http call sounds perfect, and it could report on server, docker running, if the docker daemon accepts the filtering, if the coffee is ready, anything you could ever think of in the future... Possibly with some extra flags to only report on a specific thing by returning an exit code: /dozzle healthcheck --server --docker, but only because there's no way in the scratch container to use for example jq or shell to filter the response from the healthcheck.

So yeah, at least my troubles would be gone with the following: /dozzle healthcheck --server --docker-reachable --docker-filter-valid, which returns an exit code... Or something like that.

Btw, all the :heart: for Dozzle, I saw a huge improvement in our people actually checking container logs when something goes wrong. Thanks!

amir20 commented 2 years ago

Let's go a little deeper in your use case.

I wonder if we are talking about different things. I am talking about the HEALTHCHECK command docker provides. Docker checks the CMD configured for healthcheck every 30s (by default) and will report its status. If it fails, then it will restart the container.

What I don't understand is how a healthcheck would help in your case. Say you set up Dozzle with --filter "label=foo" and at some point, the container with label=foo dies. Wouldn't Dozzle fail too?

The problem with the filter is that it only fails once it starts talking to Docker to retrieve the list, and Docker basically says "no can do".

If this is your real problem, then let me see what happens if I pass the filter to Docker sooner. I don't think you want a healthcheck at all, you just want to fail fast if filters are incorrect. Correct?

Let me know if this makes sense. I am going to try some bad filters on my side to see how Dozzle behaves.

amir20 commented 2 years ago

@MetalArend while testing, Dozzle already errors out if the filter is not recognized.

❯ ./dozzle --filter "random=bar"
INFO Dozzle version head
FATAL Could not connect to Docker Engine: Error response from daemon: Invalid filter 'random'

That means it should already fail fast.

MetalArend commented 2 years ago

We are speaking of the same thing, I do mean the HEALTHCHECK that one can put in the Dockerfile itself (but then it would be on by default, which might have its drawbacks), or by passing the --health-cmd flag to the docker container run command (or in a compose.yaml file for that matter, or a Swarm config). But healthchecks can be used not only by Swarm, but also by Compose or Docker itself.

The thing is that for the check on the provided filter to run in the Dozzle container, the container always has to be running, to be able to connect to the mounted docker socket. So failing fast will always happen later than Docker reporting the container running.

With a healthcheck however, it will try the /dozzle healthcheck endpoint a certain amount of times before reporting the container to be ready. First it will not be able to get to the healthcheck endpoint => reports not healthy. Then the container will be running, but the healthcheck will report that docker might not be ready => reports not healthy. Then it will be running, but the filter will not be okay => reports not healthy. Wrong filters (like the random=bar you use here) make the container die. At no point in time will the container healthcheck tell Docker that the container is healthy, and then it will fail, so it becomes an exited container. So there is no moment in time with a false positive that the container appeared to be ready.

The difference is in the check I can do after starting the dozzle container with the --health-cmd flag:

until test -n "$(docker container ls --quiet --filter "status=running" --filter "name=^dozzle$" 2>/dev/null)"; do sleep 1; done

until test -n "$(docker container ls --quiet --filter "status=running" --filter "health=healthy" --filter "name=^dozzle$" 2>/dev/null)"; do sleep 1; done

The first one considers the container to be ready, before the provided filter has been checked. The second one will wait until the provided filter has actually been checked (given the dozzle healthcheck endpoint checks on that ;) ).
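Both loops above spin forever if their condition never becomes true, for example when the container has already died. A minimal sketch of a bounded wait follows; `wait_until` and `is_healthy` are hypothetical helper names used only for illustration, and the health filter is only meaningful if the container was started with a healthcheck.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds or the timeout
# expires. Returns 0 on success, 1 on timeout. (Hypothetical helper.)
wait_until() {
    timeout="$1"
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Succeeds when a running, healthy container with exactly this name is listed.
is_healthy() {
    test -n "$(docker container ls --quiet \
        --filter "status=running" --filter "health=healthy" \
        --filter "name=^$1\$" 2>/dev/null)"
}

# Usage (assumes the container was started with --health-cmd):
#   wait_until 30 is_healthy dozzle || echo "dozzle did not become healthy" >&2
```

The timeout turns "never healthy" into an explicit failure that the calling script can report, instead of a hang.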

amir20 commented 2 years ago

Thanks so far. I am embarrassed to say I still don't get it. 😭

We are speaking of the same thing, I do mean the HEALTHCHECK that one can put in the Dockerfile itself (but then it would be on by default, which might have its drawbacks), or by passing the --health-cmd flag to the docker container run command (or in a compose.yaml file for that matter, or a Swarm config). But healthchecks can be used not only by Swarm, but also by Compose or Docker itself.

Right. So it does mean whatever the default healthcheck is, it needs to work for all, unless explicitly overridden to do nothing.

The thing is that for the check on the provided filter to run in the Dozzle container, the container always has to be running, to be able to connect to the mounted docker socket. So failing fast will always happen later than Docker reporting the container running.

I don't get this. Can you explain more with an example? What is container? Is Docker not running yet?

With a healthcheck however, it will try the /dozzle healthcheck endpoint a certain amount of times before reporting the container to be ready.

What has changed? How can the healthcheck report docker isn't ready? If docker isn't ready then nothing is running, including healthchecks.

Then it will be running, but the filter will not be okay => reports not healthy. Wrong filters (like the random=bar you use here) make the container die

This is the part where I got confused. There are two states: 1) Filter is valid but no containers satisfy that rule, and 2) The filter is invalid.

1. If there are no containers that match the filter, then that's still GREEN status. This is my point earlier too. Dozzle is working if there are no other containers running. That should be the case.
2. If the filter is invalid then Dozzle should fail fast. Which it does according to my test.

until test -n "$(docker container ls --quiet --filter "status=running" --filter "health=healthy" --filter "name=^dozzle$" 2>/dev/null)"; do sleep 1; done

This is the part that helps with the ambiguity. However, the filter not matching anything is not an error, as explained in 1. So are you saying if the filter is invalid? But that should never happen.

Maybe help me with examples to see where the disconnect is. Sorry it's taking a while to understand your use case.

MetalArend commented 2 years ago

So it does mean whatever the default healthcheck is, it needs to work for all, unless explicitly overridden to do nothing.

There is no real need to turn on the HEALTHCHECK in the Dockerfile, imo. But it would be helpful to have the /dozzle healthcheck entrypoint on the binary, because it's the only thing in the image that can be used to turn on a healthcheck if anyone wants to turn it on. If you put it in the Dockerfile with a HEALTHCHECK like this: https://docs.docker.com/engine/reference/builder/#healthcheck, it's being forced on everyone. But if you provide the binary with an endpoint that works, we can add it to the docker run command, to a compose.yaml file, to a swarm setup, only if we want or need it. It makes things opt-in, not forced, as a healthcheck has consequences: for example, docker-compose will not mark a container that is not healthy as usable, traefik will not connect to it, and so on. Which is exactly what one might want sometimes, in a more strict setup, but I could see not everyone being happy with having it always on.
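The opt-in versus always-on distinction can be sketched with `docker run` flags. This is only an illustration: the function names are hypothetical, the `/dozzle healthcheck` subcommand is the one proposed in this thread (assumed to exist in the image), and the interval and retry values are arbitrary.

```shell
#!/bin/sh
# Opt-in: the image defines no HEALTHCHECK; the caller adds one at run time.
# "/dozzle healthcheck" is the subcommand proposed in this thread (assumed).
run_with_healthcheck() {
    docker run --detach --name "$1" \
        --volume /var/run/docker.sock:/var/run/docker.sock:ro \
        --health-cmd "/dozzle healthcheck" \
        --health-interval 1s --health-start-period 1s --health-retries 10 \
        amir20/dozzle
}

# Opt-out: the image ships a HEALTHCHECK; the caller disables it at run time.
run_without_healthcheck() {
    docker run --detach --name "$1" \
        --volume /var/run/docker.sock:/var/run/docker.sock:ro \
        --no-healthcheck \
        amir20/dozzle
}
```

With the first form nobody gets a healthcheck unless they ask for it; with the second, everybody gets one unless they pass `--no-healthcheck`.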

The thing is that for the check on the provided filter to run in the Dozzle container, the container always has to be running, to be able to connect to the mounted docker socket. So failing fast will always happen later than Docker reporting the container running.

I'll try to be more clear :crossed_fingers: There is a check inside the Dozzle image that makes the container (= the instance of the image that is starting up) go dead once the Docker daemon returns an error saying that the provided DOZZLE_FILTER is not a working filter. For example with a typo on "nam=prefix-of-project" instead of "name=prefix-of-project". The thing is that according to Docker, the container always will get to a "running" state before it even starts to run Dozzle inside it, so the container will be marked as running, until the Dozzle binary kicks in, tries to connect to the Docker Daemon, gets the error back that the filter is not valid, and kills the Dozzle process, which lets the Dozzle container die.

What has changed? How can the healthcheck report docker isn't ready? If docker isn't ready then nothing is running, including healthchecks.

With the healthcheck added to a docker run command, you can poll Docker for whether the container is ready in a more explicit way than only "is the container running". A container without a healthcheck can only be checked on its state (created, running, paused, exited, and so on), but a container with a healthcheck adds the health filter (none, starting, healthy, unhealthy) based on the healthcheck command it is running. The "docker isn't ready" case was a bad example. Sorry about that.
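The difference between the lifecycle state and the health status can be read directly with `docker inspect`. The sketch below uses hypothetical helper names; the decision to treat a running container without any healthcheck as "not ready" is an assumption made for this use case, not Docker behaviour.

```shell
#!/bin/sh
# Lifecycle state as Docker reports it (for example created, running, exited).
container_status() {
    docker inspect --format '{{.State.Status}}' "$1" 2>/dev/null
}

# Health status; prints "none" when the container has no healthcheck.
container_health() {
    docker inspect \
        --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' \
        "$1" 2>/dev/null
}

# is_ready STATUS HEALTH
# Succeeds only for a running container whose healthcheck has passed.
# "running" with "none" or "starting" is deliberately treated as not ready.
is_ready() {
    [ "$1" = "running" ] && [ "$2" = "healthy" ]
}

# Usage:
#   is_ready "$(container_status dozzle)" "$(container_health dozzle)"
```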

This is the part where I got confused. There are two states: 1) Filter is valid but no containers satisfy that rule, and 2) The filter is invalid.

1 is of course not a problem: a strange filter is still going to give you an empty dashboard and a running container. 2 is the problem I'm looking to solve, as I want to make the DOZZLE_FILTER a setting that less technical people can change.

So are you saying if the filter is invalid.

Yes, I am. :)

If the filter is invalid then Dozzle should fail fast. Which it does https://github.com/amir20/dozzle/issues/1810#issuecomment-1187690093.

Currently the container fails fast - but it is still reporting a success state (= running) for a brief moment even when a second later it is gone. With the healthcheck the container fails depending on the more insightful healthcheck settings, which never gets the container into a false-positive state.

Take a mysql container for example. You can check for the mysql running state, but if you only wait for that, it will report as "ready" before the initial database setup has succeeded. Because the mysql container should be running, to give the database a chance to run the setup script. So the container will always be in a running state, before it is ever ready to receive connections. If you however add a healthcheck to see if the mysql process is running, and/or the needed tables are present, and/or the configured port is responding, the container can be checked upon until it is really completely ready, and only then we start to send queries to it.

The use case is:

1. Create a setup, with a Dozzle container, based on some environment variables from for example a .env file, including the DOZZLE_FILTER variable
2. Start the setup
3. Wait for the Dozzle container to be ready - check this with hopefully only a dependency on Docker, as that is the one tool they always need to have, and avoid using curl or wget, as it adds another needed dependency on their machine, only because there's no entrypoint in the Docker container to use
4. Once Dozzle is ready, redirect them to the dashboard in their browser

amir20 commented 2 years ago

🎉 Now I understand what your problem is. It's when the container is still starting up and Docker thinks it's healthy.

re: Adding HEALTHCHECK to all

In the perfect world, I'd add this to all.

according to Docker, the container always will get to a "running" state before it even starts to run Dozzle inside it, so the container will be marked as running, until the Dozzle binary kicks in, tries to connect to the Docker Daemon, gets the error back that the filter is not valid, and kills the Dozzle process, which lets the Dozzle container die.

Ok so now I understand. I hadn't thought of this use case to be honest, because I thought I had made the startup time so fast that this should almost never really happen.

MySQL has the same issue, as you pointed out. I guess that's why some people came up with wait-until-up.sh to solve exactly this issue.

Some thoughts:

1. Are you actually able to see the container in green while waiting for Dozzle to start? I would assume this is very minimal. But if you can reproduce it then that's good insight.
2. Do you have a lot of containers? I am trying to figure out the scale to be able to reproduce if needed.
3. The lazy approach would be just to wait 5 seconds and then test for up time right? Not saying healthcheck isn't a good idea, just seeing if even that doesn't work.

Ideas on implementation:

1. I could write a file to disk when Dozzle has started and dozzle healthcheck can just check for that file to exist.
2. The better way would be to create a /healthcheck endpoint as discussed in previous comments, because that would be a true healthcheck.

Now that I understand, let me try to create a prototype. But see if you can answer my questions above regarding how often you can actually reproduce this. It would be easier to test for me.

MetalArend commented 2 years ago

Thought 1. Are you actually able to see the container in green while waiting for Dozzle to start? I would assume this is very minimal. But if you can reproduce it then that's good insight.

I'm not sure what you mean by seeing the container in green. I'm seeing it as reported as "running" for a quick moment, consistently. It's not when Docker is slow, or when I have a few things running, it is indeed every single time. So even when it fails fast (kudos for that, Dozzle startup is amazingly fast), the until loop is faster :D

Thought 2. Do you have a lot of containers? I am trying to figure out the scale to be able to reproduce if needed.

I experience it the same way when running four projects with 37 containers, or only 1 single Dozzle container. To reproduce, the script I'll add below never fails to reproduce for me. Hope it works for you too.

Thought 3. The lazy approach would be just to wait 5 seconds and then test for up time right? Not saying healthcheck isn't a good idea, just seeing if even that doesn't work.

Yes, indeed, that would be the lazy approach. Maybe even lower it down to 2 seconds, and blame the exceptions on a chaos monkey. :p That would indeed be possible :)

Idea 1. I could write a file to disk when Dozzle has started and dozzle healthcheck can just check for that file to exist.

Totally. Sounds like a totally acceptable and quick solution to me. The only counterpoint I can think of is that with such a solution that file suddenly becomes public knowledge, and someone will add a volume with that directory, xkcd style: https://xkcd.com/1172/ But practically a totally sound solution, agreed.

Idea 2. The better way would be to create a /healthcheck endpoint as discussed in previous comments, because that would be a true healthcheck.

Indeed. In the most basic form "/dozzle healthcheck" reports green when Dozzle frontend can be visited without errors, which would probably be after any and all internal checks.

In the perfect world, I'd add this to all.

Yes. But I would still strongly discourage it.

Now, as promised, two scripts. Paste each one into a *.sh file and run it.

Script for the current setup:

docker container rm --force dozzle-test || true
docker run --rm --detach --name dozzle-test --volume=/var/run/docker.sock:/var/run/docker.sock:ro --env "DOZZLE_FILTER=wrongfilter" amir20/dozzle
until test -n "$(docker container ls --quiet --filter "status=running" --filter "name=^dozzle-test$" 2>/dev/null)"; do
    sleep 1;
done
echo "Dozzle was seen as running by the script, which got us here, assuming all is fine:"
docker container ls --filter "name=^dozzle-test$"
sleep 2
echo "Script expects Dozzle to be still here, but it is gone by now:"
docker container ls --filter "name=^dozzle-test$"

Script for the healthcheck setup (here be dragons, untested):

docker container rm --force dozzle-test || true
docker run --rm --detach --name dozzle-test --volume=/var/run/docker.sock:/var/run/docker.sock:ro --env "DOZZLE_FILTER=wrongfilter" --health-cmd "/dozzle healthcheck" --health-retries 10 --health-timeout 2s --health-start-period 1s --health-interval 1s amir20/dozzle
until test -n "$(docker container ls --quiet --filter "status=running" --filter "health=healthy" --filter "name=^dozzle-test$" 2>/dev/null)"; do
    sleep 1;
done
echo "Dozzle was seen as running by the script, which got us here, assuming all is fine:"
docker container ls --filter "name=^dozzle-test$"
sleep 2
echo "Script expects Dozzle to be still here, and it is, because the healthcheck only gave the green light after all the checks:"
docker container ls --filter "name=^dozzle-test$"
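One caveat with the healthcheck script as written: with a wrong filter the container exits and, because of --rm, is removed, so the until loop never sees a healthy container and never ends. A minimal sketch of a wait that also stops when the container is gone follows; `wait_healthy_or_gone` is a hypothetical helper name, and it assumes the container has no restart policy (a restarting container would be reported as gone).

```shell
#!/bin/sh
# wait_healthy_or_gone NAME
# Returns 0 once NAME is running and healthy.
# Returns 1 as soon as NAME is no longer running, for example because
# Dozzle rejected the filter and the container exited.
wait_healthy_or_gone() {
    name="$1"
    while :; do
        healthy=$(docker container ls --quiet --filter "status=running" \
            --filter "health=healthy" --filter "name=^${name}\$" 2>/dev/null)
        if [ -n "$healthy" ]; then
            return 0
        fi
        running=$(docker container ls --quiet --filter "status=running" \
            --filter "name=^${name}\$" 2>/dev/null)
        if [ -z "$running" ]; then
            return 1
        fi
        sleep 1
    done
}

# Usage:
#   wait_healthy_or_gone dozzle-test || echo "Dozzle failed to start" >&2
```

Checking "healthy" before "running" keeps the success path to a single Docker call per poll.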

amir20 commented 2 years ago

Great script! I'll update you when I have something.

amir20 commented 2 years ago

Alright, try amir20/dozzle:pr-1814. I have a simple pull request you can see in #1814 which enables the healthcheck. Using your script above does seem to be working.

MetalArend commented 2 years ago

Just tried it, and everything seems to be working fine. Smells like success!

And with overwriting the start-period to 1s, and interval to 1s, Dozzle is ready super fast :)

MetalArend commented 2 years ago

Just went to check if --health-retries has some default, but it seems not to have one. I'm not quite sure if this means it will mark unhealthy the first time the healthcheck fails. So having a 10s healthcheck start period seems to help with that. But I'm not sure if having a container that is quite fast go by default for a 10 second delay is good for the everyday user of dozzle, it seems to be quite long. WDYT?

MetalArend commented 2 years ago

It's up to you, but I would not add the HEALTHCHECK in the Dockerfile, only add the --health options in the documentation as an extra option to run dozzle with: https://docs.docker.com/engine/reference/run/#healthcheck But it's the last time I'm mentioning this, I promise.

Thanks for the time and effort to add this so quickly!

amir20 commented 2 years ago

And with overwriting the start-period to 1s, and interval to 1s, Dozzle is ready super fast :)

Nice. Maybe I'll try that.

Just went to check if --health-retries has some default

According to the docs, it is --retries=N with default to 3.

I wish Docker had a setting to delay the first check until after one second. Then it could be very fast and keep the interval at something sensible. But it does not. For now, I think 2s retries is probably good enough. This is where k8s shines. Many more options for readiness.

only add the --health options in the documentation as an extra option to run dozzle

My principle has been to make Dozzle work for most out of the box. That means sensible defaults. For this feature, I couldn't think of any use cases that would break. Can you? If so, then I will disable it. But if done right, then I think the default healthcheck should be transparent to all. There is currently a bug where if the port is configured then healthcheck doesn't know. If I can't fix that, I'll have to disable it anyway.

Thanks for the time and effort to add this so quickly!

You're welcome. And thanks for being patient. Your use case using Dozzle actually really excites me. That was the original thought I had when I created Dozzle. I wanted engineering teams to be able to debug faster.

amir20 commented 2 years ago

In the process of releasing a build. If no bugs are opened in the next few days then I think healthcheck can work for all.

MetalArend commented 2 years ago

According to the docs, it is --retries=N with default to 3.

That is the name when you add the HEALTHCHECK in your Dockerfile. It is --health-retries if you use it as a flag for docker run. They are the same, but one is always on until disabled by overriding it, and the other is added on demand.

I wish Docker had a setting to delay the first check until after one second.

Isn't that what --start-period (for HEALTHCHECK) and --health-start-period (for run) are?

For this feature, I couldn't think of any use cases that would break.

The container will not be reachable for any other container in a docker-compose, swarm or network setting, as long as it is not healthy. With a built-in HEALTHCHECK instruction, it might mean that tools that could previously reach it for debugging no longer can. With the start-period at 10 seconds, it would also mean that the container is for example not reachable over traefik for that amount of time. A second possible negative is that a Dozzle container that runs without activity is not doing anything, but with a healthcheck every 2s, the Dozzle container is woken up by a heartbeat. It would be so nice if Docker would add the possibility to say "stop the checking after 60s or a first healthy health status". These are all not very big issues, just like my original issue, but the middle ground imo would be to add the healthcheck endpoint and document it, without enabling it by default. But then again, if nobody complains in the near future, it might all be fine, and me being too cautious :p

Thanks for the quick fix! And thanks for the awesome positive messaging here. It was a joy :)

amir20 commented 2 years ago

--start-period is not a delay. It took me a while to get what it was. The doc says "start period provides initialization time for containers that need time to bootstrap. Probe failure during that period will not be counted towards the maximum number of retries." which means that an unhealthy status during this time will not kill the container. It will only do so after the period has ended.
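The start-period rule quoted above can be illustrated with a small simulation. This is a sketch, not Docker's implementation: `simulate_health` and the `SECONDS:ok` / `SECONDS:fail` probe notation are made up for this example, and counting a failure that lands exactly on the start-period boundary is an assumption. It also applies the rule from the same Docker documentation that a success during the start period marks the container as started, after which failures count.

```shell
#!/bin/sh
# simulate_health START_PERIOD RETRIES PROBE...
# Each PROBE is "SECONDS:ok" or "SECONDS:fail", in chronological order.
# Prints the resulting health status: starting, healthy or unhealthy.
simulate_health() {
    start_period="$1"
    retries="$2"
    shift 2
    status="starting"
    failures=0
    started=0   # becomes 1 after the first successful probe
    for probe in "$@"; do
        at="${probe%%:*}"
        result="${probe#*:}"
        if [ "$result" = "ok" ]; then
            # Any success makes the container healthy and resets the count.
            status="healthy"
            failures=0
            started=1
        elif [ "$started" -eq 1 ] || [ "$at" -ge "$start_period" ]; then
            # Failures only count once started or after the start period.
            failures=$((failures + 1))
            if [ "$failures" -ge "$retries" ]; then
                status="unhealthy"
            fi
        fi
    done
    echo "$status"
}

# simulate_health 10 3 1:fail 2:ok            prints "healthy"
# simulate_health 10 3 1:fail 2:fail 3:fail   prints "starting"
# simulate_health 0 3 1:fail 2:fail 3:fail    prints "unhealthy"
```

The first example shows why the start period is not a delay: a probe that succeeds after two seconds makes the container healthy right away, even with a 10 second start period.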

You are right about the heart beat. I have been running it for a day now so at least no memory issues. But if it comes to Dozzle not doing anything while it's not being used then it should be disabled by default.

3.12.9 is published now.