SensorsIot / IOTstack

Docker stack for getting started on IoT on the Raspberry Pi
GNU General Public License v3.0
1.42k stars · 303 forks

Mosquitto Container Health Check Agent : How to Debug ? #731

Closed — Noschvie closed this issue 8 months ago

Noschvie commented 9 months ago

Hello, I can't get the Mosquitto container health-check agent running (container-health-check).

The container itself is up, running and reports healthy: mosquitto Up 7 minutes (healthy)

But how can I get the health check agent to be used by Uptime-Kuma?

mosquitto_sub -v -h localhost -p 1883 -t "iotstack/mosquitto/healthcheck" -F "%I %t %p" doesn't produce any output. Tests were done without providing credentials. Any ideas? Thanks!

Paraphraser commented 9 months ago

See if this helps.

Noschvie commented 8 months ago

Checked the wiki, but no success. Did an installation from scratch: RPi 4, Raspberry Pi OS with desktop, 64-bit, using PiBuilder (works perfectly, thanks), then IOTstack with Portainer, Mosquitto and Node-RED. A simple Node-RED flow publishing and subscribing to "iotstack/mosquitto/node-red" works, but subscribing to "iotstack/mosquitto/healthcheck" gets no response.

The script iotstack_healthcheck.sh seems not to be started / running. However, invoking it manually with docker exec mosquitto iotstack_healthcheck.sh does produce a result in Node-RED:

9.10.2023, 01:38:23 [node: iotstack/mosquitto/healthcheck] iotstack/mosquitto/healthcheck : msg.payload : string[28] "Sun Oct 8 23:38:23 UTC 2023"

Paraphraser commented 8 months ago

tl;dr

I think it's working "as advertised".

I'm about to demonstrate two options:

  1. Disable the health-check. That's quick and dirty but it will get you past the problem.
  2. Demonstrate "as advertised". If you can't follow along then I think it probably means that whatever is causing the health-check to misbehave is some peculiarity of your system.

baseline (just as it comes from the IOTstack template)

---

version: '3.6'

networks:
  default:
    driver: bridge
    ipam:
      driver: default

services:
  mosquitto:
    container_name: mosquitto
    build:
      context: ./.templates/mosquitto/.
      args:
        - MOSQUITTO_BASE=eclipse-mosquitto:latest
    restart: unless-stopped
    environment:
      - TZ=${TZ:-Etc/UTC}
    ports:
      - "1883:1883"
    volumes:
      - ./volumes/mosquitto/config:/mosquitto/config
      - ./volumes/mosquitto/data:/mosquitto/data
      - ./volumes/mosquitto/log:/mosquitto/log
      - ./volumes/mosquitto/pwfile:/mosquitto/pwfile

Tests:

  1. Start the container:

    $ UP
    [+] Running 2/2
     ✔ Network iotstack_default  Created                                                     0.2s 
     ✔ Container mosquitto       Started                                                     0.1s 
  2. Give health check time to go "healthy" and then report status:

    $ sleep 30 ; DPS
    NAMES       CREATED          STATUS                    SIZE
    mosquitto   36 seconds ago   Up 34 seconds (healthy)   0B (virtual 19.1MB)
  3. Prove the container will pass messages bi-directionally:

    $ mosquitto_sub -v -h 127.0.0.1 -t "hello" -F "%I %t %p" -C 1 &
    [1] 805891
    $ mosquitto_pub -h 127.0.0.1 -t "hello" -m "test $(date)"
    2023-10-09T13:04:01+1100 hello test Mon 09 Oct 2023 01:04:01 PM AEDT
    [1]+  Done                    mosquitto_sub -v -h 127.0.0.1 -t "hello" -F "%I %t %p" -C 1

    In other words, there is no security. That's the default.

  4. Terminate the container:

    $ DOWN
    [+] Running 2/2
     ✔ Container mosquitto       Removed                                                     0.4s 
     ✔ Network iotstack_default  Removed                                                     0.3s 
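
An aside for anyone following along without PiBuilder: UP, DPS and DOWN are convenience shortcuts. Rough shell equivalents (my approximations for illustration, not necessarily the exact PiBuilder definitions) would be:

```shell
# Approximate stand-ins for the convenience commands used in this thread
# (illustrative only - the real PiBuilder definitions may differ):
UP()   { docker-compose up -d "$@" ; }
DOWN() { docker-compose down "$@" ; }
DPS()  { docker ps --format 'table {{.Names}}\t{{.CreatedAt}}\t{{.Status}}\t{{.Size}}' ; }
```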

option 1 - disable health-check

Add these lines to the mosquitto service definition:

    healthcheck:
      disable: true

Test:

$ tail -4 docker-compose.yml
      - ./volumes/mosquitto/pwfile:/mosquitto/pwfile
    healthcheck:
      disable: true

$ UP
[+] Running 2/2
 ✔ Network iotstack_default  Created                                                     0.2s 
 ✔ Container mosquitto       Started                                                     0.1s 

$ DPS
NAMES       CREATED         STATUS         SIZE
mosquitto   3 seconds ago   Up 2 seconds   0B (virtual 19.1MB)

$ DOWN
[+] Running 2/2
 ✔ Container mosquitto       Removed                                                     0.4s 
 ✔ Network iotstack_default  Removed                                                     0.3s 

Note how references to health check have been removed from the STATUS column.

option 2 - health-check enabled with security active

  1. Return to baseline (ie remove two lines added in option 1).

  2. Start Mosquitto:

    $ UP mosquitto
    [+] Running 2/2
     ✔ Network iotstack_default  Created                                                     0.2s 
     ✔ Container mosquitto       Started                                                     0.1s 
  3. Define username and password:

    $ docker exec mosquitto mosquitto_passwd -b /mosquitto/pwfile/pwfile someuser somepassword
    Warning: File /mosquitto/pwfile/pwfile has world readable permissions. Future versions will refuse to load this file.
    To fix this, use `chmod 0700 /mosquitto/pwfile/pwfile`.
    Warning: File /mosquitto/pwfile/pwfile owner is not root. Future versions will refuse to load this file.
    To fix this, use `chown root /mosquitto/pwfile/pwfile`.
    Warning: File /mosquitto/pwfile/pwfile group is not root. Future versions will refuse to load this file.

    Note to self - that will have to be fixed in the IOTstack template structure

    Fix the problem reported:

    $ sudo chmod 700 ./volumes/mosquitto/pwfile/pwfile 
    $ sudo chown root:root ./volumes/mosquitto/pwfile/pwfile 
    $ ls -l ./volumes/mosquitto/pwfile/pwfile 
    -rwx------ 1 root root 122 Oct  9 12:36 ./volumes/mosquitto/pwfile/pwfile
  4. Provide credentials to the health-check script by adding these environment variables:

          - HEALTHCHECK_USER=someuser
          - HEALTHCHECK_PASSWORD=somepassword

    Proof:

    $ grep -A 4 "environment:" docker-compose.yml
        environment:
          - TZ=${TZ:-Etc/UTC}
          - HEALTHCHECK_USER=someuser
          - HEALTHCHECK_PASSWORD=somepassword
        ports:
  5. Enable security:

    $ sudo sed \
        -i.bak \
        -e 's/^#password_file/password_file/' \
        -e 's/^allow_anonymous true/allow_anonymous false/' \
        ./volumes/mosquitto/config/mosquitto.conf 

    Proof:

    $ diff ./volumes/mosquitto/config/mosquitto.conf.bak ./volumes/mosquitto/config/mosquitto.conf
    32,33c32,33
    < #password_file /mosquitto/pwfile/pwfile
    < allow_anonymous true
    ---
    > password_file /mosquitto/pwfile/pwfile
    > allow_anonymous false
  6. UP the container. The container is already running, but the UP will cause docker-compose to notice that the environment variables have changed, so it will re-create the container, which will then pick up the altered config and password file.

    $ UP
    [+] Running 1/1
     ✔ Container mosquitto  Started                                                          0.4s 
    
    $ sleep 30 ; DPS
    NAMES       CREATED          STATUS                    SIZE
    mosquitto   36 seconds ago   Up 35 seconds (healthy)   0B (virtual 19.1MB)
  7. Prove that the container is enforcing security:

    $ mosquitto_sub -v -h 127.0.0.1 -t "#" -F "%I %t %p" -C 1
    Connection error: Connection Refused: not authorised.
    
    $ mosquitto_pub -h 127.0.0.1 -t "hello" -m "test $(date)"
    Connection error: Connection Refused: not authorised.
    Error: The connection was refused.
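
For the record, here's a sketch of how optional credentials like those can be folded into a health-check's publish call. This is an illustration of the pattern, not the actual iotstack_healthcheck.sh; the echo just displays the command that would run:

```shell
# Hypothetical sketch (NOT the real iotstack_healthcheck.sh): fold optional
# credentials into the publish command when they are defined.
HEALTHCHECK_USER="someuser"          # in the container, set via environment:
HEALTHCHECK_PASSWORD="somepassword"  # in the container, set via environment:

OPTIONS=""
if [ -n "${HEALTHCHECK_USER}" ] ; then
   OPTIONS="-u ${HEALTHCHECK_USER} -P ${HEALTHCHECK_PASSWORD}"
fi

# echo rather than execute, so you can inspect the resulting command
echo mosquitto_pub ${OPTIONS} -t "iotstack/mosquitto/healthcheck" -m "$(date)"
```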
Paraphraser commented 8 months ago
[Screenshot 2023-10-09 at 18:08:57]

Screen shot is with changes proposed by #732 and #733:

  1. Working directory is ~/IOTstack.

  2. Erase persistent store.

  3. UP the container.

  4. Show logs - interesting lines are:

    changed ownership of '/mosquitto/pwfile/pwfile' to 0:0
    mode of '/mosquitto/pwfile/pwfile' changed to 0600 (rw-------)
  5. Create a username and password. No errors returned.

  6. Show username and hash made it into the password file.

Noschvie commented 8 months ago

Currently I'm testing MQTT without security: no username and password defined. I assume that the health check should publish a timestamp to "iotstack/mosquitto/healthcheck" at 30-second intervals. Is that right? If so, I'm not seeing this periodic publish.

Paraphraser commented 8 months ago

Wow! Well, I agree. And, what's more, now I understand why it isn't working:

Now, granted, those dates are incredibly rubbery because when someone last updated their local clone against GitHub, and when they last rebuilt their local container, are both unknowns. It's entirely possible that someone could still be running a container built between 2021-05-24 and 2022-04-06, in which case the health-check would be working as originally intended.

It's really only someone who has built Mosquitto since 2022-04-06 who has a health-check that isn't actually working.

That "not working" needs qualifying because, as I'm sure you'll point out, here we are in October 2023 and a Mosquitto container built today will happily report "(health: starting)" for the first 30 seconds, and then report "(healthy)".

But it is, as they say, being loose with the truth.

Before I dive into the intricacies, I'll declare that this is all my own work (both adding the health-check and then breaking it). Mea culpa on steroids. Doh!

Anyway, to go back to taws, the iotstack_healthcheck.sh script gets added to the image by the Dockerfile:

# copy the health-check script into place
ENV HEALTHCHECK_SCRIPT "iotstack_healthcheck.sh"
COPY ${HEALTHCHECK_SCRIPT} /usr/local/bin/${HEALTHCHECK_SCRIPT}

The Dockerfile also sets up the health-check scaffolding:

# define the health check
HEALTHCHECK \
   --start-period=30s \
   --interval=30s \
   --timeout=10s \
   --retries=3 \
   CMD ${HEALTHCHECK_SCRIPT} || exit 1

All by itself, that works (#350). But, then, the Grand Nitwit From Down-Under (ie me) comes along in #521 and decides to "be tidy" by cleaning-up "unused" environment variables:

# don't need these variables in the running container
ENV MOSQUITTO_BASE=
ENV HEALTHCHECK_SCRIPT=
ENV IOTSTACK_ENTRY_POINT=

The problem is that the mechanism that triggers the health-check script evaluates ${HEALTHCHECK_SCRIPT} inside the container each time the health-check script is run. I've helpfully set that variable to null.

Which means it's the equivalent of executing:

$ sh -c ""

Which has a return code of zero.

Which Docker interprets as meaning "healthy".
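
You can confirm that behaviour with plain shell; nothing container-specific is involved:

```shell
# An empty command string is a successful no-op...
sh -c ""
# ...so its exit status is zero, which Docker maps to "healthy"
echo $?    # prints 0
```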

So, the solution is to be a bit less tidy and just remove those four lines above, including where HEALTHCHECK_SCRIPT is set to null, and then build the container again.

With that done:

  1. Start the container from the freshly-built image:

    $ UP mosquitto
    [+] Running 2/2
     ✔ Network iotstack_default  Created                                           0.2s 
     ✔ Container mosquitto       Started                                           0.1s
  2. Start a background listener:

    $ mosquitto_sub -v -h 127.0.0.1 -t "#" -F "%I %t %p" &
    [1] 950861

    which almost immediately reports:

    2023-10-09T22:41:38+1100 iotstack/mosquitto/healthcheck Mon Oct  9 22:40:48 AEDT 2023
  3. Show the various health-check stages, interspersed with another message received by the background listener:

    $ DPS ; sleep 25 ; DPS
    NAMES       CREATED          STATUS                             SIZE
    mosquitto   12 seconds ago   Up 11 seconds (health: starting)   0B (virtual 19.1MB)
    2023-10-09T22:42:06+1100 iotstack/mosquitto/healthcheck Mon Oct  9 22:42:06 AEDT 2023
    NAMES       CREATED          STATUS                    SIZE
    mosquitto   37 seconds ago   Up 36 seconds (healthy)   0B (virtual 19.1MB)
  4. Wait a bit as more messages roll in at 30-second intervals:

    $ 2023-10-09T22:42:36+1100 iotstack/mosquitto/healthcheck Mon Oct  9 22:42:36 AEDT 2023
    2023-10-09T22:43:06+1100 iotstack/mosquitto/healthcheck Mon Oct  9 22:43:06 AEDT 2023
    2023-10-09T22:43:37+1100 iotstack/mosquitto/healthcheck Mon Oct  9 22:43:37 AEDT 2023
    2023-10-09T22:44:07+1100 iotstack/mosquitto/healthcheck Mon Oct  9 22:44:07 AEDT 2023
  5. Clean up:

    $ kill %1
    [1]+  Done                    mosquitto_sub -v -h 127.0.0.1 -t "#" -F "%I %t %p"

Two more PRs on the way.

Paraphraser commented 8 months ago

By the way, I now realise that I completely misunderstood your original post.

When you wrote:

Tests are done without providing credentials.

I thought you were telling me that you had set up a password scheme but the health-check was failing because the process running inside the container wasn't using any of your credentials.

That's why I then set about proving (to myself) that credentials could be passed to the health-check script via environment variables.

Sorry for going off on the wrong track.

Still, it did reveal the need to change how the pwfile is set up so some good came of it.

Noschvie commented 8 months ago

Thank you very much! Now it's working as expected.

$ mosquitto_sub -v -h localhost -p 1883 -t "iotstack/mosquitto/healthcheck"
iotstack/mosquitto/healthcheck Mon Oct  9 14:45:22 CEST 2023
iotstack/mosquitto/healthcheck Mon Oct  9 14:45:52 CEST 2023
iotstack/mosquitto/healthcheck Mon Oct  9 14:46:22 CEST 2023
iotstack/mosquitto/healthcheck Mon Oct  9 14:46:53 CEST 2023
iotstack/mosquitto/healthcheck Mon Oct  9 14:47:23 CEST 2023
iotstack/mosquitto/healthcheck Mon Oct  9 14:47:53 CEST 2023
iotstack/mosquitto/healthcheck Mon Oct  9 14:48:23 CEST 2023

Noschvie commented 8 months ago

https://github.com/louislam/uptime-kuma/issues/2405#issuecomment-1753028587

Paraphraser commented 8 months ago

I got an email containing this:

By the way: would it be possible to get the timestamp using the timezone of the container instead of UTC ?

I assume you figured it out and deleted the question.

My question to you is, which method did you use, because there are two:

  1. You can edit the service definition:

        environment:
          - TZ=${TZ:-Etc/UTC}

    to be either (in my case):

        environment:
          - TZ=${TZ:-Australia/Sydney}

    or:

        environment:
          - TZ=Australia/Sydney
  2. You can leave the service definition alone:

        environment:
          - TZ=${TZ:-Etc/UTC}

    and add your timezone to the .env file:

    $ cd ~/IOTstack
    $ echo "TZ=$(cat /etc/timezone)" >> .env
    $ docker-compose up -d

Method 1 works on a per-container basis. Method 2 works for all containers that define TZ=${TZ:-Etc/UTC}.

Not every container supports TZ. In general, if the person who controls the Dockerfile includes the tzdata package then the container has time-zone support; if it's omitted, you're SOL.

That's why the add-on Dockerfile for Mosquitto in the IOTstack template contains:

RUN apk update && apk add --no-cache rsync tzdata

It's not in the "official" image so we have to add it.
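
A quick way to see the effect of tzdata (this works on any host where the time-zone database is installed):

```shell
# The same instant rendered under two different time zones:
TZ=Etc/UTC date +%Z              # prints UTC
TZ=Australia/Sydney date +%Z     # prints AEST or AEDT, depending on the date
```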

Noschvie commented 8 months ago

I assume you figured it out and deleted the question.

yes.

    environment:
      - TZ=${TZ:-Europe/Vienna}

I added this to each service. I'll change it to your method 2 using the .env file.

Noschvie commented 8 months ago

Because Uptime-Kuma doesn't support regex for the "MQTT Success Message", it would be good to have an environment parameter for the health-check payload (which is currently PUBLISH=$(date)). What do you think? Thanks!

Paraphraser commented 8 months ago

To save me some time (and so that I don't have to do a deep dive into Uptime-Kuma to understand it), can you please summarise what Uptime-Kuma can/can't do and what it actually needs to work.

Right now, just using $(date) serves two goals:

  1. the message can be expected to change between invocations; and
  2. it works in Alpine.

That second one is not a trivial concern. A lot of things work differently in Alpine and that has tripped me up often enough to make me very wary. If your goal is to be able to assume that the availability of something like:

environment:
  - HEALTHCHECK_PUBLISH=$(date)

would mean that you'd be able to pass a full set of options to date running inside the container then I'd suggest that you first try this:

  1. Outside the container, run:

    $ date --help

    and observe that you get over 100 lines of help text describing all manner of options.

  2. Repeat the command inside the Mosquitto container:

    $ docker exec mosquitto date --help

    and observe a mere 20-odd lines of help text and far fewer options.

I'd rather not create the kind of maintenance problem where people get to complain that:

date with this set of parameters clearly works outside the container yet when I pass the same parameters to Mosquitto, everything turns to custard!

What I'd rather do is figure out some mechanism that satisfies the two goals I mentioned above and also works with Uptime-Kuma.

Also, right now, the "message" parameter (aka "the payload") of the mosquitto_pub is just whatever raw string comes back from the Alpine version of date. There's no reason why it can't become a JSON string. Something like:

PUBLISH="{\"date\":\"$(date)\",\"uptime\":\"$(uptime)\"}"

That would get you a payload like:

{"date":"Wed Oct 11 08:57:41 AEDT 2023","uptime":" 08:57:41 up 3 days, 11:17, 0 users, load average: 0.16, 0.17, 0.17"}

Would that be useful?

Incidentally, the uptime command run inside any container gets the uptime of the host system:

$ uptime ; docker exec mosquitto uptime ; docker exec nodered uptime
 09:00:03 up 3 days, 11:19,  1 user,  load average: 0.72, 0.30, 0.22
 09:00:03 up 3 days, 11:19,  0 users,  load average: 0.72, 0.29, 0.21
 09:00:03 up 3 days, 11:19,  0 users,  load average: 0.72, 0.29, 0.21

I don't know whether that helps/hinders your quest.

Noschvie commented 8 months ago

Hi Phill, Uptime-Kuma is only able to compare a constant string; no expressions / regex are supported. Therefore I would have to configure, for example:

environment:
  - HEALTHCHECK_PUBLISH="Mosquitto healthcheck"

Paraphraser commented 8 months ago

So you're saying Uptime-Kuma can't deal with a string that varies, right?

So, instead of the date (which varies) you want a fixed string. Right?

If "yes" then that defeats the purpose of using $(date) in the existing health-check script, which is to ensure that the string is different on each run.

The reason it needs to be different is because of the way the health-check script works. It publishes a retained message, and then subscribes to that topic for exactly one message. Because it's a retained message, it will persist until the next time the script runs and publishes a new retained message.

It also doesn't matter what else happens to just about anything in the meantime. Uptime-Kuma could stop and start. The Pi (or whatever) hosting Docker and the mosquitto container could reboot. The mosquitto container could go down and up. The container could be in a restart loop. All of that could happen multiple times but, each time mosquitto is ready for business (even if only for a few seconds), that retained message will always be sent to any subscriber. In short, far from giving you any assurance that mosquitto is working, a fixed retained payload creates a false positive.

Having the payload vary is the only way to be certain, on run n+1, that the retained message actually came from run n+1 and isn't from run n, and is masking a problem (like the container being in a restart loop).
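
To make the run n vs run n+1 point concrete, here's a toy illustration in plain shell (no broker involved; the strings merely stand in for retained payloads):

```shell
# Why the payload must vary: if run n+1 published the same fixed string
# as run n, a stale retained copy would be indistinguishable from a
# fresh message. A varying payload exposes the staleness.
EXPECTED="healthcheck run n+1"   # what run n+1 just published
RECEIVED="healthcheck run n"     # stale retained message left over from run n

if [ "${RECEIVED}" = "${EXPECTED}" ] ; then
   echo "healthy"
else
   echo "stale payload - possible restart loop"
fi
```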

I'm not a great fan of retained messages so I did try writing the health-check script without it, by doing things in the opposite order: set up a background listener which would exit after the first message, then publish a non-retained message in the foreground, wait for the background process to finish, then retrieve what it received and do the compare. It just wouldn't work.

Bottom line: the answer to the question of "can we have a fixed string" is "no".


So let me turn the problem around.

Do I assume correctly that Uptime-Kuma simply subscribes to the topic and treats the simple arrival of a message within a particular period as evidence of health?

If "yes" then why not just publish your own message to Mosquitto?

Assume Uptime-Kuma is subscribing to the "/proof/of/concept" topic. If you just run:

$ mosquitto_pub -t "/proof/of/concept" -m "Mosquitto healthcheck"

then Uptime-Kuma will receive "Mosquitto healthcheck", right?

What reception proves is that the MQTT broker (the mosquitto container) is functioning properly. It has been able to receive the published message and distribute it to all registered subscribers. The container is, by definition, working (at least for the duration of the publish/subscribe cycle).

It's not a retained message so receiving the payload proves it was sent "recently".

To make that happen at 60-second intervals, just stitch it to a cron job:

* * * * * mosquitto_pub -t "/proof/of/concept" -m "Mosquitto healthcheck" 2>/dev/null

If you really want it more frequently (eg every 30 seconds) then write a short bash script. Something like this would do the job:

#!/usr/bin/env bash

while : ; do
    mosquitto_pub -t "/proof/of/concept" -m "Mosquitto healthcheck" 2>/dev/null
    sleep 30
done

Stick that in your ~/.local/bin with a name like run_uptime_kuma_for_mosquitto.sh and launch it from the crontab at reboot time:

@reboot ./.local/bin/run_uptime_kuma_for_mosquitto.sh

The 2>/dev/null will silence any errors that will be produced if the mosquitto container is down. Publishing operations will resume as soon as the container is up and functioning.

Does that help?

Noschvie commented 8 months ago

Hi Phill

thanks for your detailed explanation.

So you're saying Uptime-Kuma can't deal with a string that varies, right?

Yes.

So, instead of the date (which varies) you want a fixed string. Right?

Yes.

But it seems it isn't a good idea to change the current Mosquitto health-check just for Uptime-Kuma.

I will use the LWT topic (Last Will and Testament) from a Tasmota device to check health instead, so I won't touch the current Mosquitto health-check. Thanks!

Noschvie commented 8 months ago

The solution is very simple: configure the monitor in Uptime-Kuma and leave the input field "MQTT Success Message" empty. Then only the arrival of a message is checked, not its payload. Great! And simple, isn't it?

Paraphraser commented 8 months ago

So that means you can use the health-check message then?

By the way, thanks for reporting this. Otherwise, I would never have realised anything was wrong. 🤦

Paraphraser commented 8 months ago

Also, the multi-talented Mr @Slyke has just processed the pull request so the changes to the Dockerfile and the entry-point script are now live on GitHub. The basic trick is to do a git status and then git restore «file» anything changed in the ~/IOTstack/.templates/mosquitto directory, and then a git pull will work and you'll be up-to-date again.

Noschvie commented 8 months ago

Yes, the health-check message works. When health-check messages stop arriving, Uptime-Kuma reports an error (tested by stopping the Mosquitto container).

Paraphraser commented 8 months ago

I opened Mosquitto issue 2923. Even though my misunderstanding of your original post led me down that particular rabbit hole, root ownership feels wrong (at least in the Docker context) and I can't see any reason why the pwfile needs execute permission.

I'm hoping someone who knows a lot more about Mosquitto than I do will cast a knowledgeable eye over that issue and either set me straight or agree and propose a fix.

Incidentally, I also realised that part of the reason we (IOTstack users deploying Mosquitto) see that "not owned by root" warning is because our template starts with an empty pwfile owned by ID=1883. If you actually use mosquitto_passwd to create a password file from scratch, it gets root ownership and mode 600. Then, on the next container restart the root ownership will be reset to 1883 and the "not owned by root" warnings will start up on subsequent runs of mosquitto_passwd.

The reason we see the "world readable permissions" warning is a side-effect of Git which, according to the reading I've been doing, only lets you specify whether a file has the execute bit set or not. There seems to be no way to set mode 600 in the template structure on GitHub and have it persist all the way through.