AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
MIT License

[BUG] my healthcheck started failing after upgrading to v0.8.0 #601

Closed ovizii closed 3 months ago

ovizii commented 3 months ago

Describe the bug: This healthcheck has been working for a while but started failing after the upgrade to scrutiny v0.8.0:

    healthcheck:
      test: curl -ILfSs http://localhost:8080/api/health  || exit 1
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

If I manually enter the container and run the command, echo $? returns 0, so it clearly works.

Expected behavior: The healthcheck should keep working.

Log Files: tmp.log

Please also provide the output of docker info

Client: Docker Engine - Community
 Version:    25.0.4
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.13.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.24.7
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 51
  Running: 51
  Paused: 0
  Stopped: 0
 Images: 80
 Server Version: 25.0.4
 Storage Driver: zfs
  Zpool: rpool
  Zpool Health: ONLINE
  Parent Dataset: rpool/docker
  Space Used By Parent: 18224963584
  Space Available: 698153603072
  Parent Quota: no
  Compression: on
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 6.5.11-8-pve
 Operating System: Debian GNU/Linux 12 (bookworm)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 62.53GiB
 Name: nas
 ID: a9b276ef-51a5-47fe-b188-38ff9edb9e79
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: ovizii
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Default Address Pools:
   Base: 172.16.0.0/12, Size: 28

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

stanrc85 commented 3 months ago

I'm having the same problem. Healthcheck fails, but manually running the command in the console works as expected.

ovizii commented 3 months ago

Good to hear it's not just me. Btw. the restart you see in the logs is my autoheal container restarting scrutiny because the healthcheck failed.

For what it is worth, I'm also wondering about this line in the logs:

scrutiny | time="2024-03-14T09:41:22+01:00" level=warning msg="Could not get the most recent data points from the database. This is expected to happen only if this is the very first submission of data for the device." type=web

This is definitely not the first time for this device. Also, I have 2 different SDA devices in scrutiny because I swapped around a few drives.

Scared to delete one from scrutiny as when I tried that, scrutiny stopped working.

sherbibv commented 3 months ago

Having the same problem with this health configuration

    healthcheck:
      test: curl --connect-timeout 15 --silent --show-error --fail http://localhost:8080/api/health | grep -q 'true'
      interval: 60s
      retries: 5
      timeout: 10s
      start_period: 20s

AnalogJ commented 3 months ago

@dropsignal was this something you experienced when developing on Debian 12 (bookworm)?

dropsignal commented 3 months ago

Not this specifically, but it looks like it's related to the unified /usr issue. The healthcheck is trying to execute curl via /bin/sh, which doesn't exist anymore. Technically it does exist on a full installation of Debian 12, because /bin is just a symbolic link to /usr/bin. However, in Debian 12 slim, /bin/sh is gone. If the healthcheck were updated to use /usr/bin/sh, it would work again.

The other option is to put a symbolic link from /bin/sh to /usr/bin/sh in the docker image, but that defeats the purpose of trying to unify everything under /usr/bin if others don't update their path references like they're supposed to.
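
For context, Compose treats a string-valued test as the CMD-SHELL form, which Docker runs through the container's default shell (/bin/sh). If the diagnosis above holds, one way to check it would be to name the shell explicitly in exec form. The fragment below is only a sketch: it assumes /usr/bin/sh and curl are both present in the image, and it reuses the timing values from the original report.

```yaml
    healthcheck:
      # Exec form: Docker runs this argument list directly instead of wrapping it in /bin/sh -c.
      # Assumption: /usr/bin/sh exists in the scrutiny image (not verified in this thread).
      test: ["CMD", "/usr/bin/sh", "-c", "curl -ILfSs http://localhost:8080/api/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
```

If this variant passes while the plain string form fails, that would point at shell lookup rather than at curl or the /api/health endpoint.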

AnalogJ commented 3 months ago

@ovizii @sherbibv @stanrc85

Can you try changing:

    healthcheck:
      test: curl -ILfSs http://localhost:8080/api/health  || exit 1
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

to

    healthcheck:
      test: /usr/bin/curl -ILfSs http://localhost:8080/api/health  || exit 1
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

or

    healthcheck:
      test:["CMD", "curl", "-ILfSs", "http://localhost:8080/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

and report back?

sherbibv commented 3 months ago

@AnalogJ the second solution worked for me. Thank you!

ovizii commented 3 months ago

The first one didn't work for me either, while the second one does, but you need to insert a space after test: or it will just error:

test:["CMD", => test: ["CMD",

Thanks for helping us out.
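
Putting the two observations together, the variant reported as working in this thread would look roughly like the fragment below. The values are copied from the suggestion above; the only change is the space after test:, since YAML needs a space between the colon and the value of a mapping key.

```yaml
    healthcheck:
      # Exec form ("CMD"): curl is started directly, so no shell is involved.
      # curl's -f flag makes it exit non-zero on HTTP errors, so "|| exit 1" is not needed.
      test: ["CMD", "curl", "-ILfSs", "http://localhost:8080/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
```

One design note: the exec form does not interpret shell syntax, so a check that pipes into grep (as in the earlier `| grep -q 'true'` example) cannot be written this way without invoking a shell explicitly.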