dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/

Dragonfly Docker image v1.16.0 healthcheck is broken #2841

Closed turbotimon closed 5 months ago

turbotimon commented 5 months ago

Describe the bug After updating to (freshly pulling) the latest image (v1.16.0), the docker container crashes during start with the following log.

I20240404 10:51:11.817468     1 init.cc:70] dragonfly running in opt mode.
I20240404 10:51:11.817705     1 dfly_main.cc:641] Starting dragonfly df-v1.16.0-8bd35754de9ae49908369961634dad0b7fbea878
* Logs will be written to the first available of the following paths:
/tmp/dragonfly.*
./dragonfly.*
* For the available flags type dragonfly [--help | --helpfull]
* Documentation can be found at: https://www.dragonflydb.io/docs
W20240404 10:51:11.818120     1 dfly_main.cc:680] SWAP is enabled. Consider disabling it when running Dragonfly.
I20240404 10:51:11.818161     1 dfly_main.cc:685] maxmemory has not been specified. Deciding myself....
I20240404 10:51:11.818197     1 dfly_main.cc:694] Found 460.68GiB available memory. Setting maxmemory to 368.54GiB
W20240404 10:51:11.818303     1 dfly_main.cc:368] Weird error 1 switching to epoll
I20240404 10:51:11.908849     1 proactor_pool.cc:146] Running 48 io threads
I20240404 10:51:11.980602     1 server_family.cc:713] Host OS: Linux 5.15.0-101-generic x86_64 with 48 threads
I20240404 10:51:12.010690     1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/data"
W20240404 10:51:12.010876     1 server_family.cc:806] Load snapshot: No snapshot found
I20240404 10:51:12.038388     9 listener_interface.cc:101] sock[99] AcceptServer - listening on port 6379
I20240404 10:52:44.101867     8 accept_server.cc:24] Exiting on signal Terminated
I20240404 10:52:44.103591     9 listener_interface.cc:201] Listener stopped for port 6379
I20240404 10:52:44.149309    13 save_stages_controller.cc:321] Saving "dump-2024-04-04T10:52:44-summary.dfs" finished after 0 us

The workaround was pinning the image back to v1.15.1; with that it starts normally again. Not sure if this is related to #2739.

To Reproduce Start docker compose with this config:

version: '3.9'
services:

  dragonfly:
    image: docker.dragonflydb.io/dragonflydb/dragonfly #latest is v1.16.0 at this time
    # image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.15.1 #fixes the problem
    restart: unless-stopped
    ulimits:
      memlock: -1
    volumes:
      - pretalx-redis:/data

  # Other services...

volumes:
  pretalx-redis:
  # Other volumes not related to dragonfly ...

Expected behavior Container won't crash

catataw commented 5 months ago

redis:
  image: 'docker.dragonflydb.io/dragonflydb/dragonfly'
  entrypoint:

I20240404 10:54:28.045588     1 init.cc:70] dragonfly running in opt mode.
I20240404 10:54:28.045682     1 dfly_main.cc:641] Starting dragonfly df-v1.16.0-8bd35754de9ae49908369961634dad0b7fbea878

romange commented 5 months ago

Dragonfly does not crash here. Based on the logs it shuts down in an orderly fashion. Seems that the healthcheck in docker got screwed up. Probably this PR: https://github.com/dragonflydb/dragonfly/pull/2659. I am checking.

Abhra303 commented 5 months ago

It is because the healthcheck script uses the port of the last entry returned by netstat -tuln to check the health of the dragonfly process. @turbotimon could you please run netstat -tuln on the container and share the output here?

You can set the HEALTHCHECK_PORT env to your dragonfly port (6379 by default) in the container to fix the issue.

romange commented 5 months ago

Sorry about this. The workaround is to add environment: HEALTHCHECK_PORT: 6379 like this:

version: '3.8'
services:
  dragonfly:
    image: 'docker.dragonflydb.io/dragonflydb/dragonfly:v1.16.0'
    restart: unless-stopped
    environment:
      HEALTHCHECK_PORT: 6379
    ulimits:
      memlock: -1
    ports:
      - "6379:6379"

We will release a patch for this next week. If you want to help us, please fix the tools/docker/healthcheck.sh script; specifically, netstat -tuln | grep -oE ':[0-9]+' | grep -oE '[0-9]+' | tail -n 1 returns the wrong port.
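One possible direction for such a fix (a sketch only, under the assumption that honoring HEALTHCHECK_PORT is the intended behavior; this is not the shipped script): read HEALTHCHECK_PORT first, fall back to Dragonfly's default port 6379, and drop the netstat guessing entirely. The redis-cli probe in the comment is likewise an assumption about how the check would be performed.

```shell
#!/bin/sh
# Hypothetical sketch of a fixed healthcheck.sh (not the actual script):
# prefer an explicit HEALTHCHECK_PORT, fall back to the default 6379,
# and stop guessing the port from netstat output entirely.
PORT="${HEALTHCHECK_PORT:-6379}"
echo "healthcheck: probing port $PORT"
# A real script would now probe the server on $PORT, for example:
#   redis-cli -h localhost -p "$PORT" ping
```

With HEALTHCHECK_PORT unset this prints "healthcheck: probing port 6379"; with HEALTHCHECK_PORT=7000 it prints "healthcheck: probing port 7000".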

Abhra303 commented 5 months ago

We will release a patch for this next week. If you want to help us, please fix the tools/docker/healthcheck.sh script; specifically, netstat -tuln | grep -oE ':[0-9]+' | grep -oE '[0-9]+' | tail -n 1 returns the wrong port.

It doesn't always return the wrong port; for me it passed locally. It depends on the last entry returned by netstat (I don't know how it sorts the list; maybe by last activity or establishment order).

turbotimon commented 5 months ago

@Abhra303 Thanks! And sorry, I was away for the weekend. Do you still need this (as it should be fixed now)?

turbotimon could you please run netstat -tuln on the container and share the output here?

romange commented 5 months ago

@turbotimon it will be released in v1.16.1