dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/

Dragonfly Docker image v1.16.0 healthcheck is broken #2841

Closed turbotimon closed 5 months ago

turbotimon commented 5 months ago

Describe the bug After updating to (freshly pulling) the latest image (v1.16.0), the docker container crashes during start with the following log.

I20240404 10:51:11.817468     1 init.cc:70] dragonfly running in opt mode.
I20240404 10:51:11.817705     1 dfly_main.cc:641] Starting dragonfly df-v1.16.0-8bd35754de9ae49908369961634dad0b7fbea878
* Logs will be written to the first available of the following paths:
/tmp/dragonfly.*
./dragonfly.*
* For the available flags type dragonfly [--help | --helpfull]
* Documentation can be found at: https://www.dragonflydb.io/docs
W20240404 10:51:11.818120     1 dfly_main.cc:680] SWAP is enabled. Consider disabling it when running Dragonfly.
I20240404 10:51:11.818161     1 dfly_main.cc:685] maxmemory has not been specified. Deciding myself....
I20240404 10:51:11.818197     1 dfly_main.cc:694] Found 460.68GiB available memory. Setting maxmemory to 368.54GiB
W20240404 10:51:11.818303     1 dfly_main.cc:368] Weird error 1 switching to epoll
I20240404 10:51:11.908849     1 proactor_pool.cc:146] Running 48 io threads
I20240404 10:51:11.980602     1 server_family.cc:713] Host OS: Linux 5.15.0-101-generic x86_64 with 48 threads
I20240404 10:51:12.010690     1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/data"
W20240404 10:51:12.010876     1 server_family.cc:806] Load snapshot: No snapshot found
I20240404 10:51:12.038388     9 listener_interface.cc:101] sock[99] AcceptServer - listening on port 6379
I20240404 10:52:44.101867     8 accept_server.cc:24] Exiting on signal Terminated
I20240404 10:52:44.103591     9 listener_interface.cc:201] Listener stopped for port 6379
I20240404 10:52:44.149309    13 save_stages_controller.cc:321] Saving "dump-2024-04-04T10:52:44-summary.dfs" finished after 0 us

The workaround was pinning the image back to v1.15.1; with that it starts normally again. Not sure if this is related to #2739.

To Reproduce Start docker compose with this config:

version: '3.9'
services:

  dragonfly:
    image: docker.dragonflydb.io/dragonflydb/dragonfly #latest is v1.16.0 at this time
    # image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.15.1 #fixes the problem
    restart: unless-stopped
    ulimits:
      memlock: -1
    volumes:
      - pretalx-redis:/data

  # Other services...

volumes:
  pretalx-redis:
  # Other volumes not related to dragonfly ...

Expected behavior Container won't crash

catataw commented 5 months ago

redis:
  image: 'docker.dragonflydb.io/dragonflydb/dragonfly'
  entrypoint:

I20240404 10:54:28.045588     1 init.cc:70] dragonfly running in opt mode.
I20240404 10:54:28.045682     1 dfly_main.cc:641] Starting dragonfly df-v1.16.0-8bd35754de9ae49908369961634dad0b7fbea878

romange commented 5 months ago

Dragonfly does not crash here. Based on the logs it shuts down in an orderly fashion. Seems that the healthcheck in docker got screwed up. Probably this PR: https://github.com/dragonflydb/dragonfly/pull/2659. I am checking.

Abhra303 commented 5 months ago

It is because the healthcheck script uses the port of the last entry returned by netstat -tuln to check the health of the dragonfly process. @turbotimon could you please run netstat -tuln on the container and share the output here?

You can set the HEALTHCHECK_PORT env to your dragonfly port (6379 by default) in the container to fix the issue.

romange commented 5 months ago

Sorry about this. The workaround is to add environment: HEALTHCHECK_PORT: 6379 like this:

version: '3.8'
services:
  dragonfly:
    image: 'docker.dragonflydb.io/dragonflydb/dragonfly:v1.16.0'
    restart: unless-stopped
    environment:
      HEALTHCHECK_PORT: 6379
    ulimits:
      memlock: -1
    ports:
      - "6379:6379"

We will release a patch for this next week. If you want to help us, please fix the tools/docker/healthcheck.sh script; specifically, netstat -tuln | grep -oE ':[0-9]+' | grep -oE '[0-9]+' | tail -n 1 returns the wrong port.
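One possible direction for such a fix (a sketch only, under the assumption that honoring HEALTHCHECK_PORT is the intended behavior; this is not the shipped script): read HEALTHCHECK_PORT first, fall back to Dragonfly's default port 6379, and drop the netstat guessing entirely. The redis-cli probe in the comment is likewise an assumption about how the check would be performed.

```shell
#!/bin/sh
# Hypothetical sketch of a fixed healthcheck.sh (not the actual script):
# prefer an explicit HEALTHCHECK_PORT, fall back to the default 6379,
# and stop guessing the port from netstat output entirely.
PORT="${HEALTHCHECK_PORT:-6379}"
echo "healthcheck: probing port $PORT"
# A real script would now probe the server on $PORT, for example:
#   redis-cli -h localhost -p "$PORT" ping
```

With HEALTHCHECK_PORT unset this prints "healthcheck: probing port 6379"; with HEALTHCHECK_PORT=7000 it prints "healthcheck: probing port 7000".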

Abhra303 commented 5 months ago

We will release a patch for this next week. If you want to help us, please fix the tools/docker/healthcheck.sh script; specifically, netstat -tuln | grep -oE ':[0-9]+' | grep -oE '[0-9]+' | tail -n 1 returns the wrong port.

It doesn't always return the wrong port; for me it passed locally. It depends on the last entry returned by netstat (I don't know how it sorts the list; maybe by last activity or establishment order).

turbotimon commented 5 months ago

@Abhra303 Thanks! And sorry, I was away for the weekend. Do you still need this (as it should be fixed now)?

turbotimon could you please run netstat -tuln on the container and share the output here?

romange commented 5 months ago

@turbotimon it will be released in v1.16.1