dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/
Other
26.03k stars 957 forks source link

Dragonfly crashes in minimal memory configuration #2739

Open applike-ss opened 8 months ago

applike-ss commented 8 months ago

Describe the bug I have discovered crashes of dragonfly upon testing the cache mode to ensure it is stable before using it in prod workloads. It seems that it is crashing at a random point when putting lots of data

To Reproduce Steps to reproduce the behavior:

  1. create a data.txt for later data ingestion use: dd if=/dev/urandom bs=1024 count=100| base64 > data.txt
  2. now create a script to ingest data and call it redis-pump.sh (put same folder as data.txt):
    
    #!/bin/zsh

CLUSTER_HOSTNAME=$1 MAX_ITERATIONS=$2

i=0

if [ "" = "${MAX_ITERATIONS}" ]; then echo "pushing as much data as possible to ${CLUSTER_HOSTNAME}" else echo "pushing ${MAX_ITERATIONS} chunks of data to ${CLUSTER_HOSTNAME}" fi

DATA="$(cat data.txt)"

while true; do i=$(expr $i + 1)

redis-cli -h "${CLUSTER_HOSTNAME}" -x set x-data-$i < data.txt | grep -v "OK"

if [ $((i % 5)) = 0 ]; then
    echo "pushed $i data chunks to ${CLUSTER_HOSTNAME}"
fi

if [ "$i" = "${MAX_ITERATIONS}" ]; then
    echo "pushed all $i data chunks to ${CLUSTER_HOSTNAME}"
    exit
fi

done

3. in one terminal execute: `docker run -it --rm --cpus 0.1 -m 320M -p 6379:6379 -p 9999:9999 --mount type=tmpfs,destination=/dragonfly/snapshots,tmpfs-size=524288000 ghcr.io/dragonflydb/dragonfly-weekly:e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f-ubuntu --alsologtostderr --primary_port_http_enabled=false --admin_port=9999 --admin_nopass --cache_mode --primary_port_http_enabled=true --cluster_mode=emulated --dir=/dragonfly/snapshots --snapshot_cron="*/5 * * * *"`
4. Wait for dragonfly to run
5. run `./redis-pump.sh 127.0.0.1 1000000` and wait ~7-10 minutes

**Expected behavior**
There should not be any crashes. If there are crashes, there should be a reasonable error/crit message telling why it happened

**Logs**
`dragonfly`:

I20240318 14:37:46.707206 1 init.cc:70] dragonfly running in opt mode.

redis-pump.sh

pushed 5140 data chunks to 127.0.0.1
pushed 5145 data chunks to 127.0.0.1
pushed 5150 data chunks to 127.0.0.1
pushed 5155 data chunks to 127.0.0.1
pushed 5160 data chunks to 127.0.0.1
pushed 5165 data chunks to 127.0.0.1
Error: Server closed the connection
Error: Connection reset by peer
Could not connect to Redis at 127.0.0.1:6379: Connection refused
Could not connect to Redis at 127.0.0.1:6379: Connection refused
Could not connect to Redis at 127.0.0.1:6379: Connection refused
pushed 5170 data chunks to 127.0.0.1

Environment (please complete the following information):

Reproducible Code Snippet

See above

Is there any additional information i can provide or an option to let dragonfly run with debug logging for gathering more interesting information?

applike-ss commented 8 months ago

Removing dir and snapshot_cron parameters btw. does not show this behaviour. I have it running without these parameters and pumping data into it for more than 6 hours already. I could imagine a memory related issue that makes containerd just kill the container, e.g. creating the snapshot + the data that is in memory for storing the kv data runs into the limit that i set for the container (320Mi).

romange commented 8 months ago

you can easily verify it @applike-ss by raising the limit and checking used_memory_peak_rss with "INFO MEMORY" command

romange commented 8 months ago

Is there anything we can help here @applike-ss ?

@chakaz probably a good use-case to focus on, when switch working on RSS instrumentation. Not urgent though.

applike-ss commented 8 months ago

you can easily verify it @applike-ss by raising the limit and checking used_memory_peak_rss with "INFO MEMORY" command

I'm doing that right now. I assume you want me to watch "redis-cli info memory | grep used_memory_peak_rss", correct? Otherwise let me know how i'm supposed to test here. I will post results once i got them.

Is there anything we can help here @applike-ss ?

@chakaz probably a good use-case to focus on, when switch working on RSS instrumentation. Not urgent though.

Sorry, was on vacation and couldn't check back here.

applike-ss commented 8 months ago

Observed memory usages:

Limited to 320Mi (256Mi maxmemory, crashing): Output:

I20240402 06:52:46.653187     1 dfly_main.cc:627] Starting dragonfly df-e8650ed-e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f
W20240402 06:52:46.653295     1 dfly_main.cc:666] SWAP is enabled. Consider disabling it when running Dragonfly.
I20240402 06:52:46.653316     1 dfly_main.cc:671] maxmemory has not been specified. Deciding myself....
I20240402 06:52:46.653319     1 dfly_main.cc:680] Found 320.00MiB available memory. Setting maxmemory to 256.00MiB
I20240402 06:52:46.654233     8 uring_proactor.cc:172] IORing with 1024 entries, allocated 102464 bytes, cq_entries is 2048
I20240402 06:52:46.654599     1 proactor_pool.cc:146] Running 1 io threads
I20240402 06:52:46.720907     1 dfly_main.cc:258] Listening on admin socket any:9999
I20240402 06:52:46.721390     1 server_family.cc:665] Host OS: Linux 6.6.16-linuxkit aarch64 with 1 threads
I20240402 06:52:46.721801     1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
W20240402 06:52:46.721869     1 server_family.cc:758] Load snapshot: No snapshot found
I20240402 06:52:46.722051     8 listener_interface.cc:122] sock[5] AcceptServer - listening on port 9999
I20240402 06:52:46.722074     8 listener_interface.cc:122] sock[6] AcceptServer - listening on port 6379
I20240402 06:53:00.228514     8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:53:00-summary.dfs" finished after 0 us
I20240402 06:54:02.930020     8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:54:00-summary.dfs" finished after 2 s

used_memory_peak_rss:320741376

Limited to 640Mi (512Mi maxmemory, not crashing): Output:

I20240402 06:56:57.617758     1 init.cc:70] dragonfly running in opt mode.
* Logs will be written to the first available of the following paths:
/tmp/dragonfly.*
./dragonfly.*
* For the available flags type dragonfly [--help | --helpfull]
* Documentation can be found at: https://www.dragonflydb.io/docs
I20240402 06:56:57.617822     1 dfly_main.cc:627] Starting dragonfly df-e8650ed-e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f
W20240402 06:56:57.617899     1 dfly_main.cc:666] SWAP is enabled. Consider disabling it when running Dragonfly.
I20240402 06:56:57.617913     1 dfly_main.cc:671] maxmemory has not been specified. Deciding myself....
I20240402 06:56:57.617915     1 dfly_main.cc:680] Found 640.00MiB available memory. Setting maxmemory to 512.00MiB
I20240402 06:56:57.618276     8 uring_proactor.cc:172] IORing with 1024 entries, allocated 102464 bytes, cq_entries is 2048
I20240402 06:56:57.618389     1 proactor_pool.cc:146] Running 1 io threads
I20240402 06:56:57.619359     1 dfly_main.cc:258] Listening on admin socket any:9999
I20240402 06:56:57.619737     1 server_family.cc:665] Host OS: Linux 6.6.16-linuxkit aarch64 with 1 threads
I20240402 06:56:57.619998     1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
W20240402 06:56:57.620031     1 server_family.cc:758] Load snapshot: No snapshot found
I20240402 06:56:57.620177     8 listener_interface.cc:122] sock[5] AcceptServer - listening on port 9999
I20240402 06:56:57.620189     8 listener_interface.cc:122] sock[6] AcceptServer - listening on port 6379
I20240402 06:57:00.002776     8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:57:00-summary.dfs" finished after 0 us
I20240402 06:58:01.918434     8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:58:00-summary.dfs" finished after 1 s
I20240402 06:59:04.121477     8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:59:00-summary.dfs" finished after 4 s
E20240402 07:00:09.115413     8 rdb_save.cc:1406] io error system:28
W20240402 07:00:09.325433     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:01:09.622743     8 rdb_save.cc:1406] io error system:28
W20240402 07:01:09.825254     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:02:13.933493     8 rdb_save.cc:1406] io error system:28
W20240402 07:02:14.120755     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:03:17.126403     8 rdb_save.cc:1406] io error system:28
W20240402 07:03:17.223644     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:04:19.825512     8 rdb_save.cc:1406] io error system:28
W20240402 07:04:20.023314     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:05:16.028622     8 rdb_save.cc:1406] io error system:28
W20240402 07:05:16.226780     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:06:20.922616     8 rdb_save.cc:1406] io error system:28
W20240402 07:06:21.125222     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:07:19.232414     8 rdb_save.cc:1406] io error system:28
W20240402 07:07:19.429133     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:08:19.621541     8 rdb_save.cc:1406] io error system:28
W20240402 07:08:19.833389     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:09:15.229292     8 rdb_save.cc:1406] io error system:28
W20240402 07:09:15.426709     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:10:18.745149     8 rdb_save.cc:1406] io error system:28
W20240402 07:10:19.034804     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:11:20.337744     8 rdb_save.cc:1406] io error system:28
W20240402 07:11:20.622592     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:12:20.823269     8 rdb_save.cc:1406] io error system:28
W20240402 07:12:21.012964     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:13:20.706354     8 rdb_save.cc:1406] io error system:28
W20240402 07:13:20.903396     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:14:19.709885     8 rdb_save.cc:1406] io error system:28
W20240402 07:14:20.008428     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:15:19.915210     8 rdb_save.cc:1406] io error system:28
W20240402 07:15:20.305893     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:16:21.213469     8 rdb_save.cc:1406] io error system:28
W20240402 07:16:21.406510     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:17:17.004458     8 rdb_save.cc:1406] io error system:28
W20240402 07:17:17.214521     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:18:19.917520     8 rdb_save.cc:1406] io error system:28
W20240402 07:18:20.219050     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:19:20.011533     8 rdb_save.cc:1406] io error system:28
W20240402 07:19:20.309616     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:20:20.009765     8 rdb_save.cc:1406] io error system:28
W20240402 07:20:20.310956     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:21:18.818560     8 rdb_save.cc:1406] io error system:28
W20240402 07:21:19.203550     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:22:20.719043     8 rdb_save.cc:1406] io error system:28
W20240402 07:22:21.209784     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:23:21.320616     8 rdb_save.cc:1406] io error system:28
W20240402 07:23:21.611627     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:24:21.321049     8 rdb_save.cc:1406] io error system:28
W20240402 07:24:21.622252     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:25:20.007305     8 rdb_save.cc:1406] io error system:28
W20240402 07:25:20.309065     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:26:19.608237     8 rdb_save.cc:1406] io error system:28
W20240402 07:26:19.921449     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:27:20.921967     8 rdb_save.cc:1406] io error system:28
W20240402 07:27:21.305016     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:28:21.204382     8 rdb_save.cc:1406] io error system:28
W20240402 07:28:21.512039     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:29:20.717458     8 rdb_save.cc:1406] io error system:28
W20240402 07:29:21.213011     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:30:21.419281     8 rdb_save.cc:1406] io error system:28
W20240402 07:30:21.610126     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:31:19.925829     8 rdb_save.cc:1406] io error system:28
W20240402 07:31:20.208326     8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:32:20.812709     8 rdb_save.cc:1406] io error system:28
W20240402 07:32:21.014976     8 server_family.cc:918] Failed to perform snapshot No space left on device
^CI20240402 07:33:16.215328     8 accept_server.cc:24] Exiting on signal Interrupt
E20240402 07:33:16.218310     8 rdb_save.cc:1406] io error system:28
I20240402 07:33:16.221161     8 listener_interface.cc:222] Listener stopped for port 9999
W20240402 07:33:16.413079     8 server_family.cc:918] Failed to perform snapshot No space left on device
I20240402 07:33:16.413216     8 listener_interface.cc:222] Listener stopped for port 6379
E20240402 07:33:35.016315     8 rdb_save.cc:1406] io error system:28
W20240402 07:33:35.311447     8 server_family.cc:784] Failed to perform snapshot No space left on device

used_memory_peak_rss:668233728

My assumption is that due to dragonfly running in cache mode, it should never go OOM - so no crash should happen. If it was not running in cache mode, i would expect it to write out OOM responses if i write data to it when current data + incoming data > maxmemory and also not crash.

chakaz commented 7 months ago

@applike-ss we plan to take into account the additional memory of serializing to file (and also for new replica connecting in full sync stage), and perhaps fail / slow down these actions when memory is tight. Also, we plan to migrate from using use-memory (the memory used only by the data) and start using rss (actual memory usage). Both actions will take at the very least a few weeks, likely more, so in the meanwhile you could add some memory buffer for these operations specifically.

applike-ss commented 4 months ago

With the current version (docker.dragonflydb.io/dragonflydb/dragonfly:v1.20.1), we have to use 430Mi when using max-memory of 320M and doing snapshots every 5 minutes and pushing 1mb data to redis all the time in cache-mode.

Still waiting on dragonfly to exit gracefully when this kind of issue happens. From the logs i don't see it shut down when crashing due to OOM situation and no error log for memory allocation or similar. At least a message would be nice that let's us easier see from logs why it crashed.

romange commented 4 months ago

Unfortunately based on the information we have here, we can not create any constructive action items here. We need to have a way to reproduce it before we try fixing it 🤷🏼

applike-ss commented 4 months ago

I did provide a demo already in the first message of this issue, however here's another one for k8s.

Here's the steps for reproduction:

Create K3D Cluster with port forwarding:

k3d cluster create -p 127.0.0.1:6379:6379

Install the dragonfly operator:

kubectl apply -f https://raw.githubusercontent.com/dragonflydb/dragonfly-operator/main/manifests/dragonfly-operator.yaml

CR:

apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  labels:
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/instance: dragonfly-sample
    app.kubernetes.io/part-of: dragonfly-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: dragonfly-operator
  name: dragonfly
spec:
  replicas: 3
  args:
    - --maxmemory=256M
    - --alsologtostderr
    - --primary_port_http_enabled=false
    - --admin_port=9999
    - --admin_nopass
    - --cache_mode
    - --primary_port_http_enabled=true
    - --cluster_mode=emulated
  snapshot:
    cron: "*/5 * * * *"
    persistentVolumeClaimSpec:
      resources:
        requests:
          storage: 1Gi
      accessModes:
        - ReadWriteOnce
  resources:
    requests:
      cpu: 500m
      memory: 320Mi
    limits:
      cpu: 500m
      memory: 320Mi
  serviceSpec:
    type: LoadBalancer
applike-ss commented 2 months ago

I just re-checked it with v1.22.1 and the problem still exists.