applike-ss opened 8 months ago
Removing the `dir` and `snapshot_cron` parameters, by the way, does not show this behaviour. I have had it running without these parameters, pumping data into it, for more than 6 hours already.
I could imagine a memory-related issue that makes containerd simply kill the container, e.g. creating the snapshot plus the in-memory KV data together running into the limit I set for the container (320Mi).
You can easily verify it @applike-ss by raising the limit and checking `used_memory_peak_rss` with the `INFO MEMORY` command.
Is there anything we can help with here @applike-ss?
@chakaz probably a good use-case to focus on when you switch to working on RSS instrumentation. Not urgent though.
> You can easily verify it @applike-ss by raising the limit and checking `used_memory_peak_rss` with the `INFO MEMORY` command.
I'm doing that right now. I assume you want me to watch `redis-cli info memory | grep used_memory_peak_rss`, correct? Otherwise let me know how I'm supposed to test here.
I will post results once I have them.
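For reference, a minimal loop to keep an eye on the peak while the pump runs (host, port, and the one-second interval are assumptions; adjust to wherever Dragonfly is reachable):

```sh
# Print Dragonfly's peak RSS once per second while data is being pumped in.
while true; do
  redis-cli -h 127.0.0.1 -p 6379 info memory | grep used_memory_peak_rss
  sleep 1
done
```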
Sorry, was on vacation and couldn't check back here.
Observed memory usages:
Limited to 320Mi (256Mi maxmemory, crashing). Output:

```
I20240402 06:52:46.653187 1 dfly_main.cc:627] Starting dragonfly df-e8650ed-e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f
W20240402 06:52:46.653295 1 dfly_main.cc:666] SWAP is enabled. Consider disabling it when running Dragonfly.
I20240402 06:52:46.653316 1 dfly_main.cc:671] maxmemory has not been specified. Deciding myself....
I20240402 06:52:46.653319 1 dfly_main.cc:680] Found 320.00MiB available memory. Setting maxmemory to 256.00MiB
I20240402 06:52:46.654233 8 uring_proactor.cc:172] IORing with 1024 entries, allocated 102464 bytes, cq_entries is 2048
I20240402 06:52:46.654599 1 proactor_pool.cc:146] Running 1 io threads
I20240402 06:52:46.720907 1 dfly_main.cc:258] Listening on admin socket any:9999
I20240402 06:52:46.721390 1 server_family.cc:665] Host OS: Linux 6.6.16-linuxkit aarch64 with 1 threads
I20240402 06:52:46.721801 1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
W20240402 06:52:46.721869 1 server_family.cc:758] Load snapshot: No snapshot found
I20240402 06:52:46.722051 8 listener_interface.cc:122] sock[5] AcceptServer - listening on port 9999
I20240402 06:52:46.722074 8 listener_interface.cc:122] sock[6] AcceptServer - listening on port 6379
I20240402 06:53:00.228514 8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:53:00-summary.dfs" finished after 0 us
I20240402 06:54:02.930020 8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:54:00-summary.dfs" finished after 2 s
used_memory_peak_rss:320741376
```
Limited to 640Mi (512Mi maxmemory, not crashing). Output:

```
I20240402 06:56:57.617758 1 init.cc:70] dragonfly running in opt mode.
* Logs will be written to the first available of the following paths:
/tmp/dragonfly.*
./dragonfly.*
* For the available flags type dragonfly [--help | --helpfull]
* Documentation can be found at: https://www.dragonflydb.io/docs
I20240402 06:56:57.617822 1 dfly_main.cc:627] Starting dragonfly df-e8650ed-e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f
W20240402 06:56:57.617899 1 dfly_main.cc:666] SWAP is enabled. Consider disabling it when running Dragonfly.
I20240402 06:56:57.617913 1 dfly_main.cc:671] maxmemory has not been specified. Deciding myself....
I20240402 06:56:57.617915 1 dfly_main.cc:680] Found 640.00MiB available memory. Setting maxmemory to 512.00MiB
I20240402 06:56:57.618276 8 uring_proactor.cc:172] IORing with 1024 entries, allocated 102464 bytes, cq_entries is 2048
I20240402 06:56:57.618389 1 proactor_pool.cc:146] Running 1 io threads
I20240402 06:56:57.619359 1 dfly_main.cc:258] Listening on admin socket any:9999
I20240402 06:56:57.619737 1 server_family.cc:665] Host OS: Linux 6.6.16-linuxkit aarch64 with 1 threads
I20240402 06:56:57.619998 1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
W20240402 06:56:57.620031 1 server_family.cc:758] Load snapshot: No snapshot found
I20240402 06:56:57.620177 8 listener_interface.cc:122] sock[5] AcceptServer - listening on port 9999
I20240402 06:56:57.620189 8 listener_interface.cc:122] sock[6] AcceptServer - listening on port 6379
I20240402 06:57:00.002776 8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:57:00-summary.dfs" finished after 0 us
I20240402 06:58:01.918434 8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:58:00-summary.dfs" finished after 1 s
I20240402 06:59:04.121477 8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:59:00-summary.dfs" finished after 4 s
E20240402 07:00:09.115413 8 rdb_save.cc:1406] io error system:28
W20240402 07:00:09.325433 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:01:09.622743 8 rdb_save.cc:1406] io error system:28
W20240402 07:01:09.825254 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:02:13.933493 8 rdb_save.cc:1406] io error system:28
W20240402 07:02:14.120755 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:03:17.126403 8 rdb_save.cc:1406] io error system:28
W20240402 07:03:17.223644 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:04:19.825512 8 rdb_save.cc:1406] io error system:28
W20240402 07:04:20.023314 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:05:16.028622 8 rdb_save.cc:1406] io error system:28
W20240402 07:05:16.226780 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:06:20.922616 8 rdb_save.cc:1406] io error system:28
W20240402 07:06:21.125222 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:07:19.232414 8 rdb_save.cc:1406] io error system:28
W20240402 07:07:19.429133 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:08:19.621541 8 rdb_save.cc:1406] io error system:28
W20240402 07:08:19.833389 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:09:15.229292 8 rdb_save.cc:1406] io error system:28
W20240402 07:09:15.426709 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:10:18.745149 8 rdb_save.cc:1406] io error system:28
W20240402 07:10:19.034804 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:11:20.337744 8 rdb_save.cc:1406] io error system:28
W20240402 07:11:20.622592 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:12:20.823269 8 rdb_save.cc:1406] io error system:28
W20240402 07:12:21.012964 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:13:20.706354 8 rdb_save.cc:1406] io error system:28
W20240402 07:13:20.903396 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:14:19.709885 8 rdb_save.cc:1406] io error system:28
W20240402 07:14:20.008428 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:15:19.915210 8 rdb_save.cc:1406] io error system:28
W20240402 07:15:20.305893 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:16:21.213469 8 rdb_save.cc:1406] io error system:28
W20240402 07:16:21.406510 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:17:17.004458 8 rdb_save.cc:1406] io error system:28
W20240402 07:17:17.214521 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:18:19.917520 8 rdb_save.cc:1406] io error system:28
W20240402 07:18:20.219050 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:19:20.011533 8 rdb_save.cc:1406] io error system:28
W20240402 07:19:20.309616 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:20:20.009765 8 rdb_save.cc:1406] io error system:28
W20240402 07:20:20.310956 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:21:18.818560 8 rdb_save.cc:1406] io error system:28
W20240402 07:21:19.203550 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:22:20.719043 8 rdb_save.cc:1406] io error system:28
W20240402 07:22:21.209784 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:23:21.320616 8 rdb_save.cc:1406] io error system:28
W20240402 07:23:21.611627 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:24:21.321049 8 rdb_save.cc:1406] io error system:28
W20240402 07:24:21.622252 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:25:20.007305 8 rdb_save.cc:1406] io error system:28
W20240402 07:25:20.309065 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:26:19.608237 8 rdb_save.cc:1406] io error system:28
W20240402 07:26:19.921449 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:27:20.921967 8 rdb_save.cc:1406] io error system:28
W20240402 07:27:21.305016 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:28:21.204382 8 rdb_save.cc:1406] io error system:28
W20240402 07:28:21.512039 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:29:20.717458 8 rdb_save.cc:1406] io error system:28
W20240402 07:29:21.213011 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:30:21.419281 8 rdb_save.cc:1406] io error system:28
W20240402 07:30:21.610126 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:31:19.925829 8 rdb_save.cc:1406] io error system:28
W20240402 07:31:20.208326 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:32:20.812709 8 rdb_save.cc:1406] io error system:28
W20240402 07:32:21.014976 8 server_family.cc:918] Failed to perform snapshot No space left on device
^CI20240402 07:33:16.215328 8 accept_server.cc:24] Exiting on signal Interrupt
E20240402 07:33:16.218310 8 rdb_save.cc:1406] io error system:28
I20240402 07:33:16.221161 8 listener_interface.cc:222] Listener stopped for port 9999
W20240402 07:33:16.413079 8 server_family.cc:918] Failed to perform snapshot No space left on device
I20240402 07:33:16.413216 8 listener_interface.cc:222] Listener stopped for port 6379
E20240402 07:33:35.016315 8 rdb_save.cc:1406] io error system:28
W20240402 07:33:35.311447 8 server_family.cc:784] Failed to perform snapshot No space left on device
used_memory_peak_rss:668233728
```
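Side note on the second run: `io error system:28` is errno 28 (ENOSPC), which matches the "No space left on device" warnings, i.e. the 1Gi snapshot PVC filled up. One way to confirm from outside the process (the pod name is illustrative; list the pods first to find the real one):

```sh
# Check how full the snapshot volume is inside the Dragonfly pod.
kubectl get pods
kubectl exec dragonfly-0 -- df -h /dragonfly/snapshots
```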
My assumption is that, since Dragonfly is running in cache mode, it should never go OOM, so no crash should happen. If it were not running in cache mode, I would expect it to return OOM errors when I write data to it while current data + incoming data > maxmemory, and likewise not crash.
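For contrast, this is what a rejected write looks like against a full instance without cache mode; the error text shown is Redis's wording, and Dragonfly's may differ slightly:

```sh
$ redis-cli set foo bar
(error) OOM command not allowed when used memory > 'maxmemory'.
```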
@applike-ss we plan to take into account the additional memory needed for serializing to a file (and also for a new replica connecting in the full-sync stage), and perhaps fail or slow down these actions when memory is tight. We also plan to migrate from used-memory (the memory used only by the data) to RSS (actual memory usage). Both actions will take at the very least a few weeks, likely more, so in the meanwhile you could add some memory buffer for these operations specifically.
With the current version (docker.dragonflydb.io/dragonflydb/dragonfly:v1.20.1), we have to set a 430Mi limit when using a maxmemory of 320M, taking snapshots every 5 minutes, and continuously pushing 1MB of data in cache mode.
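A standalone sketch of that headroom outside Kubernetes, using the figures from this comment (the flag spellings mirror the CR below; treat the exact values as illustrative):

```sh
# A ~430Mi container limit against a 320M maxmemory leaves room for the
# snapshot-serialization overhead described above.
docker run --rm --memory=430m \
  docker.dragonflydb.io/dragonflydb/dragonfly:v1.20.1 \
  dragonfly --alsologtostderr --maxmemory=320M --cache_mode
```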
I'm still waiting for Dragonfly to exit gracefully when this kind of issue happens. From the logs I don't see it shut down when crashing due to an OOM situation, and there is no error log about a failed memory allocation or similar. At least a message would be nice that lets us see more easily from the logs why it crashed.
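When containerd kills the process, the reason is recorded in the pod status rather than in Dragonfly's own log, so until such a message exists it can be pulled from Kubernetes instead (pod name is illustrative):

```sh
# An OOM-killed container shows Reason: OOMKilled in its last state.
kubectl describe pod dragonfly-0 | grep -i -A 3 "last state"
# Or query the field directly:
kubectl get pod dragonfly-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```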
Unfortunately, based on the information we have here, we cannot create any constructive action items. We need a way to reproduce it before we try fixing it 🤷🏼
I did provide a demo already in the first message of this issue; however, here's another one for k8s.
Here are the steps for reproduction:

1. Create a k3d cluster with port forwarding:

   ```sh
   k3d cluster create -p 127.0.0.1:6379:6379
   ```

2. Install the dragonfly operator:

   ```sh
   kubectl apply -f https://raw.githubusercontent.com/dragonflydb/dragonfly-operator/main/manifests/dragonfly-operator.yaml
   ```

3. Apply the CR below.

4. Run the pump script from the first message against the forwarded port:

   ```sh
   ./redis-pump.sh 127.0.0.1 6379 1000000
   ```
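Before applying the CR (step 3), it's worth waiting for the operator pod to become ready; the namespace name here is an assumption based on the operator's manifest, so adjust it if yours differs:

```sh
# The operator must be Running before it can reconcile the Dragonfly CR.
kubectl -n dragonfly-operator-system get pods
```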
CR:
```yaml
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  labels:
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/instance: dragonfly-sample
    app.kubernetes.io/part-of: dragonfly-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: dragonfly-operator
  name: dragonfly
spec:
  replicas: 3
  args:
    - --maxmemory=256M
    - --alsologtostderr
    - --primary_port_http_enabled=false
    - --admin_port=9999
    - --admin_nopass
    - --cache_mode
    - --primary_port_http_enabled=true
    - --cluster_mode=emulated
  snapshot:
    cron: "*/5 * * * *"
    persistentVolumeClaimSpec:
      resources:
        requests:
          storage: 1Gi
      accessModes:
        - ReadWriteOnce
  resources:
    requests:
      cpu: 500m
      memory: 320Mi
    limits:
      cpu: 500m
      memory: 320Mi
  serviceSpec:
    type: LoadBalancer
```
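To apply it, save the CR to a file and let the operator create the pods (the file name is arbitrary, and the `dragonfly-0..2` pod names assume the operator's usual statefulset-style naming):

```sh
kubectl apply -f dragonfly-sample.yaml
kubectl get pods -w   # wait for the three dragonfly pods to become Ready
```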
I just re-checked it with v1.22.1 and the problem still exists.
Describe the bug
I have discovered crashes of Dragonfly while testing cache mode, to ensure it is stable before using it in prod workloads. It seems to crash at a random point while lots of data is being put into it.
To Reproduce
Steps to reproduce the behavior:

Generate a chunk of test data (100KiB of random bytes, roughly 136KB after base64):

```sh
dd if=/dev/urandom bs=1024 count=100 | base64 > data.txt
```

`redis-pump.sh` (put it in the same folder as `data.txt`):

```sh
#!/bin/sh
CLUSTER_HOSTNAME=$1
MAX_ITERATIONS=$2

i=0

if [ "" = "${MAX_ITERATIONS}" ]; then
  echo "pushing as much data as possible to ${CLUSTER_HOSTNAME}"
else
  echo "pushing ${MAX_ITERATIONS} chunks of data to ${CLUSTER_HOSTNAME}"
fi

DATA="$(cat data.txt)"

while true; do
  i=$(expr $i + 1)
  # NOTE: the loop body was garbled in the original post; the SET call and
  # the iteration check below are a reconstruction of the apparent intent.
  redis-cli -h "${CLUSTER_HOSTNAME}" set "key-${i}" "${DATA}" > /dev/null
  if [ -n "${MAX_ITERATIONS}" ] && [ "${i}" -ge "${MAX_ITERATIONS}" ]; then
    break
  fi
done
```
```
I20240318 14:37:46.707206 1 init.cc:70] dragonfly running in opt mode.
```
Is there any additional information I can provide, or an option to let Dragonfly run with debug logging to gather more useful information?