Open applike-ss opened 3 months ago
Removing dir
and snapshot_cron
parameters btw. does not show this behaviour.
I have it running without these parameters and pumping data into it for more than 6 hours already.
I could imagine a memory related issue that makes containerd just kill the container, e.g. creating the snapshot + the data that is in memory for storing the kv data runs into the limit that i set for the container (320Mi).
you can easily verify it @applike-ss by raising the limit and checking used_memory_peak_rss
with "INFO MEMORY" command
Is there anything we can help here @applike-ss ?
@chakaz probably a good use-case to focus on, when switch working on RSS instrumentation. Not urgent though.
you can easily verify it @applike-ss by raising the limit and checking
used_memory_peak_rss
with "INFO MEMORY" command
I'm doing that right now. I assume you want me to watch "redis-cli info memory | grep used_memory_peak_rss"
, correct? Otherwise let me know how i'm supposed to test here.
I will post results once i got them.
Is there anything we can help here @applike-ss ?
@chakaz probably a good use-case to focus on, when switch working on RSS instrumentation. Not urgent though.
Sorry, was on vacation and couldn't check back here.
Observed memory usages:
Limited to 320Mi (256Mi maxmemory, crashing): Output:
I20240402 06:52:46.653187 1 dfly_main.cc:627] Starting dragonfly df-e8650ed-e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f
W20240402 06:52:46.653295 1 dfly_main.cc:666] SWAP is enabled. Consider disabling it when running Dragonfly.
I20240402 06:52:46.653316 1 dfly_main.cc:671] maxmemory has not been specified. Deciding myself....
I20240402 06:52:46.653319 1 dfly_main.cc:680] Found 320.00MiB available memory. Setting maxmemory to 256.00MiB
I20240402 06:52:46.654233 8 uring_proactor.cc:172] IORing with 1024 entries, allocated 102464 bytes, cq_entries is 2048
I20240402 06:52:46.654599 1 proactor_pool.cc:146] Running 1 io threads
I20240402 06:52:46.720907 1 dfly_main.cc:258] Listening on admin socket any:9999
I20240402 06:52:46.721390 1 server_family.cc:665] Host OS: Linux 6.6.16-linuxkit aarch64 with 1 threads
I20240402 06:52:46.721801 1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
W20240402 06:52:46.721869 1 server_family.cc:758] Load snapshot: No snapshot found
I20240402 06:52:46.722051 8 listener_interface.cc:122] sock[5] AcceptServer - listening on port 9999
I20240402 06:52:46.722074 8 listener_interface.cc:122] sock[6] AcceptServer - listening on port 6379
I20240402 06:53:00.228514 8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:53:00-summary.dfs" finished after 0 us
I20240402 06:54:02.930020 8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:54:00-summary.dfs" finished after 2 s
used_memory_peak_rss:320741376
Limited to 640Mi (512Mi maxmemory, not crashing): Output:
I20240402 06:56:57.617758 1 init.cc:70] dragonfly running in opt mode.
* Logs will be written to the first available of the following paths:
/tmp/dragonfly.*
./dragonfly.*
* For the available flags type dragonfly [--help | --helpfull]
* Documentation can be found at: https://www.dragonflydb.io/docs
I20240402 06:56:57.617822 1 dfly_main.cc:627] Starting dragonfly df-e8650ed-e8650ed2b4ebd550c966751dd33ebb1ac4f82b1f
W20240402 06:56:57.617899 1 dfly_main.cc:666] SWAP is enabled. Consider disabling it when running Dragonfly.
I20240402 06:56:57.617913 1 dfly_main.cc:671] maxmemory has not been specified. Deciding myself....
I20240402 06:56:57.617915 1 dfly_main.cc:680] Found 640.00MiB available memory. Setting maxmemory to 512.00MiB
I20240402 06:56:57.618276 8 uring_proactor.cc:172] IORing with 1024 entries, allocated 102464 bytes, cq_entries is 2048
I20240402 06:56:57.618389 1 proactor_pool.cc:146] Running 1 io threads
I20240402 06:56:57.619359 1 dfly_main.cc:258] Listening on admin socket any:9999
I20240402 06:56:57.619737 1 server_family.cc:665] Host OS: Linux 6.6.16-linuxkit aarch64 with 1 threads
I20240402 06:56:57.619998 1 snapshot_storage.cc:108] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
W20240402 06:56:57.620031 1 server_family.cc:758] Load snapshot: No snapshot found
I20240402 06:56:57.620177 8 listener_interface.cc:122] sock[5] AcceptServer - listening on port 9999
I20240402 06:56:57.620189 8 listener_interface.cc:122] sock[6] AcceptServer - listening on port 6379
I20240402 06:57:00.002776 8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:57:00-summary.dfs" finished after 0 us
I20240402 06:58:01.918434 8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:58:00-summary.dfs" finished after 1 s
I20240402 06:59:04.121477 8 save_stages_controller.cc:321] Saving "/dragonfly/snapshots/dump-2024-04-02T06:59:00-summary.dfs" finished after 4 s
E20240402 07:00:09.115413 8 rdb_save.cc:1406] io error system:28
W20240402 07:00:09.325433 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:01:09.622743 8 rdb_save.cc:1406] io error system:28
W20240402 07:01:09.825254 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:02:13.933493 8 rdb_save.cc:1406] io error system:28
W20240402 07:02:14.120755 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:03:17.126403 8 rdb_save.cc:1406] io error system:28
W20240402 07:03:17.223644 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:04:19.825512 8 rdb_save.cc:1406] io error system:28
W20240402 07:04:20.023314 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:05:16.028622 8 rdb_save.cc:1406] io error system:28
W20240402 07:05:16.226780 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:06:20.922616 8 rdb_save.cc:1406] io error system:28
W20240402 07:06:21.125222 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:07:19.232414 8 rdb_save.cc:1406] io error system:28
W20240402 07:07:19.429133 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:08:19.621541 8 rdb_save.cc:1406] io error system:28
W20240402 07:08:19.833389 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:09:15.229292 8 rdb_save.cc:1406] io error system:28
W20240402 07:09:15.426709 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:10:18.745149 8 rdb_save.cc:1406] io error system:28
W20240402 07:10:19.034804 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:11:20.337744 8 rdb_save.cc:1406] io error system:28
W20240402 07:11:20.622592 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:12:20.823269 8 rdb_save.cc:1406] io error system:28
W20240402 07:12:21.012964 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:13:20.706354 8 rdb_save.cc:1406] io error system:28
W20240402 07:13:20.903396 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:14:19.709885 8 rdb_save.cc:1406] io error system:28
W20240402 07:14:20.008428 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:15:19.915210 8 rdb_save.cc:1406] io error system:28
W20240402 07:15:20.305893 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:16:21.213469 8 rdb_save.cc:1406] io error system:28
W20240402 07:16:21.406510 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:17:17.004458 8 rdb_save.cc:1406] io error system:28
W20240402 07:17:17.214521 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:18:19.917520 8 rdb_save.cc:1406] io error system:28
W20240402 07:18:20.219050 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:19:20.011533 8 rdb_save.cc:1406] io error system:28
W20240402 07:19:20.309616 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:20:20.009765 8 rdb_save.cc:1406] io error system:28
W20240402 07:20:20.310956 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:21:18.818560 8 rdb_save.cc:1406] io error system:28
W20240402 07:21:19.203550 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:22:20.719043 8 rdb_save.cc:1406] io error system:28
W20240402 07:22:21.209784 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:23:21.320616 8 rdb_save.cc:1406] io error system:28
W20240402 07:23:21.611627 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:24:21.321049 8 rdb_save.cc:1406] io error system:28
W20240402 07:24:21.622252 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:25:20.007305 8 rdb_save.cc:1406] io error system:28
W20240402 07:25:20.309065 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:26:19.608237 8 rdb_save.cc:1406] io error system:28
W20240402 07:26:19.921449 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:27:20.921967 8 rdb_save.cc:1406] io error system:28
W20240402 07:27:21.305016 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:28:21.204382 8 rdb_save.cc:1406] io error system:28
W20240402 07:28:21.512039 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:29:20.717458 8 rdb_save.cc:1406] io error system:28
W20240402 07:29:21.213011 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:30:21.419281 8 rdb_save.cc:1406] io error system:28
W20240402 07:30:21.610126 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:31:19.925829 8 rdb_save.cc:1406] io error system:28
W20240402 07:31:20.208326 8 server_family.cc:918] Failed to perform snapshot No space left on device
E20240402 07:32:20.812709 8 rdb_save.cc:1406] io error system:28
W20240402 07:32:21.014976 8 server_family.cc:918] Failed to perform snapshot No space left on device
^CI20240402 07:33:16.215328 8 accept_server.cc:24] Exiting on signal Interrupt
E20240402 07:33:16.218310 8 rdb_save.cc:1406] io error system:28
I20240402 07:33:16.221161 8 listener_interface.cc:222] Listener stopped for port 9999
W20240402 07:33:16.413079 8 server_family.cc:918] Failed to perform snapshot No space left on device
I20240402 07:33:16.413216 8 listener_interface.cc:222] Listener stopped for port 6379
E20240402 07:33:35.016315 8 rdb_save.cc:1406] io error system:28
W20240402 07:33:35.311447 8 server_family.cc:784] Failed to perform snapshot No space left on device
used_memory_peak_rss:668233728
My assumption is that due to dragonfly running in cache mode, it should never go OOM - so no crash should happen. If it was not running in cache mode, i would expect it to write out OOM responses if i write data to it when current data + incoming data > maxmemory and also not crash.
@applike-ss we plan to take into account the additional memory of serializing to file (and also for new replica connecting in full sync stage), and perhaps fail / slow down these actions when memory is tight. Also, we plan to migrate from using use-memory (the memory used only by the data) and start using rss (actual memory usage). Both actions will take at the very least a few weeks, likely more, so in the meanwhile you could add some memory buffer for these operations specifically.
Describe the bug I have discovered crashes of dragonfly upon testing the cache mode to ensure it is stable before using it in prod workloads. It seems that it is crashing at a random point when putting lots of data
To Reproduce Steps to reproduce the behavior:
dd if=/dev/urandom bs=1024 count=100| base64 > data.txt
redis-pump.sh
(put same folder as data.txt):CLUSTER_HOSTNAME=$1 MAX_ITERATIONS=$2
i=0
if [ "" = "${MAX_ITERATIONS}" ]; then echo "pushing as much data as possible to ${CLUSTER_HOSTNAME}" else echo "pushing ${MAX_ITERATIONS} chunks of data to ${CLUSTER_HOSTNAME}" fi
DATA="$(cat data.txt)"
while true; do i=$(expr $i + 1)
done
I20240318 14:37:46.707206 1 init.cc:70] dragonfly running in opt mode.
redis-pump.sh
Environment (please complete the following information):
Reproducible Code Snippet
Is there any additional information i can provide or an option to let dragonfly run with debug logging for gathering more interesting information?