Enapter / charts

Enapter Helm Charts

cleanupTempfiles.minutes - default value #61

Open air3ijai opened 1 year ago

air3ijai commented 1 year ago

Hello,

We just ran a test of how the Pod handles multiple restarts during backups.

  1. At some point, snapshot creation may be started and then interrupted
  2. As a result, we may be left with an unfinished temporary backup file:
    drwxr-xr-x. 1 root root          56 Jan  4 11:04 ..
    -rw-r--r--. 1 root root 20547669028 Jan  4 10:02 dump.rdb
    -rw-r--r--. 1 root root  5188599808 Jan  4 11:03 temp-1-3.rdb
    -rw-r--r--. 1 root root  1432674655 Jan  4 10:46 temp-1-9.rdb
    -rw-r--r--. 1 root root  1078273848 Jan  4 10:46 temp-2086607563.1.rdb
    -rw-r--r--. 1 root root           0 Jan  4 11:06 temp-2088105784.1.rdb
  3. On the next start, KeyDB will load the data and then start to sync from the master
  4. After the sync it will perform a new backup
  5. This backup can also be interrupted, leaving yet another temp file

Doing this in a loop, we may run out of disk space. It is certainly a corner case.

The current default for cleanupTempfiles.minutes is 60 minutes, so it will not clean up temp files from crashes that happened just a few minutes earlier.
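
A shorter retention would presumably mitigate this. A minimal sketch of a values override, assuming cleanupTempfiles.minutes maps onto a nested key in the chart's values.yaml exactly as the dotted name suggests (not verified against the chart):

cleanupTempfiles:
  minutes: 5  # assumed mapping; drop leftover temp-*.rdb files older than 5 minutes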

What is the main reason for such a large default?

For the Bitnami Redis chart we use the following:

master:
  preExecCmds: "rm -rf /data/temp*.*"

So we delete all temporary files right before Redis starts.
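
For comparison, the same hook could express the time-based policy that cleanupTempfiles.minutes implies (a sketch only; the actual command run by the Enapter chart is an assumption, not taken from its templates):

master:
  preExecCmds: "find /data -name 'temp-*.rdb' -mmin +60 -delete"

Unlike the unconditional rm above, this keeps any temp file younger than 60 minutes, which is why files from a crash a few minutes before a restart would survive.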

air3ijai commented 1 year ago

Got the issue today on Dev, after a number of problems we experienced yesterday in the Kubernetes cluster.

Dumps

drwxr-xr-x. 2 root root         102 Jan 11 15:38 .
drwxr-xr-x. 1 root root          56 Jan 10 14:53 ..
-rw-r--r--. 1 root root 20553313533 Jan 10 07:13 dump.rdb
-rw-r--r--. 1 root root  2190929920 Jan 10 07:43 temp--1701050988.1.rdb
-rw-r--r--. 1 root root  1806061732 Jan 11 15:38 temp-324797-0.rdb
-rw-r--r--. 1 root root  2700424307 Jan 10 07:43 temp-652292-0.rdb

Save error loop

1:319:S 11 Jan 2023 15:35:47.933 * Replica 192.168.10.10:6379 asks for synchronization
1:319:S 11 Jan 2023 15:35:47.933 * Full resync requested by replica 192.168.10.10:6379
1:319:S 11 Jan 2023 15:35:47.933 * Starting BGSAVE for SYNC with target: disk
1:319:S 11 Jan 2023 15:35:48.105 * Background saving started by pid 324179
1:319:S 11 Jan 2023 15:35:48.105 * Background saving started
324179:319:C 11 Jan 2023 15:38:35.454 # Write error saving DB on disk: No space left on device
1:319:S 11 Jan 2023 15:38:36.601 # Background saving error
1:319:S 11 Jan 2023 15:38:36.601 # SYNC failed. BGSAVE child returned an error
1:319:S 11 Jan 2023 15:38:36.601 # Connection with replica 192.168.10.10:6379 lost.

1:319:S 11 Jan 2023 15:38:36.783 * Replica 192.168.10.10:6379 asks for synchronization
1:319:S 11 Jan 2023 15:38:36.783 * Full resync requested by replica 192.168.10.10:6379
1:319:S 11 Jan 2023 15:38:36.783 * Starting BGSAVE for SYNC with target: disk
1:319:S 11 Jan 2023 15:38:36.956 * Background saving started by pid 324797
1:319:S 11 Jan 2023 15:38:36.956 * Background saving started
324797:319:C 11 Jan 2023 15:41:19.609 # Write error saving DB on disk: No space left on device
1:319:S 11 Jan 2023 15:41:20.887 # Background saving error
1:319:S 11 Jan 2023 15:41:20.887 # SYNC failed. BGSAVE child returned an error
1:319:S 11 Jan 2023 15:41:20.887 # Connection with replica 192.168.10.10:6379 lost.