Enapter Helm Charts

Scaling up/down - Downtime/Missing data #60

Closed - air3ijai closed this issue 1 year ago

air3ijai commented 1 year ago

Hello,

We just tried your Helm chart and it works fine. We also did some tests to check how it handles scaling up/down events.

Methodology

  1. Create Multi-Master setup with 2 nodes
  2. Run GET <KEY> continuously
  3. Scale up pods
  4. Scale down pods

Test script

while true; do
  reply=$(keydb-cli -h keydb-scaling.default.svc.cluster.local -p 6379 GET test)
  echo "`date +"%T.%3N"` - $reply"
  sleep 0.2
done

Scale up 2 --> 3

11:01:30.116 - test value
11:01:30.428 - 
11:01:30.656 - test value

Scale down 3 --> 2

11:13:11.826 - test value
11:13:12.056 - 
11:13:12.278 - test value

We see that we sometimes get an empty reply, probably because the data is not yet replicated while the pod has already been added to the Service.

Is there any way to improve that?

Thank you!

Antiarchitect commented 1 year ago

@air3ijai Could you provide the results of sending the PING command via keydb-cli during the scaling process:

while true; do
  reply=$(keydb-cli -h keydb-scaling.default.svc.cluster.local -p 6379 PING)
  echo "`date +"%T.%3N"` - $reply"
  sleep 0.2
done

If there is some response other than PONG I could improve the readiness probe.

air3ijai commented 1 year ago

We performed a slightly different test, but it probably covers the initial concern as well:

  1. We send 3 commands: PING, GET test, INFO KEYSPACE
  2. sessionAffinity is set to None
  3. Restart the second KeyDB master

Script

while true; do
  reply=$(keydb-cli -h keydb-scaling.default.svc.cluster.local -p 6379 <<<$'PING\nGET test\nINFO KEYSPACE' 2>/dev/null)
  echo "`date +"%T.%3N"` - $reply"
  sleep 0.1
done

Results

11:48:11.628 - db0:keys=1994658 - test value
11:48:11.791 - db0:keys=470738  - LOADING KeyDB is loading the dataset in memory
11:48:11.919 - db0:keys=1994658 - test value
11:48:12.047 - db0:keys=1994658 - test value
11:48:12.200 - db0:keys=470738  - LOADING KeyDB is loading the dataset in memory
11:48:12.354 - db0:keys=470738  - LOADING KeyDB is loading the dataset in memory
11:48:12.481 - db0:keys=1994658 - test value
11:48:14.532 - db0:keys=474243  - LOADING KeyDB is loading the dataset in memory
Details

**Second Master**

```
1:1:S 04 Jan 2023 11:47:50.376 # Server initialized
1:1:S 04 Jan 2023 11:47:50.377 * RedisBloom version 2.2.18 (Git=8b6ee3b)
1:1:S 04 Jan 2023 11:47:50.377 * Module 'bf' loaded from /usr/lib/redis/modules/redisbloom.so
1:1:S 04 Jan 2023 11:47:50.378 * Loading RDB produced by version 6.3.1
1:1:S 04 Jan 2023 11:47:50.378 * RDB age 986 seconds
1:1:S 04 Jan 2023 11:47:50.378 * RDB memory usage when created 23260.60 Mb
1:1:S 04 Jan 2023 11:49:39.837 # Done loading RDB, keys loaded: 0, keys expired: 0.
1:1:S 04 Jan 2023 11:49:39.837 * DB loaded from disk: 109.459 seconds
1:1:S 04 Jan 2023 11:49:39.837 * RDB contains information on 1 masters
1:1:S 04 Jan 2023 11:49:39.837 * Cached master recovered from RDB for keydb-scaling-0.keydb-scaling-headless:6379
1:366:S 04 Jan 2023 11:49:39.837 * Thread 0 alive.
1:367:S 04 Jan 2023 11:49:39.837 * Thread 1 alive.
1:366:S 04 Jan 2023 11:49:39.838 * Connecting to MASTER keydb-scaling-0.keydb-scaling-headless:6379
1:366:S 04 Jan 2023 11:49:39.839 * MASTER <-> REPLICA sync started
```

**Check**

```
11:48:11.628 - PONG
test value
# Keyspace
db0:keys=1994658,expires=70008,avg_ttl=9221956221810000896,cached_keys=1994658
11:48:11.791 - PONG
LOADING KeyDB is loading the dataset in memory
# Keyspace
db0:keys=470738,expires=0,avg_ttl=0,cached_keys=470738
11:48:11.919 - PONG
test value
# Keyspace
db0:keys=1994658,expires=70008,avg_ttl=9221956221810000896,cached_keys=1994658
11:48:12.047 - PONG
test value
# Keyspace
db0:keys=1994658,expires=70008,avg_ttl=9221956221810000896,cached_keys=1994658
11:48:12.200 - PONG
LOADING KeyDB is loading the dataset in memory
# Keyspace
db0:keys=470738,expires=0,avg_ttl=0,cached_keys=470738
11:48:12.354 - PONG
LOADING KeyDB is loading the dataset in memory
# Keyspace
db0:keys=470738,expires=0,avg_ttl=0,cached_keys=470738
11:48:12.481 - PONG
test value
# Keyspace
db0:keys=1994658,expires=70008,avg_ttl=9221956221810000896,cached_keys=1994658
11:48:14.532 - PONG
LOADING KeyDB is loading the dataset in memory
# Keyspace
db0:keys=474243,expires=0,avg_ttl=0,cached_keys=474243
```

Conclusions

  1. After a restart, once the readiness probe passes, the second master is added to the Service and starts to receive traffic
  2. During startup, the second master loads its local backup and then starts to sync from the first master
  3. During loading (?) and sync, KeyDB returns empty replies because the data is missing, but it still replies to PING (the loading phase can be checked directly, see below)
  4. If sessionAffinity is enabled, we may stick for some time to a node that does not return the data
  5. With big datasets this period may be long - in our case 20 GB ~ 2 minutes
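
For reference, the loading phase itself can be checked directly on the restarted pod. A minimal check; the pod hostname below is an assumption based on the headless service name in the log above:

```bash
# Ask the restarted pod directly (not the load-balanced Service) whether it is
# still loading its RDB; INFO persistence exposes this as the "loading" flag.
keydb-cli -h keydb-scaling-1.keydb-scaling-headless -p 6379 INFO persistence | grep '^loading:'
# loading:1 -> still reading the dataset, GET replies with the LOADING error
# loading:0 -> dataset available
```
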
Antiarchitect commented 1 year ago

Here is the default readiness probe script:

#!/bin/bash

set -e

[[ -n "${REDIS_PASSWORD}" ]] && export REDISCLI_AUTH="${REDIS_PASSWORD}"
response="$(
  timeout -s 3 "${1}" \
  keydb-cli \
    -h localhost \
    -p "${REDIS_PORT}" \
    ping
)"
if [ "${response}" != "PONG" ]; then
  echo "${response}"
  exit 1
fi

and the liveness one:

#!/bin/bash

set -e

[[ -n "${REDIS_PASSWORD}" ]] && export REDISCLI_AUTH="${REDIS_PASSWORD}"
response="$(
  timeout -s 3 "${1}" \
  keydb-cli \
    -h localhost \
    -p "${REDIS_PORT}" \
    ping
)"
if [ "${response}" != "PONG" ] && [[ ! "${response}" =~ ^.*LOADING.*$ ]]; then
  echo "${response}"
  exit 1
fi

As you can see, the LOADING ... response is only accepted by the liveness probe, not by the readiness probe. So while KeyDB is loading data, the pod should be not ready and the Service should not route requests to it. Are you sure you are using the latest chart version and the default startup, liveness, and readiness probes?
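
For what it's worth, you can also check which probe scripts are actually mounted into a pod and compare them with the defaults above; a rough check (the pod name and the /health mount path are assumptions):

```bash
# Dump the readiness script the pod is really running and compare it by eye
# against the chart default quoted above.
kubectl exec -n default keydb-0 -- cat /health/ping_readiness_local.sh
```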

Antiarchitect commented 1 year ago

The only thing that looks suspicious to me is the default success and failure thresholds. I will check that and try to reproduce your issue.

air3ijai commented 1 year ago

> Are you sure you are using the latest chart version and the default startup, liveness, and readiness probes?

As far as I can see in the installation manifest, we didn't touch the probes and we use the latest chart version:

helm list -n default | grep keydb

NAME     NAMESPACE    REVISION    UPDATED                                STATUS      CHART           APP VERSION
keydb    default      3           2022-12-27 22:51:47.86499 +0200 EET    deployed    keydb-0.43.1    6.3.1
kubectl describe statefulset -n default keydb

Pod Template:
  Labels:       app.kubernetes.io/instance=keydb
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=keydb
                app.kubernetes.io/version=6.3.1
                helm.sh/chart=keydb-0.43.1

    Liveness:   exec [sh -c /health/ping_liveness_local.sh 5] delay=20s timeout=6s period=5s #success=1 #failure=5
    Readiness:  exec [sh -c /health/ping_readiness_local.sh 1] delay=20s timeout=2s period=5s #success=1 #failure=5
    Startup:    exec [sh -c /health/ping_readiness_local.sh 1] delay=0s timeout=2s period=5s #success=1 #failure=24

And default values - https://github.com/Enapter/charts/blob/master/keydb/values.yaml#L98-L132

# Liveness Probe
livenessProbe:
  initialDelaySeconds: 20
  periodSeconds: 5
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5

# Readiness Probe
readinessProbe:
  initialDelaySeconds: 20
  periodSeconds: 5
  timeoutSeconds: 1
  successThreshold: 1
  failureThreshold: 5

# Startup Probe
startupProbe:
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 24
air3ijai commented 1 year ago

@Antiarchitect, I observed the following while testing the liveness probe:

  1. When we do just PING during load, we receive the reply PONG
  2. When we do GET test during load, we receive the reply (error) LOADING KeyDB is loading the dataset in memory

So a probe based on PING does not help us detect the LOADING state; we should perform something like GET <KEY> instead. Note that the exit code is 0 in both cases:

# Loading state
keydb-cli get non-existing-key
(error) LOADING KeyDB is loading the dataset in memory

echo $?
0

# Ready state
keydb-cli get non-existing-key
(nil)

echo $?
0
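
A readiness check built on this observation would have to inspect the reply text rather than the exit code. A minimal sketch in the style of the chart scripts above; the key name is arbitrary and this is not the chart's actual probe:

```bash
#!/bin/bash

set -e

[[ -n "${REDIS_PASSWORD}" ]] && export REDISCLI_AUTH="${REDIS_PASSWORD}"

# GET a key that is not expected to exist; keydb-cli exits 0 even while the
# dataset is loading, so only the reply text distinguishes the two states.
response="$(
  timeout -s 3 "${1}" \
  keydb-cli \
    -h localhost \
    -p "${REDIS_PORT}" \
    GET readiness-probe-check
)"
if [[ "${response}" =~ LOADING ]]; then
  echo "${response}"
  exit 1
fi
```
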
Antiarchitect commented 1 year ago

I see. Sorry for the delay. I will check this when I have time :) Are you testing on the latest KeyDB 6.3.1, and could you please check whether the behavior of PING is the same on 6.2.2 too?

air3ijai commented 1 year ago

@Antiarchitect, another simple test using Docker Compose with v6.3.1 and v6.2.2.

We can see that both versions behave similarly, and we should account for that in our probes.

Docker compose

version: '3'
services:

  keydb:
    image: eqalpha/keydb:x86_64_v6.3.1
    # image: eqalpha/keydb:x86_64_v6.2.2
    container_name: keydb
    ports:
      - 6379:6379
    volumes:
      - ./data:/data

Disable snapshots

keydb-cli config set save ""

Populate the data

for key in {1..10}; do
  keydb-cli DEBUG POPULATE 10000 "test${key}" 100000
done

Save snapshot

keydb-cli bgsave

Check

while true; do
  reply=$(timeout 1 keydb-cli <<<$'PING\nGET test')
  echo "`date +"%T.%3N"` - $reply"
  sleep 1
done

Restart container

docker-compose down && docker-compose up -d

Results

19:47:38.276 - PONG
19:47:39.292 - PONG
19:47:40.310 - PONG
19:47:41.325 - PONG
19:47:42.340 - PONG
19:47:43.355 - PONG
19:47:45.363 -
19:47:47.370 -
19:47:49.377 -
19:47:51.384 -
Error: Connection reset by peer
19:47:52.395 -
19:47:54.402 -
19:47:56.409 - PONG
19:47:58.416 - PONG
19:48:00.413 - PONG
LOADING KeyDB is loading the dataset in memory
19:48:02.370 - PONG
LOADING KeyDB is loading the dataset in memory
19:48:04.296 - PONG
LOADING KeyDB is loading the dataset in memory
19:48:06.219 - PONG
LOADING KeyDB is loading the dataset in memory
19:48:08.140 - PONG
LOADING KeyDB is loading the dataset in memory
19:48:10.061 - PONG
LOADING KeyDB is loading the dataset in memory
19:48:11.986 - PONG
LOADING KeyDB is loading the dataset in memory
19:48:13.909 - PONG
LOADING KeyDB is loading the dataset in memory
19:48:15.410 - PONG
19:48:16.426 - PONG
19:48:17.444 - PONG
19:48:18.459 - PONG
19:48:19.473 - PONG
19:48:20.489 - PONG
Antiarchitect commented 1 year ago

Good news everyone! At last I've recreated this setup and improved the liveness and readiness probes based on the tests. The main point is that the readiness probe now does not PING but GETs a random UUID key (a non-existent key) and compares the result with "LOADING KeyDB is loading the dataset in memory", which is the loading response. As long as the loading string appears in the response, the pod counts as not ready. @air3ijai please test your case with the new 0.44.0 keydb chart version. If the problem still appears, I will reopen this issue.
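
For anyone testing this, upgrading the existing release from the earlier helm list output would look roughly like the following; the enapter repository alias is an assumption:

```bash
# Refresh the chart index and move the release to chart version 0.44.0,
# keeping the values already set on the release.
helm repo update
helm upgrade keydb enapter/keydb --version 0.44.0 --namespace default --reuse-values
```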

Antiarchitect commented 1 year ago

@mrsrvman PTAL. I know the original probes were implemented by you. Do you agree with the current solution?

P.S. I've looked inside the KeyDB sources, and the PING command seems to have the allow-loading internal flag, so it never reports the loading string. So I cleaned that up.