OT-CONTAINER-KIT / redis-operator

A golang based redis operator that will make/oversee Redis standalone/cluster/replication/sentinel mode setup on top of the Kubernetes.
https://ot-redis-operator.netlify.app/
Apache License 2.0
731 stars, 207 forks

Improving Reliability of statefulset RollingUpdate with Container Lifecycle Hooks #923

Open wkd-woo opened 1 month ago

wkd-woo commented 1 month ago

Is your feature request related to a problem? Please describe.

In scenarios such as a Redis version upgrade, where the desired state of a StatefulSet changes, the StatefulSet's updateStrategy makes the Pods go through a RollingUpdate.

Assuming a 3-member replication setup, there is a risk of data loss if a pod goes down, even momentarily, before a synced replica has been secured, because the operator does not reconcile during the RollingUpdate.

Therefore, while the StatefulSet performs the RollingUpdate, it is crucial that at least one replica synchronized with the leader is secured.

While setting the StatefulSet's terminationGracePeriodSeconds to a sufficiently long duration to delay the RollingUpdate might seem adequate, I believe that using Container Lifecycle Hooks to guarantee this functionally would significantly enhance the project's reliability.
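To make the relationship between the two settings concrete: in a pod spec, `terminationGracePeriodSeconds` bounds the whole shutdown, and a `lifecycle.preStop` hook runs inside that budget. Below is a rough sketch of a StatefulSet patch, written as a Python dict and serialized to JSON. The container name `redis`, the script path `/scripts/pre-stop.sh`, and the 120-second value are hypothetical placeholders, not taken from redis-operator.

```python
import json

# Hypothetical values: the container name, script path, and 120 s budget are
# placeholders for illustration only.
patch = {
    "spec": {
        "template": {
            "spec": {
                # Upper bound on the whole shutdown, including the preStop hook.
                "terminationGracePeriodSeconds": 120,
                "containers": [
                    {
                        "name": "redis",
                        "lifecycle": {
                            "preStop": {
                                # Runs before the container receives SIGTERM.
                                "exec": {
                                    "command": ["/bin/sh", "-c", "/scripts/pre-stop.sh"]
                                }
                            }
                        },
                    }
                ],
            }
        }
    }
}

print(json.dumps(patch, indent=2))
```

Because the hook only has the grace period to work with, the grace period would still need to be long enough for a replica to finish syncing.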

Describe the solution you'd like

Describe alternatives you've considered

I propose writing a PreStop hook handler that checks whether a failover-capable replica is secured before the container terminates:

If the pod designated for deletion has a redis-role of slave, it is safe to delete the pod.

If it is a master, wait until a currently synced replica is secured. If one is already secured, proceed. If syncing is ongoing, remain in the loop until it completes (masterSyncInProgress == 0).

127.0.0.1:6379> INFO REPLICATION
# Replication
role:slave
master_host: xxx.xxx.xxx.xxx
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0

...
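The reply above is plain text: `key:value` lines, with section headers prefixed by `#`. As a minimal sketch, a hook script could turn that text into a dictionary like this. The function name `parse_info_replication` is hypothetical, and stripping whitespace around values is an assumption made here for robustness.

```python
def parse_info_replication(text: str) -> dict:
    """Parse the text returned by `redis-cli INFO REPLICATION` into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:  # ignore anything that is not a key:value pair
            fields[key.strip()] = value.strip()
    return fields


sample = """# Replication
role:slave
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
"""

info = parse_info_replication(sample)
print(info["role"], info["master_link_status"], info["master_sync_in_progress"])
# prints: slave up 0
```

All values stay strings, so numeric fields such as `master_sync_in_progress` need an explicit `int(...)` before comparison.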

I would like to hear what the maintainers think about this issue and the development of this feature.

If it's difficult for you to allocate time, I would like to add this feature myself and submit a Pull Request.

What version of redis-operator are you using?

redis-operator version:

Additional context

Here is the pseudo-code for the PreStop hook handler.

### Pseudo-Code
# Re-read INFO REPLICATION on every iteration so the loop can observe changes.
loop:
   infoReplication := redis-cli INFO REPLICATION

   role := infoReplication[role]
   masterSyncInProgress := infoReplication[master_sync_in_progress]
   connectedSlaves := infoReplication[connected_slaves]
   masterLinkStatus := infoReplication[master_link_status]

   if role == "master":
      # Safe once at least one replica is connected and no sync is in progress.
      if connectedSlaves > 0 && masterSyncInProgress == 0:
         exit(0)
   else if role == "slave":
      # Safe once the link to the master is up.
      if masterLinkStatus == "up":
         exit(0)

   sleep(1)

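One way to make the pseudo-code concrete is sketched below in Python. This is a sketch under stated assumptions, not the operator's implementation. The names `safe_to_stop`, `wait_until_safe`, and `fetch_info` are hypothetical, and `fetch_info` is assumed to return the parsed `INFO REPLICATION` fields as a dict of strings. Two deliberate deviations from the pseudo-code: Redis reports `master_sync_in_progress` on replicas, so on a master this sketch instead looks for a `slaveN` entry with `state=online`; and the loop has a timeout, because Kubernetes stops waiting for a PreStop hook once terminationGracePeriodSeconds elapses.

```python
import time


def safe_to_stop(info: dict) -> bool:
    """Return True when this node may shut down according to the proposed rules."""
    role = info.get("role")
    if role == "master":
        # Replicas the master reports as fully synced (slaveN:...,state=online,...).
        online = [
            v for k, v in info.items()
            if k.startswith("slave") and k[5:].isdigit()
            and "state=online" in v.split(",")
        ]
        return int(info.get("connected_slaves", "0")) > 0 and len(online) > 0
    if role == "slave":
        # Mirrors the pseudo-code: a replica waits until its master link is up.
        return info.get("master_link_status") == "up"
    return False  # unknown role: keep waiting


def wait_until_safe(fetch_info, timeout=30.0, interval=1.0,
                    sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll fetch_info() until safe_to_stop() holds or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if safe_to_stop(fetch_info()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated INFO snapshots: the replica finishes syncing on the third poll.
snapshots = iter([
    {"role": "master", "connected_slaves": "0"},
    {"role": "master", "connected_slaves": "1",
     "slave0": "ip=10.0.0.2,port=6379,state=wait_bgsave,offset=0,lag=0"},
    {"role": "master", "connected_slaves": "1",
     "slave0": "ip=10.0.0.2,port=6379,state=online,offset=42,lag=0"},
])
print(wait_until_safe(lambda: next(snapshots), sleep=lambda s: None))
# prints: True
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays. In a real hook, `fetch_info` would wrap a call to `redis-cli INFO REPLICATION`, and the timeout would be set below the pod's grace period.
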
sapisuper commented 5 days ago

@wkd-woo Hi, any update regarding this enhancement?

wkd-woo commented 2 days ago

> @wkd-woo Hi, any update regarding this enhancement?

@sapisuper No, the maintainers haven't given any feedback on this enhancement yet.