We just observed this behavior, and in the logs I discovered this error: `error running SLAVE OF command: dial tcp 10.138.59.180:9999: i/o timeout`. So I assume one of the following happened:
- a network issue
- the Dragonfly main/networking thread was blocked
- Dragonfly crashed without killing the process
Because of this, I would like to suggest the following changes (a rough sketch of the check follows below):
- Check via a Redis client that the operator can talk to the new master before promoting it.
- Check via a Redis client that the operator can talk to the (now) replicas before setting them as slaves of the new master.
- Kill the pod if the operator can't talk to it after X tries (configurable? 0 meaning "do not kill it"?).
Regarding this: https://github.com/dragonflydb/dragonfly-operator/blob/64cfcbae58dc68c600f313b344e6ad19ad332fe6/internal/controller/dragonfly_instance.go#L116
and this: https://github.com/dragonflydb/dragonfly-operator/blob/64cfcbae58dc68c600f313b344e6ad19ad332fe6/internal/controller/dragonfly_instance.go#L117
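To make the suggestion concrete, here is a minimal, hypothetical Go sketch (not the operator's actual code) using go-redis and client-go: ping the target pod before promoting it or sending SLAVE OF, and delete the pod if it stays unreachable after a configurable number of tries, with 0 meaning "check but never kill". The function name, parameters, and retry/kill policy are all made up for illustration.

```go
package failover

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureReachableOrKill is a hypothetical pre-check: ping addr up to maxTries
// times before the operator promotes the pod or sends SLAVE OF to it.
// maxTries == 0 means "check once, report the error, but never kill the pod".
func ensureReachableOrKill(ctx context.Context, k8s kubernetes.Interface,
	namespace, podName, addr string, maxTries int) error {

	rdb := redis.NewClient(&redis.Options{
		Addr:        addr,
		DialTimeout: 3 * time.Second,
	})
	defer rdb.Close()

	tries := maxTries
	if tries == 0 {
		tries = 1 // still check once; 0 only disables the kill below
	}

	var lastErr error
	for i := 0; i < tries; i++ {
		if lastErr = rdb.Ping(ctx).Err(); lastErr == nil {
			return nil // pod answers, safe to promote it / issue SLAVE OF
		}
		time.Sleep(2 * time.Second)
	}

	if maxTries == 0 {
		// Configured to never kill: just surface the error to the reconciler.
		return fmt.Errorf("pod %s unreachable: %w", podName, lastErr)
	}

	// Still unreachable after maxTries: delete the pod so it is recreated,
	// instead of leaving the replication setup half-configured.
	if err := k8s.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil {
		return fmt.Errorf("pod %s unreachable (%v), delete also failed: %w", podName, lastErr, err)
	}
	return fmt.Errorf("pod %s unreachable after %d tries, pod deleted: %w", podName, maxTries, lastErr)
}
```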