dragonflydb / dragonfly-operator

A Kubernetes operator to install and manage Dragonfly instances.
https://www.dragonflydb.io/docs/managing-dragonfly/operator/installation
Apache License 2.0

Cache reset on rolling updates or if master restarts #191

Open cmergenthaler opened 1 month ago

cmergenthaler commented 1 month ago

We have a Dragonfly cluster running with 3 replicas, which has worked fine so far. Over the last day we have observed two strange behaviors:

  1. When doing a rolling update, e.g. by increasing resource requests, the operator restarts one pod after another (starting with the replicas) and chooses a new master. So far so good, but once all pods are up and running again, the cache is reset instead of synced. Number of entries in the cache:

     (Screenshot 2024-05-29 14:19:47)

  2. When the master restarts at some point, the cache is also reset instead of failing over to one of the replicas. Since the master restarts with an empty cache, the 2 replicas are synced to empty caches as well. Number of entries in the cache:

     (Screenshot 2024-05-29 14:22:32)

Should these two cases be supported already?
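For anyone wanting to reproduce the key-count graphs above without a dashboard, the counts can be polled directly: Dragonfly speaks the Redis RESP wire protocol, so `DBSIZE` can be sent to each pod over plain TCP. A stdlib-only Go sketch (the pod addresses are placeholders; in a cluster you would port-forward or run this inside the namespace):

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"time"
)

// respCommand encodes a command in the RESP wire format that Dragonfly
// (like Redis) understands, e.g. ["DBSIZE"] becomes "*1\r\n$6\r\nDBSIZE\r\n".
func respCommand(args ...string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "*%d\r\n", len(args))
	for _, a := range args {
		fmt.Fprintf(&b, "$%d\r\n%s\r\n", len(a), a)
	}
	return b.String()
}

// pollDBSize sends DBSIZE to one pod and returns the raw reply line
// (an integer reply such as ":20000").
func pollDBSize(addr string) (string, error) {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(respCommand("DBSIZE"))); err != nil {
		return "", err
	}
	return bufio.NewReader(conn).ReadString('\n')
}

func main() {
	// Placeholder addresses — substitute the real pod DNS names.
	for _, addr := range []string{"dragonfly-0:6379", "dragonfly-1:6379"} {
		reply, err := pollDBSize(addr)
		if err != nil {
			fmt.Println(addr, "unreachable:", err)
			continue
		}
		fmt.Println(addr, "DBSIZE reply:", strings.TrimSpace(reply))
	}
}
```

Watching these numbers per pod during the rolling update shows exactly when (and on which pod) the count drops to zero.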

Abhra303 commented 1 month ago

Hi @cmergenthaler, thanks for reporting!

> When doing a rolling update, e.g. by increasing resource requests, the operator restarts one pod after another (starting with the replicas) and chooses a new master. So far so good, but once all pods are up and running again, the cache is reset instead of synced.

We currently don't have this feature, so restarts will lose all the data. But it would be a nice enhancement indeed.

> When the master restarts at some point, the cache is also reset instead of failing over to one of the replicas. Since the master restarts with an empty cache, the 2 replicas are synced to empty caches as well.

We have an open PR for that (#189); I will push some fixes to it, so you can expect this in the next version.

cmergenthaler commented 4 weeks ago

> We currently don't have this feature, so restarts will lose all the data. But it would be a nice enhancement indeed.

So whenever we change something on the deployment, our cache is reset? It would be nice if the rolling update made sure that the cache stays synced across restarting pods.

> We have an open PR for that (#189); I will push some fixes to it, so you can expect this in the next version.

Good to hear, thanks!

Just one question regarding HA: shouldn't the operator always fail over to one of the replicas as soon as the master restarts, so that we don't lose data? Otherwise the cache will always be reset on all replicas.
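Conceptually, the fail-over asked about here is a small decision rule: when the master restarts, prefer promoting a ready replica (which still holds the data) over letting the restarted, now-empty master keep its role. A hand-written Go sketch of that rule — illustrative types only, not the operator's actual code:

```go
package main

import "fmt"

// Pod is a simplified view of a Dragonfly pod as a controller might
// see it (illustrative fields, not the operator's real types).
type Pod struct {
	Name  string
	Role  string // "master" or "replica"
	Ready bool
}

// nextMaster decides who should be master after the current master
// restarts: prefer a ready replica, which still holds the dataset,
// over the restarted master pod, whose cache is now empty.
func nextMaster(pods []Pod) (Pod, bool) {
	for _, p := range pods {
		if p.Role == "replica" && p.Ready {
			return p, true
		}
	}
	// No healthy replica: fall back to any ready pod, accepting data loss.
	for _, p := range pods {
		if p.Ready {
			return p, true
		}
	}
	return Pod{}, false
}

func main() {
	pods := []Pod{
		{Name: "dragonfly-0", Role: "master", Ready: false}, // restarting, empty
		{Name: "dragonfly-1", Role: "replica", Ready: true},
		{Name: "dragonfly-2", Role: "replica", Ready: true},
	}
	if m, ok := nextMaster(pods); ok {
		fmt.Println("new master:", m.Name)
	}
}
```

Without the first preference, re-electing the restarted pod means its empty dataset is replicated outward, which matches the reset-to-zero behavior described in the issue.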

Abhra303 commented 3 weeks ago

> So whenever we change something on the deployment, our cache is reset? It would be nice if the rolling update made sure that the cache stays synced across restarting pods.

Actually, I was wrong here. The operator does handle syncing of replicas on rolling updates. I think we have a bug here; will debug more.

cmergenthaler commented 2 weeks ago

> Actually, I was wrong here. The operator does handle syncing of replicas on rolling updates. I think we have a bug here; will debug more.

Thanks! I think the problem is that a new pod (which was a replica) gets the master role too early, before its cache is fully synced with the old master. For example, we had 20k keys in the cache; after the replicas restarted and while the master was restarting, a replica pod became the master while it only contained 1k keys. The remaining 19k keys had not been synced yet and were therefore lost from the cache.
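The failure mode described above (a replica promoted with only 1k of 20k keys) suggests the promotion decision needs to gate on replication progress: skip replicas still performing their initial full sync, and among the rest prefer the one furthest along the replication stream. A hedged Go sketch — the field names loosely mirror Redis-style `INFO replication` fields and are assumptions, not the operator's actual code:

```go
package main

import "fmt"

// ReplicaState models what a controller could read from a replica's
// replication info (illustrative field names).
type ReplicaState struct {
	Name         string
	FullSyncDone bool  // initial snapshot load from the master finished
	ReplOffset   int64 // bytes of the replication stream applied
}

// safestToPromote picks the replica that has completed its full sync
// and has applied the most of the replication stream, so that
// promoting it loses as few keys as possible.
func safestToPromote(replicas []ReplicaState) (ReplicaState, bool) {
	var best ReplicaState
	found := false
	for _, r := range replicas {
		if !r.FullSyncDone {
			continue // promoting now would drop the keys not yet synced
		}
		if !found || r.ReplOffset > best.ReplOffset {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	replicas := []ReplicaState{
		{Name: "dragonfly-1", FullSyncDone: false, ReplOffset: 1_000}, // the 1k-key replica
		{Name: "dragonfly-2", FullSyncDone: true, ReplOffset: 20_000},
	}
	if r, ok := safestToPromote(replicas); ok {
		fmt.Println("promote:", r.Name)
	} else {
		fmt.Println("wait: no replica has finished syncing")
	}
}
```

Under this rule the 1k-key replica from the report would never have been promoted; the controller would either pick a fully synced sibling or wait.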