dragonflydb / dragonfly-operator

A Kubernetes operator to install and manage Dragonfly instances.
https://www.dragonflydb.io/docs/managing-dragonfly/operator/installation
Apache License 2.0

Operator sometimes does not apply all updates #93

Closed pstewy closed 11 months ago

pstewy commented 1 year ago

When working with the operator, I found that it sometimes thinks the pods it manages match the statefulset when they do not. One example is applying an update to the installed CRD instance that changes both the replicas and the memory limit. If I increase the replicas, the operator spins up the new pods but never updates the existing pods.

While troubleshooting, I found that adding a sleep of a few seconds here fixes the problem (which I figure is not an acceptable fix). It appears the operator retrieves the statefulset before the cluster has fully acknowledged the change.

Pothulapati commented 1 year ago

Ah, so a CRD update that includes both a replicas change and a memory limit change? Maybe the rollout mechanism is not getting triggered as we expect. Will investigate! 👍🏼

A sleep is probably not an acceptable fix, and we should think of something better here! Also, the change ack is usually pretty fast, right? I wonder why it's not working as expected 🤔

pstewy commented 1 year ago

Yeah, a sleep definitely is not ideal and does not (to no one's surprise 😄) fix all the cases. My understanding is that the ctrl client uses a cache by default, so it's possible the request to get the statefulset returns the cached object. I tried comparing the spec of the created one (which I pulled before applying the update to the cluster) vs. the existing one (taken from this thread), and my initial testing shows it working as expected.

Another method I've read about is taking a hash of the spec and adding it to the pods as an annotation. Then, when checking whether we need to do a rollout, we compare the hash on the pods vs. the hash of the changes that were just applied. This also covers the case where the operator is restarted for whatever reason between applying the changes to the cluster and triggering the rollout.
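For illustration, the hash-annotation idea could look roughly like the sketch below. It uses only the Go standard library and simplified stand-in types rather than the real Kubernetes API objects. The annotation key `example.com/spec-hash`, the `PodTemplateSpec` and `Pod` types, and the helper names are all hypothetical and are not taken from the operator's code. Leaving the replica count out of the hashed struct is also an assumption, made so that scaling alone would not mark existing pods as stale.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// specHashAnnotation is a hypothetical annotation key, not one the operator defines.
const specHashAnnotation = "example.com/spec-hash"

// PodTemplateSpec is a simplified stand-in for the fields that should
// trigger a rollout when they change (assumption: replicas is excluded).
type PodTemplateSpec struct {
	Image       string `json:"image"`
	MemoryLimit string `json:"memoryLimit"`
}

// Pod is a simplified stand-in for a managed pod.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// hashSpec returns the hex-encoded SHA-256 of the JSON-encoded spec.
// encoding/json writes struct fields in declaration order, so the
// encoding (and therefore the hash) is deterministic for equal specs.
func hashSpec(spec PodTemplateSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// podsNeedingRollout returns the names of pods whose recorded hash
// differs from the desired hash. A pod without the annotation is
// treated as stale, because a missing key reads as the empty string.
func podsNeedingRollout(desiredHash string, pods []Pod) []string {
	var stale []string
	for _, p := range pods {
		if p.Annotations[specHashAnnotation] != desiredHash {
			stale = append(stale, p.Name)
		}
	}
	return stale
}

func main() {
	oldHash, err := hashSpec(PodTemplateSpec{Image: "dragonfly", MemoryLimit: "1Gi"})
	if err != nil {
		panic(err)
	}
	newHash, err := hashSpec(PodTemplateSpec{Image: "dragonfly", MemoryLimit: "2Gi"})
	if err != nil {
		panic(err)
	}

	// Two pods predate the memory limit change; the third was created by the scale-up.
	pods := []Pod{
		{Name: "dragonfly-0", Annotations: map[string]string{specHashAnnotation: oldHash}},
		{Name: "dragonfly-1", Annotations: map[string]string{specHashAnnotation: oldHash}},
		{Name: "dragonfly-2", Annotations: map[string]string{specHashAnnotation: newHash}},
	}

	fmt.Println(podsNeedingRollout(newHash, pods)) // prints [dragonfly-0 dragonfly-1]
}
```

Because the comparison relies only on what is recorded on the pods and on the spec that was just applied, it would not depend on whether a cached statefulset read is up to date, and it would give the same answer after an operator restart.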