apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0

Solr operator not updating all container images on helm update #574

Closed fliphess closed 1 year ago

fliphess commented 1 year ago

Hi :)

We are using a GitLab pipeline running helm to deploy our Solr cluster. Because we want some utilities like the AWS CLI on board for restoring from a backup, we build a new Docker image with every pipeline run. Since we tag the image with the git shasum of the repository, the image changes with every pipeline run.
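
Roughly, the pipeline does something like this (registry name and helm value paths are placeholders here, not our exact setup):

# Build and push a Solr image tagged with the current git sha (registry is a placeholder)
docker build -t registry.example.com/solr-cluster:$(git rev-parse HEAD) .
docker push registry.example.com/solr-cluster:$(git rev-parse HEAD)

# Roll the cluster with helm, pointing the image tag at the new sha
# (the exact value path depends on the chart, shown only for illustration)
helm upgrade --install solr-cluster ./chart -n cluster \
  --set image.repository=registry.example.com/solr-cluster \
  --set image.tag=$(git rev-parse HEAD)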

We notice something weird: when updating, some of the nodes are updated, but not all of them. If we have 3 Solr pods, 2 of them are using the latest image, but one is not (the setup-zk init container and the solrcloud-node container use the same image).

Looking at both the statefulset and the solrcloud objects: both show that all nodes are ready and up to date, but one of the pods is not updated at all...

We use the latest v8 Solr version (8.11.2) from Docker Hub as the base image, to which we add some extra helper tools for when things go haywire, and version 0.7.0 of the operator.

I'm not sure what information to provide; I can provide a lot :)

kubectl get pods  -n cluster cluster-solrcloud-0 cluster-solrcloud-1 cluster-solrcloud-2 -o yaml  | grep image: | grep solr-cluster | cut -d: -f3 | sort | uniq -c
      2 4119877046180762ee630bd4165c839c488371b7
     10 5f639c6751ab8faa1bd485a3e1b0f7362b3437b2

I've checked the logs of the operator and I don't see any issues: the operator does a loop updating all nodes but skips the last one (3 replicas, node 2 is not updated).

Our update strategy is as follows:

  updateStrategy:
    managed:
      maxPodsUnavailable: 1
      maxShardReplicasUnavailable: 1
    method: Managed

I have attached my solrcloud yaml to this issue :)

solrcloud.txt

fliphess commented 1 year ago

Adding to this issue: in the solr-operator logs I see a lot of these:

solr-operator-5c7899cdff-ng2tl solr-operator 2023-05-30T15:13:56Z INFO ManagedUpdateSelector Pod update selection canceled. The number of updated pods unavailable equals or exceeds the calculated maxPodsUnavailable. {"controller": "solrcloud", "controllerGroup": "solr.apache.org", "controllerKind": "SolrCloud", "SolrCloud": {"name":"solr-cluster","namespace":"solr-cluster"}, "namespace": "solr-cluster", "name": "solr-cluster", "reconcileID": "c88d7a5a-d2dd-498e-a9a3-f0c789e86ab1", "unavailableUpdatedPods": 1, "outOfDatePodsNotStarted": 0, "alreadyScheduledForDeletion": 0, "maxPodsUnavailable": 1}

Does this mean the update for a specific pod is canceled? Or is it postponed to be updated at a later time?

HoustonPutman commented 1 year ago

Ahh yeah, that log line might be a bit unclear. You cannot "cancel" an update to a pod; it's postponed until later, and eventually the pod will be updated (if the conditions to update are met).

That log line is telling you that one of the pods is not healthy, so at some point that pod is not "ready". If there are still pods that haven't been updated, look for the most recent log lines in the operator to see why it isn't continuing and deleting the last few pods. If that log line is still being printed, then for some reason the Solr operator does not believe that all Solr pods are "ready".

Maybe the cluster is having issues scheduling the pods after they are deleted? Can you do a kubectl get pods to show the Solr pods?
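
For example, something like this (using the names from your log line above; adjust the operator namespace as needed):

# Check that every Solr pod is Running and Ready
kubectl get pods -n solr-cluster

# Tail the operator logs and filter for the managed-update decisions
kubectl logs deployment/solr-operator -n <operator-namespace> | grep ManagedUpdateSelector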

fliphess commented 1 year ago

Hey @HoustonPutman! Thanks for the reply.

The weird thing is that the statefulset itself shows the correct image tag, and so does the solrcloud yaml object. The solrcloud status indicates it's up to date, and the solr-operator is not producing any new log output. In the meantime all the pods in the Solr cluster are up, but not all of them are properly updated.

Thinking out loud: I didn't check kubectl events; I'll check that tomorrow morning right away. Perhaps there is some node anti-affinity in the way of scheduling the new pod while the old one is terminating, or something...
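
Something along these lines (namespace and pod name taken from the kubectl command earlier in this issue):

# List recent events in the namespace, newest last, to spot scheduling/affinity problems
kubectl get events -n cluster --sort-by=.lastTimestamp

# Or look at the one pod that was skipped (node 2)
kubectl describe pod cluster-solrcloud-2 -n cluster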

HoustonPutman commented 1 year ago

In the meantime all the pods in the solr cluster are up, but not all of them are properly updated.

the solr-operator is not generating new logging.

These two things both being true is very, very strange. If you could provide the output of kubectl describe solrcloud <name>, that could be useful for looking at the solrcloud's status.
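
Based on the names in your operator log line, that would be something like:

# Inspect the SolrCloud resource's status and conditions
kubectl describe solrcloud solr-cluster -n solr-cluster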

fliphess commented 1 year ago

I found something: when I kill the solr-operator pod, everything starts running again, and soon after, all pods are at the same container version. The solr-operator then starts logging again and triggers new backups etc... So apparently the solr-operator becomes unresponsive.

Before digging any further, let me first check what happens if I give the solr-operator a lot more CPU and memory; perhaps it's running out of something...
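
Something like this, assuming the operator chart exposes a standard resources block (the values are just a first guess):

# Give the operator more headroom via its helm values (placeholder values)
helm upgrade solr-operator apache-solr/solr-operator \
  --reuse-values \
  --set resources.requests.cpu=500m \
  --set resources.requests.memory=512Mi \
  --set resources.limits.cpu=1 \
  --set resources.limits.memory=1Gi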

fliphess commented 1 year ago

I'm closing this for now: after changing the resources for our operator, the problem hasn't appeared again, so I think this is a corner case in our own cluster rather than a problem in the operator itself.

Thanks for your help and suggestions Houston! :)