Open kos-team opened 4 months ago
Thanks for finding and reporting this! Yeah for some reason in #689 I set err = nil
, at https://github.com/apache/solr-operator/blob/b2c951052828f88e9fc415f54aec189b805117e9/controllers/solrcloud_controller.go#L493, which might be the cause for this. I'll need to investigate why I did that, but we will certainly get this fixed for v0.9.0!
@HoustonPutman Thanks for the confirmation. Looking forward!
Bug Report
What did you do? / Minimal Reproducible Example We first deployed a five-replica Solr cluster, then modified the CR to scale down to three replicas. The scale down triggers the solr-operator to perform the ManagedScaleDown operation, invoking the
handleManagedCloudScaleDown
function. Inside the function, the operator invokes theevictSinglePod
function. If the call to the Solr transiently fails (due to transient network issue or transient unavailability of Solr), the operator writes an error message and never retries. Even if the transient fault is resolved later, the operator is stuck and never perform the scale down operation.To reproduce this bug,
replicas: 5
toreplicas:3
in the CR to initiate the scale down.What did you expect to see? The operator should be able to scale down the cluster after the transient faults go away
What did you see instead? The operator never requeues the request
Root Cause The root cause of this bug is at https://github.com/apache/solr-operator/blob/b2c951052828f88e9fc415f54aec189b805117e9/controllers/solr_cluster_ops_util.go#L251C57-L251C71 where the operator does not handle the error returned by
evictSinglePod
but just directly returns normally.The fix should be simple to add an error handling branch to requeue the request.