apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0

Getting unauthorized requests for cluster/replicas/balance in v0.8.0 #653

Open ozlerhakan opened 10 months ago

ozlerhakan commented 10 months ago

Hi Team,

We recently upgraded the operator from v0.7.0 to v0.8.0 for our SolrCloud cluster (Solr 9.4). The new version appears to request the status of an async task named balance-replicas-ScaleUp every 60 seconds. When it doesn't find one running, it sends a "cluster/replicas/balance" request to the target Solr, and every attempt fails with the following error:

Recieved bad response code of 403 from solr with response: {
\"servlet\":\"default\",
\"message\":\"Unauthorized request, Response code: 403\",
\"url\":\"/solr/____v2/cluster/replicas/balance\",
\"status\":\"403\"
}

I found that the relevant metadata for this operation is written into an annotation on the StatefulSet object: "solr.apache.org/clusterOpsLock": "{\"operation\":\"BalanceReplicas\",\"lastStartTime\":\"2023-11-09T14:28:18Z\",\"metadata\":\"ScaleUp\"}"
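For anyone checking their own cluster, here is a minimal Python sketch of decoding that annotation value to see which cluster operation currently holds the lock. The annotation key and the JSON value are copied from the StatefulSet above; the helper name `parse_cluster_ops_lock` is hypothetical, and how you fetch the annotation (for example with kubectl) is left out.

```python
import json
from datetime import datetime, timezone

# Annotation key and value as they appear on the StatefulSet in this report.
LOCK_ANNOTATION = "solr.apache.org/clusterOpsLock"
raw_value = (
    '{"operation":"BalanceReplicas",'
    '"lastStartTime":"2023-11-09T14:28:18Z",'
    '"metadata":"ScaleUp"}'
)


def parse_cluster_ops_lock(value: str) -> dict:
    """Decode the lock annotation (hypothetical helper, not operator code)."""
    lock = json.loads(value)
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so rewrite it as an explicit UTC offset first.
    started = lock["lastStartTime"].replace("Z", "+00:00")
    lock["lastStartTime"] = datetime.fromisoformat(started)
    return lock


lock = parse_cluster_ops_lock(raw_value)
print(LOCK_ANNOTATION, "->", lock["operation"], lock["metadata"], lock["lastStartTime"])
```

In this case the lock shows a BalanceReplicas operation triggered by a ScaleUp, started on 2023-11-09.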

We're using the default security.json credentials, and I'm not sure whether anything in our settings needs to change to fix this.

Thanks!

mmoscher commented 9 months ago

Not sure, but this could be a related issue. After upgrading from v0.7.0 to v0.8.0 for our SolrCloud cluster (9.0), we're getting a 404 from the Solr API:

Error returned from Solr API: 404. no core retrieved for core name: null. Path : /cluster/replicas/balance

2023-12-22T06:37:54Z    INFO    Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler  {"controller": "solrcloud", "controllerGroup": "solr.apache.org", "controllerKind": "SolrCloud", "SolrCloud": {"name":"solr","namespace":"solr"}, "namespace": "solr", "name": "solr", "reconcileID": "01215b5c-6555-404c-85f5-2a5246ef41cb"}
2023-12-22T06:37:54Z    ERROR   Reconciler error        {"controller": "solrcloud", "controllerGroup": "solr.apache.org", "controllerKind": "SolrCloud", "SolrCloud": {"name":"solr","namespace":"solr"}, "namespace": "solr", "name": "solr", "reconcileID": "01215b5c-6555-404c-85f5-2a5246ef41cb", "error": "Error returned from Solr API: 404. no core retrieved for core name:  null. Path : /cluster/replicas/balance"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:227

Same error on the Solr side:

2023-12-22 06:40:54.937 ERROR (qtp1942828992-2283) [] o.a.s.a.V2HttpCall >> path: '/cluster/replicas/balance'
2023-12-22 06:40:54.937 ERROR (qtp1942828992-2283) [] o.a.s.a.V2HttpCall Error in init() => org.apache.solr.common.SolrException: no core retrieved for core name: null. Path : /cluster/replicas/balance
        at org.apache.solr.api.V2HttpCall.init(V2HttpCall.java:155)
org.apache.solr.common.SolrException: no core retrieved for core name: null. Path : /cluster/replicas/balance

Why is the operator hitting this path?

Thanks for any help.

HoustonPutman commented 7 months ago

@mmoscher, that is expected. After an ephemeral rolling restart completes, the Solr Operator now tries to balance the cluster. If Solr doesn't support that command (and 9.0 does not), the operator just completes the operation. But the only way the Solr Operator can know whether the command is supported is to try running it.
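To illustrate the "try it and see" behavior described here, below is a rough Python sketch of how a response to the balance request could be classified. This is not the operator's actual code (the operator is written in Go); the names `BalanceOutcome` and `classify_balance_response` are hypothetical, and treating a 404 as "unsupported" is an assumption based on the 404 reported earlier in this thread.

```python
from enum import Enum


class BalanceOutcome(Enum):
    STARTED = "started"          # Solr accepted the balance request
    UNSUPPORTED = "unsupported"  # Solr has no such endpoint; finish the operation
    FAILED = "failed"            # any other error; a later reconcile retries


def classify_balance_response(status_code: int) -> BalanceOutcome:
    """Map an HTTP status from /cluster/replicas/balance to an outcome.

    Assumption: a 404 means this Solr version lacks the endpoint.
    """
    if 200 <= status_code < 300:
        return BalanceOutcome.STARTED
    if status_code == 404:
        return BalanceOutcome.UNSUPPORTED
    return BalanceOutcome.FAILED


for code in (200, 404, 403):
    print(code, classify_balance_response(code).value)
```

Under this sketch, the 404 from Solr 9.0 ends the operation cleanly, while the 403 from the original report falls into the retry path, which would match the repeated errors described above.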

@ozlerhakan I think that is something that we missed. I'll try to add a PR for that soon.

bcbrockway commented 6 days ago

Is there a workaround for this? We're getting the same error after starting up a brand new cluster.