apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0

Support Zookeeper `probes` parameters in Apache Solr Operator helm charts. #477

Closed: iampranabroy closed this issue 1 year ago

iampranabroy commented 2 years ago

Describe the issue:

When deploying SolrCloud via the Apache Solr Operator with a ZooKeeper ensemble, one of the ZooKeeper pods sometimes fails at startup with the error below:

2021-03-29 13:33:56,645 [myid:2] - ERROR [main:QuorumPeerMain@113] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1073)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)

Possible solutions:

According to the discussion in pravega/zookeeper-operator issue #315, the suggested fix is to increase probes.readiness.initialDelaySeconds from the default of 10 seconds to 30 or 60 seconds. Can we add support for the ZooKeeper config.* parameters in the Apache Solr Operator Helm charts?
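For context on the `My id 2 not in the peer list` error above: ZooKeeper raises it at startup when the server's own id is missing from the ensemble membership it loaded. The Python sketch below only illustrates that check; it is not ZooKeeper's Java code, and the function names and the hostname in `STALE_CONFIG` are hypothetical.

```python
import re


def parse_peer_ids(config_text):
    """Return the set of ids declared on `server.<id>=...` lines."""
    ids = set()
    for line in config_text.splitlines():
        match = re.match(r"\s*server\.(\d+)\s*=", line)
        if match:
            ids.add(int(match.group(1)))
    return ids


def check_my_id(my_id, config_text):
    """Mimic the startup check: fail if my_id is not a declared peer."""
    peers = parse_peer_ids(config_text)
    if my_id not in peers:
        raise RuntimeError(f"My id {my_id} not in the peer list")
    return peers


# Hypothetical membership that only lists server 1, e.g. a stale
# config left on a volume or one written before pod 2 was registered.
STALE_CONFIG = """
server.1=zk-0.zk-headless.default.svc.cluster.local:2888:3888:participant;0.0.0.0:2181
"""
```

With this config, `check_my_id(1, STALE_CONFIG)` passes, while `check_my_id(2, STALE_CONFIG)` raises the same message seen in the log. A longer readiness delay would only help if the membership eventually includes the new pod.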

mmoscher commented 2 years ago

@iampranabroy did you find an interim solution? I'm facing the same issue, and I think (for now) the only way forward is to deploy a separate ZooKeeper cluster.

I will dig into this and submit a PR. It shouldn't be that hard, I think.

iampranabroy commented 2 years ago

Hey @mmoscher - As of now, NO. If you can raise a PR that would be great. @HoustonPutman - If there are any upcoming minor releases, can we add this item?

mmoscher commented 2 years ago

~However, I can confirm that the described solution, i.e. increasing the livenessProbe.initialDelaySeconds, works. After setting this to 30s I was able to successfully deploy a zookeeper cluster with replicas > 1.~

//Edit: false positive ... just had a bunch of luck. For now I'm unable to successfully (re-)deploy a zookeeper cluster. Let's move this discussion back to: https://github.com/pravega/zookeeper-operator/issues/315

HoustonPutman commented 1 year ago

@mmoscher We can definitely add probes support through the Solr Operator, but just to make sure: you solved this issue independently of any Solr/ZK settings, correct?

mmoscher commented 1 year ago

@HoustonPutman yes, solved it without using any probes. The problem was related to wrong NetworkPolicies and old (maybe corrupted) configs in the zookeeper PVC, cf. https://github.com/pravega/zookeeper-operator/issues/315#issuecomment-1259187314

iampranabroy commented 1 year ago

Hey, @mmoscher - Thanks for your response. In my case, the Solr cluster and the zookeeper cluster are deployed in the same namespace, but I have still seen this error several times. If we can add support for probes.readiness.initialDelaySeconds, we can see whether that resolves the problem.

@mmoscher - Do you have your zookeeper and solr deployed in the same namespace or a different namespace? Was curious about allow-zookeeper-access: true

mmoscher commented 1 year ago

@iampranabroy yes, all resources (Solr + ZK) are in the same namespace, with NetworkPolicies denying all pods' egress traffic.
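Since the root cause here turned out to be NetworkPolicies rather than probe timing, a quick connectivity check from inside a pod can help rule that out. Below is a minimal Python sketch, assuming the ZooKeeper default ports (2181 client, 2888 quorum, 3888 leader election); the function names are hypothetical, and the peer hostname has to come from your own deployment.

```python
import socket

# ZooKeeper default ports; adjust if your ensemble uses different ones.
ZK_PORTS = {"client": 2181, "quorum": 2888, "leader-election": 3888}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_peer(host, ports=ZK_PORTS, timeout=2.0):
    """Map each named port to whether it is reachable on host."""
    return {name: port_open(host, port, timeout) for name, port in ports.items()}
```

Running `check_peer("<zk-pod-hostname>")` from another ZooKeeper pod should report all three ports as reachable once the peer is up. If the quorum or leader-election port times out while the peer is running, a policy blocking egress between the pods is a likely suspect.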