No registered leader was found after waiting for

Hi,

I'm using solr-operator v0.7.0 and zookeeper-operator 0.2.15. The Solr version is 6.6.2.

One of our Solr clusters has 17 nodes (17 shards of around 100GB each, with 1 replica). From time to time some shards start failing, and the error in the logs is:

2023-06-28 10:15:47.505 ERROR (recoveryExecutor-3-thread-1-processing-n:solr-xxx-solrcloud-11.solr-xxx-solrcloud-headless.xxx:8983_solr x:xxx_shard13_replica2 s:shard13 c:xxx r:core_node77) [c:xxx s:shard13 r:core_node77 x:xxx_shard13_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover. core=xxx_shard13_replica2:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: xxx slice: shard13
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:748)
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:734)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:368)
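
To narrow down which shards are affected when this happens, the cluster state can be inspected through the Collections API (`action=CLUSTERSTATUS`). The following is only a minimal sketch: the function names and the base URL are hypothetical, and it assumes the usual response shape in which the leader replica carries `"leader": "true"`.

```python
import json
from urllib.request import urlopen


def leaderless_shards(cluster_status, collection):
    """Return the shards of `collection` that have no replica flagged as leader."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    missing = []
    for shard_name, shard in shards.items():
        replicas = shard.get("replicas", {}).values()
        # Assumption: the leader replica is marked with the string "true".
        if not any(r.get("leader") == "true" for r in replicas):
            missing.append(shard_name)
    return sorted(missing)  # lexicographic order, so "shard13" sorts before "shard2"


def fetch_cluster_status(base_url, collection):
    """Fetch CLUSTERSTATUS for one collection; base_url is a placeholder."""
    url = (f"{base_url}/admin/collections"
           f"?action=CLUSTERSTATUS&collection={collection}&wt=json")
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical address; replace with a reachable Solr node.
    status = fetch_cluster_status("http://localhost:8983/solr", "xxx")
    print(leaderless_shards(status, "xxx"))
```

Running this while the error is occurring would show whether only shard13 is missing a leader or several shards are affected at once.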

I've set solrOpts: "-DzkClientTimeout=1200000", but that does not seem to be the timeout I'm hitting, since the error still reports 4000ms. I also tried solrZkOpts: '-Dzookeeper.connection.timeout.ms=60000', but that did not change anything.

Could you please advise what I can do about this issue?

Thanks!

apache/solr-operator issue #581