Open ramayer opened 1 year ago
...where it seems like the zookeeper cluster never reaches a quorum; apparently timing out when the second zookeeper node attempts to connect to example-solrcloud-zookeeper-client:2181
I see the same on macOS on Apple M1, using Docker Desktop 4.9.1. There is some communication issue between the zk nodes, so I always run with only 1 zk when debugging on my dev MacBook.
@janhoy
Thanks. Adding --set zk.provided.replicas=1
worked for me in the macOS/Colima environment. However, the minikube-on-ubuntu-when-not-using-containerd example I mentioned above still fails, with what seems to be a different communication failure between the Solr nodes when trying to create a 3-shard collection on a 3-node cluster.
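For reference, the full workaround looks roughly like this. The release name `example` and the chart name `apache-solr/solr` are assumptions here; substitute whatever you used when installing the example SolrCloud:

```shell
# Install the example SolrCloud with a single ZooKeeper replica to avoid
# the quorum timeout on single-node dev clusters.
# NOTE: release name "example" and chart "apache-solr/solr" are assumed;
# adjust them to match your original install.
helm upgrade --install example apache-solr/solr \
  --set zk.provided.replicas=1
```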
@ramayer can you maybe catch any of the error logs from:
where it seems like the zookeeper cluster never reaches a quorum; apparently timing out when the second zookeeper node attempts to connect to example-solrcloud-zookeeper-client:2181
and
where it seems each Solr instance can communicate with the other two just fine, but appears to have a network timeout when it attempts to communicate with another shard on the same host. I can create some collections, but am unable to create any collection that has as many shards as Solr pod instances.
and any other cases where you seem to have an idea of what is happening. I know you shared how to reproduce (and that is helpful) - any log messages from ZK or Solr would potentially help as well.
The error messages don't show much except for timeouts that look like this
2022-12-08 21:22:07.442 WARN (qtp1221991240-60) [] o.a.s.h.a.AdminHandlersProxy Timeout when fetching result from node default-example-solrcloud-1.ing.local.domain:80_solr => java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.FutureTask.get(Unknown Source)
java.util.concurrent.TimeoutException: null
at java.util.concurrent.FutureTask.get(Unknown Source) ~[?:?]
at org.apache.solr.handler.admin.AdminHandlersProxy.maybeProxyToNodes(AdminHandlersProxy.java:112) ~[?:?]
at org.apache.solr.handler.admin.SystemInfoHandler.handleRequestBody(SystemInfoHandler.java:137) ~[?:?]
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224) ~[?:?]
at org.apache.solr.handler.admin.InfoHandler.handle(InfoHandler.java:94) ~[?:?]
at org.apache.solr.handler.admin.InfoHandler.handleRequestBody(InfoHandler.java:82) ~[?:?]
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224) ~[?:?]
at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:941) ~[?:?]
at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:893) ~[?:?]
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:584) ~[?:?]
at org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:250) ~[?:?]
at org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:218) ~[?:?]
at org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257) ~[?:?]
...
but here is an easy way to reproduce a similar timeout on Minikube on Ubuntu using the Docker runtime, and on Colima on macOS. This command
kubectl exec -it pod/example-solrcloud-1 -- curl default-example-solrcloud-0.ing.local.domain/solr/
works fine (returning the Solr admin page as you'd expect), but this
kubectl exec -it pod/example-solrcloud-1 -- curl default-example-solrcloud-1.ing.local.domain/solr/
times out. I think similar timeouts happen when a collection has multiple shards on the same node that the client connected to.
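A quick way to check every pod-to-pod combination (including each pod reaching itself, which is the failing case) is a small loop like the one below; the pod names and hostname pattern are taken from the example above and may need adjusting for your namespace:

```shell
# Probe each Solr pod from each other Solr pod, including itself.
# --max-time keeps a hung connection from blocking the whole loop;
# -w prints the HTTP status so a working hop shows e.g. "200".
for src in 0 1 2; do
  for dst in 0 1 2; do
    echo -n "pod $src -> pod $dst: "
    kubectl exec pod/example-solrcloud-$src -- \
      curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 \
      "http://default-example-solrcloud-$dst.ing.local.domain/solr/" \
      || echo "TIMEOUT/ERROR"
  done
done
```

On the broken clusters I'd expect only the diagonal (src == dst) entries to fail.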
For the minikube failure mode mentioned above, I think it's related to this minikube GitHub issue: https://github.com/kubernetes/minikube/issues/13370 , with the workaround from those comments being to add --cni=bridge to the minikube startup line.
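Concretely, the workaround from that issue amounts to recreating the cluster with the bridge CNI (any other flags you normally pass to minikube start would go on the same line):

```shell
# Recreate the minikube cluster using the bridge CNI, which avoids the
# self-connection problem described in kubernetes/minikube#13370.
minikube delete
minikube start --cni=bridge
```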
I still haven't had any luck finding workarounds for the colima --kubernetes distribution of Kubernetes.
Some comments there suggest that iptables-based implementations of Services can have trouble when a pod tries to connect to itself through a Service URL (the "hairpin" case). I wonder if this operator could be tweaked so that pods don't try to reach themselves through the service name, but I don't understand enough to know if that makes sense.
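One thing that might narrow this down is checking which proxy mode the cluster is actually using. On clusters where kube-proxy is configured via the usual ConfigMap this can be read as below; k3s (which Colima uses, I believe) embeds kube-proxy differently, so this may come back empty there:

```shell
# Show the kube-proxy mode (iptables vs ipvs) if kube-proxy is configured
# via the standard ConfigMap; an empty "mode:" means the iptables default.
kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'
```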
The Solr Operator works great in about half of the Kubernetes environments I'm testing, but fails in the other half.
It fails for me on macOS using a Kubernetes environment created with:
where it seems like the zookeeper cluster never reaches a quorum; apparently timing out when the second zookeeper node attempts to connect to example-solrcloud-zookeeper-client:2181 . It seems as if the default networking in Colima's Kubernetes (k3s, I think) does not allow connections to that service until the service is ready (which never seems to happen), but I don't know how to debug this further.
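One way to test that theory would be to watch whether the client service ever gets endpoints while the ZooKeeper pods are coming up (service name taken from the error above; adjust for your namespace):

```shell
# Watch the ZooKeeper client service's endpoints during startup; if the
# ENDPOINTS column stays <none>, nothing can connect through the service,
# which would explain the second node's connection timeout.
kubectl get endpoints example-solrcloud-zookeeper-client --watch
```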
It works fine for me on the same macOS host using a Kubernetes environment created with:
It fails for me on Ubuntu 22.04 using a kubernetes environment started with:
where it seems each Solr instance can communicate with the other two just fine, but appears to have a network timeout when it attempts to communicate with another shard on the same host. I can create some collections, but am unable to create any collection that has as many shards as Solr pod instances.
It works fine for me on the same Ubuntu 22.04 host using:
It works fine for me on Microsoft Azure's AKS using the instructions here.
In all cases, after creating the Kubernetes environment, I'm attempting to create the solr cluster with
I think most of the failure modes are related to when, during the startup process, Kubernetes exposes enough information (DNS? IP addresses?) to the nodes, but I don't know Kubernetes networking well enough to debug this.
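If it is a startup-ordering problem, comparing what the cluster has actually published at a given moment might help pin it down. A rough checklist (pod and service names from the examples above; getent is available in most glibc-based images, though that is an assumption about the Solr image):

```shell
# What the cluster has published at this moment during startup:
kubectl get pods -o wide      # have pod IPs been assigned yet?
kubectl get endpoints         # do the Services have endpoints yet?

# From inside a Solr pod, is the ZooKeeper service name resolvable yet?
kubectl exec pod/example-solrcloud-0 -- \
  getent hosts example-solrcloud-zookeeper-client
```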