Solr Operator seems very picky about the Kubernetes environment it's using (guessing networking/dns)

ramayer commented 1 year ago

Solr Operator is working great in about half of the Kubernetes environments I'm testing, but fails in the other half.

It fails for me on macOS using a kubernetes environment created with:

colima start --cpu 4 --memory 8 --kubernetes

where it seems like the zookeeper cluster never reaches a quorum; apparently timing out when the second zookeeper node attempts to connect to example-solrcloud-zookeeper-client:2181. It seems as if the default networking in colima's Kubernetes (k3s, I think) does not allow connections to that service until the service is ready (which never seems to happen), but I don't know how to debug this further.

It works fine for me on the same macOS host using a kubernetes environment created with:

podman machine init -m 16000 --cpus 4 -v "$HOME:$HOME" --rootful
podman machine start
minikube start --driver=podman --cpus 4 --memory 12000 --profile=minikube-on-podman

It fails for me on Ubuntu 22.04 using a kubernetes environment started with:

minikube start

where it seems each Solr instance can communicate with the other two just fine, but appears to have a network timeout when it attempts to communicate with another shard on the same host. I can create some collections, but am unable to create any collection that has as many shards as solr pod instances.

It works fine for me on the same Ubuntu 22.04 host using:

 minikube start --container-runtime=containerd --cpus 4 --mount-string=$HOME/proj/kube/persistent_volumes:/mnt/host --mount

It works fine for me on Microsoft Azure's AKS using the instructions here.

In all cases, after creating the Kubernetes environment, I'm attempting to create the solr cluster with

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/static/provider/cloud/deploy.yaml
kubectl create -f https://solr.apache.org/operator/downloads/crds/v0.6.0/all-with-dependencies.yaml
helm install solr-operator apache-solr/solr-operator --version 0.6.0
helm install example-solr apache-solr/solr --version 0.6.0 \
  --set image.tag=9.0 \
  --set solrOptions.security.authenticationType="Basic" \
  --set solrOptions.javaMemory="-Xms300m -Xmx300m" \
  --set addressability.external.method=Ingress \
  --set addressability.external.domainName="ing.local.domain" \
  --set addressability.external.useExternalAddress="true" \
  --set ingressOptions.ingressClassName="nginx"

I think most of the failure modes are related to when, during the startup process, Kubernetes exposes enough information (DNS? IP addresses?) to nodes, but I don't know Kubernetes networking well enough to debug this.
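One way to narrow down the "service is never ready" theory is to check whether the ZooKeeper client service has any ready endpoints at all. The following is only a sketch: it assumes the release lives in the `default` namespace, and the function names `service_ready_ips` and `check_service` are made up for illustration.

```shell
# Sketch: report whether a Service currently has ready endpoints.
# A Service with no ready endpoints has nothing to route traffic to, which
# could look like the connect timeout described above.

# service_ready_ips SERVICE [NAMESPACE]
# Prints the IPs of the ready endpoints behind SERVICE (empty if none).
service_ready_ips() {
  local svc="$1" ns="${2:-default}"   # "default" namespace is an assumption
  kubectl get endpoints "$svc" --namespace "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# check_service SERVICE [NAMESPACE]
# Returns 0 and prints the IPs if at least one endpoint is ready, else 1.
check_service() {
  local ips
  ips="$(service_ready_ips "$@")" || return 1
  if [ -z "$ips" ]; then
    echo "no ready endpoints for $1"
    return 1
  fi
  echo "ready endpoints for $1: $ips"
}
```

Calling `check_service example-solrcloud-zookeeper-client` while the second zookeeper pod is timing out would show whether the service has anything behind it at that moment.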

janhoy commented 1 year ago

...where it seems like the zookeeper cluster never reaches a quorum; apparently timing out when the second zookeeper node attempts to connect to example-solrcloud-zookeeper-client:2181

I see the same on macOS on Apple M1, using Docker Desktop 4.9.1. There is some communication issue between the zk nodes, so I always run with only 1 zk when debugging on my dev MacBook.

ramayer commented 1 year ago

@janhoy

Thanks. Adding --set zk.provided.replicas=1 worked for me on the macOS/colima environment. However, the minikube-on-Ubuntu environment without containerd that I mentioned above still fails: there seems to be a different communication failure between the Solr nodes when trying to create a 3-shard collection on a 3-node cluster.
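For reference, this is roughly what the install looks like with that flag added. It is the same helm command as in the original report plus the single-ZooKeeper setting; the wrapper function name `install_example_solr_single_zk` is hypothetical and only there so the snippet can be sourced without immediately running helm.

```shell
# Sketch: the helm install from the original report, with one ZooKeeper
# replica so that no quorum between zk nodes is required.
# The function name is illustrative, not part of the chart or operator.
install_example_solr_single_zk() {
  helm install example-solr apache-solr/solr --version 0.6.0 \
    --set image.tag=9.0 \
    --set solrOptions.security.authenticationType="Basic" \
    --set solrOptions.javaMemory="-Xms300m -Xmx300m" \
    --set addressability.external.method=Ingress \
    --set addressability.external.domainName="ing.local.domain" \
    --set addressability.external.useExternalAddress="true" \
    --set ingressOptions.ingressClassName="nginx" \
    --set zk.provided.replicas=1
}
```

A single ZooKeeper node sidesteps the quorum problem rather than fixing it, so this seems suitable for local debugging only.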

risdenk commented 1 year ago

@ramayer can you maybe catch any of the error logs from:

where it seems like the zookeeper cluster never reaches a quorum; apparently timing out when the second zookeeper node attempts to connect to example-solrcloud-zookeeper-client:2181

and

where it seems each Solr instance can communicate with the other two just fine, but appears to have a network timeout when it attempts to communicate with another shard on the same host. I can create some collections, but am unable to create any collection that has as many shards as solr pod instances.

and any other cases where you seem to have an idea of what is happening. I know you shared how to reproduce (and that is helpful) - any log messages from ZK or Solr would potentially help as well.
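A minimal sketch for gathering those logs in one go, assuming the Solr and ZooKeeper pod names all contain `example-solrcloud` (true for the Solr pods and the zk service name mentioned in this thread, but an assumption for the zk pods). The function name `collect_logs` and the output directory are made up.

```shell
# collect_logs [NAME_FILTER] [OUT_DIR]
# Writes one log file per matching pod and prints the path of each file.
collect_logs() {
  local filter="${1:-example-solrcloud}" out_dir="${2:-./solr-debug-logs}"
  local pod name
  mkdir -p "$out_dir"
  # "kubectl get pods -o name" prints entries such as pod/example-solrcloud-0
  kubectl get pods -o name | grep -- "$filter" | while read -r pod; do
    name="${pod#pod/}"
    # Fetch logs from every container in the pod, with timestamps so that
    # ZK and Solr events can be lined up afterwards.
    kubectl logs "$pod" --all-containers --timestamps \
      > "$out_dir/$name.log" 2>&1 < /dev/null
    echo "$out_dir/$name.log"
  done
}
```

Running `collect_logs` right after a failed quorum or a failed collection create should capture the relevant window from each pod.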

ramayer commented 1 year ago

The error messages don't show much except for timeouts that look like this:

2022-12-08 21:22:07.442 WARN  (qtp1221991240-60) [] o.a.s.h.a.AdminHandlersProxy Timeout when fetching result from node default-example-solrcloud-1.ing.local.domain:80_solr => java.util.concurrent.TimeoutException
        at java.base/java.util.concurrent.FutureTask.get(Unknown Source)
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.FutureTask.get(Unknown Source) ~[?:?]
        at org.apache.solr.handler.admin.AdminHandlersProxy.maybeProxyToNodes(AdminHandlersProxy.java:112) ~[?:?]
        at org.apache.solr.handler.admin.SystemInfoHandler.handleRequestBody(SystemInfoHandler.java:137) ~[?:?]
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224) ~[?:?]
        at org.apache.solr.handler.admin.InfoHandler.handle(InfoHandler.java:94) ~[?:?]
        at org.apache.solr.handler.admin.InfoHandler.handleRequestBody(InfoHandler.java:82) ~[?:?]
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224) ~[?:?]
        at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:941) ~[?:?]
        at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:893) ~[?:?]
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:584) ~[?:?]
        at org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:250) ~[?:?]
        at org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:218) ~[?:?]
        at org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257) ~[?:?]
  ...

but here is an easy way to reproduce a similar timeout. On Minikube on Ubuntu using the docker runtime, and on Colima on macOS, this command

kubectl exec -it pod/example-solrcloud-1 -- curl default-example-solrcloud-0.ing.local.domain/solr/

works fine (returning the Solr admin page as you'd expect), but this

kubectl exec -it pod/example-solrcloud-1 -- curl default-example-solrcloud-1.ing.local.domain/solr/

times out. I think similar timeouts happen when a collection has multiple shards on the same node the client connected to.
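The two curl commands above can be generalized into a small probe of every pod against every external hostname, which makes the "pod cannot reach itself" pattern easy to spot. This is a sketch only: it assumes 3 Solr pods named `example-solrcloud-N` and hostnames of the form `default-example-solrcloud-N.ing.local.domain`, as in this thread; `probe` and `probe_matrix` are hypothetical helper names, and the 5-second timeout is arbitrary.

```shell
# probe POD HOST
# Prints "ok" if curl inside POD can fetch http://HOST/solr/, else "fail".
probe() {
  local from="$1" host="$2"
  if kubectl exec "$from" -- curl --silent --output /dev/null \
       --max-time 5 "http://$host/solr/" < /dev/null; then
    echo ok
  else
    echo fail
  fi
}

# probe_matrix [REPLICAS] [DOMAIN]
# Probes every Solr pod against every Solr pod's external hostname.
probe_matrix() {
  local replicas="${1:-3}" domain="${2:-ing.local.domain}" i=0 j
  while [ "$i" -lt "$replicas" ]; do
    j=0
    while [ "$j" -lt "$replicas" ]; do
      printf 'example-solrcloud-%s -> default-example-solrcloud-%s.%s: %s\n' \
        "$i" "$j" "$domain" \
        "$(probe "pod/example-solrcloud-$i" "default-example-solrcloud-$j.$domain")"
      j=$((j + 1))
    done
    i=$((i + 1))
  done
}
```

If the behaviour matches what I described, the lines where a pod targets its own hostname would be the ones reporting a failure.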

ramayer commented 1 year ago

For the minikube failure mode mentioned above, I think it's related to this minikube GitHub issue: https://github.com/kubernetes/minikube/issues/13370. The workarounds in those comments include adding --cni=bridge to the minikube startup line.
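As a sketch, the startup line with that workaround applied could look like the following. The wrapper name `start_minikube_bridge` is made up; any extra arguments are passed straight through to `minikube start`.

```shell
# start_minikube_bridge [EXTRA_MINIKUBE_FLAGS...]
# Starts minikube with the bridge CNI, as suggested in the linked issue.
# Assumption: an already-created cluster may need to be deleted and
# recreated before a different CNI takes effect.
start_minikube_bridge() {
  minikube start --cni=bridge "$@"
}
```

For example, `start_minikube_bridge --cpus 4` would run `minikube start --cni=bridge --cpus 4`.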

I still haven't had any luck finding workarounds for the colima --kubernetes distribution of Kubernetes.

Some comments there suggest that iptables-based implementations of services can have trouble when a pod tries to connect to itself through a service URL. I wonder if this operator could be tweaked so that pods do not try to reach themselves through the service name, but I don't understand enough to know if that makes sense.

apache/solr-operator issue #498