kriston13 opened this issue 4 years ago
Oh, some additional info.
On my first attempt, this was going into a namespace that has Istio injection enabled; the operators are still in there. However, my second attempt was to create the SolrCloud resource in a brand new namespace that does NOT have Istio.
Yeah, it looks like an issue with the Zookeeper cluster not being created; there's no StatefulSet there for ZK. Can you run kubectl get zookeepercluster and maybe look at the logs of the Zookeeper Operator pod?
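For example (the namespace and deployment names below are placeholders; adjust them to your setup):

# List the ZookeeperCluster resources the operator should have created
kubectl get zookeepercluster --all-namespaces

# Check the Zookeeper Operator's logs for reconcile or permission errors
kubectl logs deploy/<zookeeper-operator-deployment> -n <namespace>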
I've been doing more investigation. The ZK pods don't seem to have been created when I was creating the SolrCloud instance through the Solr operator:
Name:         solr-operator-68f7f897d5-xtc4m
Namespace:    operators
Priority:     0
Node:         gke-srchy-non-prod-medium-pool-ssd-4088ed54-m9l3/10.128.0.45
Start Time:   Mon, 15 Jun 2020 23:37:33 +1000
Labels:       control-plane=solr-operator
              controller-tools.k8s.io=1.0
              pod-template-hash=68f7f897d5
Annotations:  prometheus.io/scrape: true
Status:       Running
IP:           10.40.6.9
IPs:          <none>
Controlled By:  ReplicaSet/solr-operator-68f7f897d5
Containers:
  solr-operator:
    Container ID:  docker://5d9796f79a1ba878ed9aa508790dc86c550e9a73d1e4e618177e8945f35e0b72
    Image:         bloomberg/solr-operator:v0.2.5
    Image ID:      docker-pullable://bloomberg/solr-operator@sha256:11549df6061c187ef54216b81cc0fd607275e36b1e1bcd0255f39272f0c2c52d
    Port:          <none>
    Host Port:     <none>
    Args:
      -zk-operator=true
      -etcd-operator=false
      -ingress-base-domain=
    State:          Running
      Started:      Mon, 15 Jun 2020 23:37:34 +1000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     400m
      memory:  500Mi
    Requests:
      cpu:     100m
      memory:  100Mi
    Environment:
      POD_NAMESPACE:  operators (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from solr-operator-token-bqwnp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  solr-operator-token-bqwnp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  solr-operator-token-bqwnp
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
I can see that the -zk-operator flag is set to true (which differs from the documentation, which states it defaults to false). However, the Solr Cloud pods are generated.
Looking through the ZK pod, I could see that the ZK operator was having a lot of permission errors. Also, in order to get ZK to work with Istio, I needed to set a flag to enable quorumListenOnAllIPs: true. So I really needed more configuration of Zookeeper than I could get from the Solr operator's Zookeeper Operator setup. (It's also possible that I still don't understand operators well enough.)
So I went to the actual zookeeper-operator page, found their helm chart, and installed that.
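At the time their chart wasn't published to a Helm repository, so installing it meant something like the following, from a local clone of the operator repo (the chart path and release name here are assumptions on my part, not the exact commands I ran):

# Clone the operator repo and install its bundled chart locally;
# path and release name are illustrative.
git clone https://github.com/pravega/zookeeper-operator.git
helm install zookeeper-operator ./zookeeper-operator/charts/zookeeper-operator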
I then set up my Solr Cloud helm chart with the following templates:
---
# Source: solr-cloud/templates/solrcloud.yaml
apiVersion: solr.bloomberg.com/v1beta1
kind: SolrCloud
metadata:
  name: solr-cloud
spec:
  replicas: 4
  solrImage:
    tag: 8.1.1
  zookeeperRef:
    connectionInfo:
      internalConnectionString: boo-solr-zk-client:2181
      chRoot: /solr
---
# Source: solr-cloud/templates/zookeeper.yaml
apiVersion: zookeeper.pravega.io/v1beta1
kind: ZookeeperCluster
metadata:
  name: "boo-solr-zk"
spec:
  replicas: 3
  config:
    initLimit: 10
    tickTime: 2000
    syncLimit: 5
    quorumListenOnAllIPs: true
This SEEMS to work. I can port-forward to view the admin console and everything! And my Zookeeper instances don't seem to be hitting errors (though there are quite a few warnings).
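For reference, the port-forward is just the standard kubectl one; the service name below follows the <name>-solrcloud-common naming convention visible later in this thread, so treat it as an assumption:

# Forward the SolrCloud common service (which listens on port 80) to localhost
kubectl port-forward service/solr-cloud-solrcloud-common 8983:80

Then browse to http://localhost:8983/solr.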
I've just done a diff of the definitions from the ZK helm chart vs. the one in your dependency YAML, and the differences are mostly related to the service accounts & cluster role bindings. Specifically, the following all exist in the ZK helm chart and not in yours (a rough way to reproduce the comparison is sketched after this list):
ServiceAccount
ClusterRole
ClusterRoleBinding
RoleBinding
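The comparison was along these lines (file paths here are illustrative, not the actual paths in either repo):

# Render the ZK operator chart and diff it against the Solr operator's
# example dependency file; both paths are placeholders.
helm template zookeeper-operator ./zookeeper-operator/charts/zookeeper-operator > zk-chart-rendered.yaml
diff zk-chart-rendered.yaml path/to/solr-operator/zk_operator_dependency.yaml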
I'm not really sure how to fix this... Gut feel is that the zookeeper-operator helm chart needs to make its way into your Solr operator helm chart as a dependency (which I know is something you have flagged). And this still won't fix the issue of Istio requiring Zookeeper to have the quorumListenOnAllIPs: true flag.
I'm assuming that'd require a flag in how you set up the SolrCloud CRD.
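For what it's worth, you can inspect which options the installed CRD currently exposes with kubectl explain (this assumes the CRD publishes an OpenAPI schema):

# Show the documented fields of the SolrCloud spec
kubectl explain solrcloud.spec --recursive

# Or drill into the zookeeperRef section specifically
kubectl explain solrcloud.spec.zookeeperRef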
All things I don't know how to do, so I'm not sure if I can help any further at this point. :(
I'm including here a copy of the rendered template from the zookeeper helm chart. Hopefully that's useful.
This is very helpful and detailed, thank you so much!
Yeah, the goal is to have the ZK chart as a dependency, but they don't publish their chart yet. Once they start doing that, it will be a no-brainer (https://github.com/pravega/zookeeper-operator/issues/172). It's very strange that the ServiceAccount, ClusterRole and ClusterRoleBinding didn't show up though; they are in that example dependency file. It might be because of the version of the RBAC objects. Ours are v1beta1 instead of v1. Will do some more investigation into that.
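A quick way to check which RBAC API versions a given cluster actually serves:

kubectl api-versions | grep rbac.authorization.k8s.io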
And yeah, the Solr operator generally lags a bit behind the options available in the Zookeeper spec, but we try to update it to the most recent options as often as possible. We'll try to get it updated soon, probably when the next release of the ZK Operator (0.2.8) happens, which is when it looks like the Istio stuff will formally be released.
We're running into similar symptoms: we use a Helm chart to create a SolrCloud resource, and the Helm install returns while the pod is not yet in a Ready state. We find that eventually the solr-cloud pod and the associated StatefulSet will go ready, but it can take ten minutes. We will try out @kriston13's workaround described above (manually creating and referring to the Zookeeper instance) and see if that helps.
@HoustonPutman you said above:
Yeah, the goal is to have the ZK chart as a dependency, but they don't publish their chart yet. Once they start doing that, it will be a no brainer (pravega/zookeeper-operator#172).
They are publishing their chart now, as of September 17th. Here's the spec we're using for grabbing that chart (from Terraform):
resource "helm_release" "zookeeper" {
name = "zookeeper"
chart = "zookeeper-operator"
repository = "https://charts.pravega.io/"
namespace = data.kubernetes_namespace.operator.id
cleanup_on_fail = "true"
atomic = "true"
}
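The plain Helm CLI equivalent would be roughly this (release name and namespace are whatever fits your setup):

# Add the Pravega chart repo and install the Zookeeper operator from it
helm repo add pravega https://charts.pravega.io
helm repo update
helm install zookeeper pravega/zookeeper-operator --namespace operators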
This is what it typically looks like for us (again without having explicitly created and bound the zookeeper instance, just creating the CRD instance):
NAME                                               READY   STATUS    RESTARTS   AGE
pod/example-solrcloud-0                            0/1     Running   6          5m35s
pod/example-solrcloud-zookeeper-0                  1/1     Running   0          6m4s
pod/example-solrcloud-zookeeper-1                  1/1     Running   0          5m8s
pod/example-solrcloud-zookeeper-2                  1/1     Running   0          4m28s
pod/solr-solr-operator-79558f9656-z64kh            1/1     Running   0          6m54s
pod/zookeeper-zookeeper-operator-99567cf7f-fzj8l   1/1     Running   0          8m6s

NAME                                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                               AGE
service/example-solrcloud-0                    ClusterIP   10.101.101.24   <none>        80/TCP                                6m6s
service/example-solrcloud-1                    ClusterIP   10.103.19.135   <none>        80/TCP                                6m6s
service/example-solrcloud-2                    ClusterIP   10.98.177.23    <none>        80/TCP                                6m6s
service/example-solrcloud-common               ClusterIP   10.106.40.65    <none>        80/TCP                                6m6s
service/example-solrcloud-zookeeper-client     ClusterIP   10.102.186.9    <none>        2181/TCP                              6m6s
service/example-solrcloud-zookeeper-headless   ClusterIP   None            <none>        2181/TCP,2888/TCP,3888/TCP,7000/TCP   6m6s
service/kubernetes                             ClusterIP   10.96.0.1       <none>        443/TCP                               21d

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/solr-solr-operator             1/1     1            1           6m57s
deployment.apps/zookeeper-zookeeper-operator   1/1     1            1           8m15s

NAME                                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/solr-solr-operator-79558f9656            1         1         1       6m57s
replicaset.apps/zookeeper-zookeeper-operator-99567cf7f   1         1         1       8m15s

NAME                                           READY   AGE
statefulset.apps/example-solrcloud             0/3     5m36s
statefulset.apps/example-solrcloud-zookeeper   3/3     6m6s
When we describe the pod we get the following:
% kubectl describe pod/example-solrcloud-0
Name:         example-solrcloud-0
Namespace:    default
Priority:     0
Node:         docker-desktop/192.168.65.3
Start Time:   Thu, 22 Oct 2020 11:45:45 -0700
Labels:       app.kubernetes.io/managed-by=Helm
              controller-revision-hash=example-solrcloud-5d7749748b
              solr-cloud=example
              statefulset.kubernetes.io/pod-name=example-solrcloud-0
              technology=solr-cloud
Annotations:  <none>
Status:       Running
IP:           10.1.0.186
IPs:
  IP:  10.1.0.186
Controlled By:  StatefulSet/example-solrcloud
Init Containers:
  cp-solr-xml:
    Container ID:  docker://b89979c12fffac92f8551513d57431111d565bdae1cfe667d7a268ca7c593619
    Image:         library/busybox:1.28.0-glibc
    Image ID:      docker-pullable://busybox@sha256:0b55a30394294ab23b9afd58fab94e61a923f5834fba7ddbae7f8e0c11ba85e6
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      cp /tmp/solr.xml /tmp-config/solr.xml
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 22 Oct 2020 11:45:52 -0700
      Finished:     Thu, 22 Oct 2020 11:45:52 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /tmp from solr-xml (rw)
      /tmp-config from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qqlxv (ro)
Containers:
  solrcloud-node:
    Container ID:  docker://a3f8a75978b344c18ce03e4b57b6e3b89d46efc57c832f8a5a17447c81ba98ca
    Image:         docker.io/solr:8.6
    Image ID:      docker-pullable://solr@sha256:4bc85b046a26fde8e62a6cc27afd0f3e8fa7e3092736f737b46b75e995f8de74
    Port:          8983/TCP
    Host Port:     0/TCP
    Args:
      -DhostPort=80
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    143
      Started:      Thu, 22 Oct 2020 11:50:54 -0700
      Finished:     Thu, 22 Oct 2020 11:51:43 -0700
    Ready:          False
    Restart Count:  6
    Liveness:       http-get http://:8983/solr/admin/info/system delay=20s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:8983/solr/admin/info/system delay=15s timeout=1s period=5s #success=1 #failure=3
    Environment:
      SOLR_JAVA_MEM:   -Xms300m -Xmx300m
      SOLR_HOME:       /var/solr/data
      SOLR_PORT:       8983
      POD_HOSTNAME:    example-solrcloud-0 (v1:metadata.name)
      SOLR_HOST:       default-$(POD_HOSTNAME).ing.local.domain
      ZK_HOST:         example-solrcloud-zookeeper-client:2181/
      SOLR_LOG_LEVEL:  INFO
      SOLR_OPTS:
      GC_TUNE:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qqlxv (ro)
      /var/solr/data from data (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  solr-xml:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      example-solrcloud-configmap
    Optional:  false
  data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-qqlxv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qqlxv
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  6m57s                   default-scheduler  Successfully assigned default/example-solrcloud-0 to docker-desktop
  Normal   Pulled     6m52s                   kubelet            Container image "library/busybox:1.28.0-glibc" already present on machine
  Normal   Created    6m52s                   kubelet            Created container cp-solr-xml
  Normal   Started    6m51s                   kubelet            Started container cp-solr-xml
  Normal   Killing    6m1s                    kubelet            Container solrcloud-node failed liveness probe, will be restarted
  Normal   Pulled     5m59s (x2 over 6m50s)   kubelet            Container image "docker.io/solr:8.6" already present on machine
  Normal   Created    5m59s (x2 over 6m49s)   kubelet            Created container solrcloud-node
  Normal   Started    5m58s (x2 over 6m47s)   kubelet            Started container solrcloud-node
  Warning  Unhealthy  5m27s (x11 over 6m32s)  kubelet            Readiness probe failed: Get http://10.1.0.186:8983/solr/admin/info/system: dial tcp 10.1.0.186:8983: connect: connection refused
  Warning  Unhealthy  111s (x18 over 6m21s)   kubelet            Liveness probe failed: Get http://10.1.0.186:8983/solr/admin/info/system: dial tcp 10.1.0.186:8983: connect: connection refused
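When a pod is crash-looping like this, the previous container's logs usually say why; something along these lines:

# Logs from the last failed run of the Solr container
kubectl logs example-solrcloud-0 -c solrcloud-node --previous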
Describe the bug
Unable to create a Solr Cloud instance due to the first pod dying with liveness probe failures (on GKE).
To Reproduce
1. Deploy the Zookeeper operator, as per the instructions.
2. Install the Solr operator, as per the instructions (this is installed into a custom namespace, not default).
3. Create the new Solr Cloud instance. I actually used a custom helm chart for this... It is super simple, and basically a rejig of the tutorial version.
Then when I try to check what's created...
When I describe the pod, I get the following:
Expected behavior
Pods to be deployed and a Solr Cloud instance to be stood up.
Environment
GKE 1.14.10-gke.36
Additional context
Taking a look in the Solr pod's logs, I see errors relating to Zookeeper. I'm guessing that the Zookeeper pods (and services) aren't being stood up before the Solr pods, and that's what is causing the warnings & the failures. But... I'm brand new to operators and Solr Cloud, so I'm not 100% sure.
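One way to confirm that guess is to check Zookeeper's health directly before Solr comes up. A rough sketch, assuming the client service name matches the listings above and that ZK's four-letter-word commands are enabled:

# Ask the ZK client service "ruok" from a throwaway pod; expect "imok" back
kubectl run zk-check --rm -it --image=busybox --restart=Never -- \
  sh -c 'echo ruok | nc example-solrcloud-zookeeper-client 2181'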