apache / pulsar

Pulsar won't start on Kubernetes, some pods stuck in Init state #11277

Open archenroot opened 3 years ago

archenroot commented 3 years ago

Describe the bug

I am running a Vagrant k8s cluster on QEMU (libvirt) with 1 master and 2 nodes, using the following config (included only to show that it has enough resources):

DISK_COUNT ?= 1
DISK_SIZE_GB ?= 150
# VM Resources
MASTER_CPUS ?= 2
MASTER_MEMORY_SIZE_GB ?= 12
NODE_CPUS ?= 6
NODE_MEMORY_SIZE_GB ?= 32
NODE_COUNT ?= 2

I use the following values file (customized from the examples): https://gist.github.com/archenroot/c6c15b957758226473530825deae7649

I use the following sequence to install Pulsar on k8s (TLS is disabled because there was an additional issue with the webhook):

#!/usr/bin/env bash

helm repo add apache https://pulsar.apache.org/charts
helm repo update

#kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.4.0/cert-manager.yaml
#git clone https://github.com/apache/pulsar-helm-chart
cd pulsar-helm-chart

#sh scripts/cert-manager/install-cert-manager.sh
sh scripts/pulsar/prepare_helm_release.sh -n pulsar -k pulsar-mini -c \
--pulsar-superusers superadmin,proxy-admin,broker-admin,client-admin,admin
cd ..
rm -rf pulsar-helm-chart

helm install \
--values values-andromeda-local-cluster.yaml \
--namespace pulsar \
pulsar-mini apache/pulsar

During my tests I also ran into the following error with replicas set to 1 for BookKeeper, ZooKeeper and broker: apache pulsar statefulsets.apps does not implement the scale subresource on

At the moment I have a 2-replica config (as per the values file). I played a bit with disabling and enabling components, and after about 15 minutes the pulsar namespace looks like this:

zangetsu@andromeda ~ $ kubectl get pods -n pulsar
NAME                                         READY   STATUS     RESTARTS   AGE
pulsar-mini-broker-0                         0/1     Init:0/2   0          13m
pulsar-mini-proxy-0                          0/1     Init:0/2   0          8m16s
pulsar-mini-pulsar-init-d6r46                0/1     Init:0/3   0          14m
pulsar-mini-pulsar-manager-6c6889dff-brggz   1/1     Running    0          8m16s
pulsar-mini-toolset-0                        1/1     Running    0          8m16s
pulsar-mini-zookeeper-0                      1/1     Running    0          30m

The kubectl describe output for all pods in Init state is here:

zangetsu@andromeda ~ $ gh gist create pulsar-mini-broker-0.pod 
- Creating gist pulsar-mini-broker-0.pod
✓ Created gist pulsar-mini-broker-0.pod
https://gist.github.com/9ca2ee545aaae46c3a689dcd7c70f53d
zangetsu@andromeda ~ $ gh gist create pulsar-mini-proxy-0.pod 
- Creating gist pulsar-mini-proxy-0.pod
✓ Created gist pulsar-mini-proxy-0.pod
https://gist.github.com/fdcb20259cd58eb02b5541feace9694e
zangetsu@andromeda ~ $ gh gist create pulsar-mini-pulsar-init-d6r46.pod 
- Creating gist pulsar-mini-pulsar-init-d6r46.pod
✓ Created gist pulsar-mini-pulsar-init-d6r46.pod
https://gist.github.com/c08a7dab67f601a7289bc7338d20d07a

Expected behavior

Pulsar is up and running.

Screenshots

Pods in Octant: [image]
StatefulSets: [image]

Additional context

zangetsu@andromeda ~ $ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:52:14Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:14Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

I am a bit lost about where to look for the cause.

archenroot commented 3 years ago

OK, after examining the init container I understand that it is waiting for BookKeeper and ZooKeeper:

Image | apachepulsar/pulsar-all:2.7.2
-- | --
Image ID | docker-pullable://apachepulsar/pulsar-all@sha256:96d56238cbf57379b4d09f53e73bfb323787a6d79b36044276a515bb031c2218
Command | ['sh', '-c']
Args | [' until bin/bookkeeper org.apache.zookeeper.ZooKeeperMain -server pulsar-cs-zookeeper:2181 get /admin/clusters/pulsar-mini; do echo "pulsar cluster pulsar-mini isn't initialized yet ... check in 3 seconds ..." && sleep 3; done;']

I wonder whether pulsar-cs-zookeeper is a valid address in the Kubernetes cluster. It should be using the service's cluster DNS name, pulsar-mini-zookeeper.pulsar.svc.cluster.local, and not pulsar-cs-zookeeper.

But I will need to examine pod networking first...
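For reference, a Service inside a cluster is normally addressable as `<service>.<namespace>.svc.<cluster-domain>`, and the bare `<service>` form only resolves from pods in the same namespace. The following is a minimal sketch of that naming rule. The helper name `svc_fqdn` is made up for illustration, and `cluster.local` is assumed as the cluster domain (a common default, but configurable).

```shell
# Hypothetical helper: build the in-cluster DNS name of a Service.
# Usage: svc_fqdn <service> <namespace> [cluster-domain]
svc_fqdn() {
  # The cluster domain falls back to cluster.local (an assumption;
  # a cluster can be configured with a different domain).
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Example: the ZooKeeper Service of the pulsar-mini release in the pulsar namespace.
svc_fqdn pulsar-mini-zookeeper pulsar
```

A name like `pulsar-cs-zookeeper` only resolves if a Service with exactly that name exists in the namespace of the pod doing the lookup.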

archenroot commented 3 years ago

So I tested connectivity from the ZooKeeper pod itself via the service's cluster DNS name:

root@pulsar-mini-zookeeper-0:/pulsar# telnet pulsar-mini-zookeeper.pulsar.svc.cluster.local 2181
Trying 10.222.104.7...
Connected to pulsar-mini-zookeeper.pulsar.svc.cluster.local.
Escape character is '^]'.
stats
Zookeeper version: 3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built on 02/10/2020 11:30 GMT
Clients:
 /10.222.104.7:50250[1](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/6/66
Received: 2254
Sent: 2253
Connections: 1
Outstanding: 0
Zxid: 0x1000009a1
Mode: follower
Node count: 5
Connection closed by foreign host.

So ZooKeeper itself is reachable, which means the problem is with how the other pods connect to it. I searched for the command prefix from the wait-for-zookeeper init container:

[' until bin/bookkeeper org.apache.zookeeper.ZooKeeperMain -server pulsar-cs-zookeeper:2181 get /admin/clusters/pulsar-mini; do echo "pulsar cluster pulsar-mini isn't initialized yet ... check in 3 seconds ..." && sleep 3; done;']

I think this is the issue: -server pulsar-cs-zookeeper:2181

It does not even resolve from the ZooKeeper pod itself:

root@pulsar-mini-zookeeper-0:/pulsar# ping pulsar-cs-zookeeper
ping: pulsar-cs-zookeeper: Name or service not known

So I searched for where it comes from:

zangetsu@andromeda ~/proj/infrastructure/k8s-vagrant-multi-node_archenroot/k8s/apache-pulsar $ grep -R "until bin/bookkeeper"
pulsar-helm-chart/charts/pulsar/templates/_autorecovery.tpl:until bin/bookkeeper shell whatisinstanceid; do
pulsar-helm-chart/charts/pulsar/templates/pulsar-cluster-initialize.yaml:          until bin/bookkeeper shell whatisinstanceid; do
pulsar-helm-chart/charts/pulsar/templates/_bookkeeper.tpl:until bin/bookkeeper shell whatisinstanceid; do
pulsar-helm-chart/charts/pulsar/templates/_bookkeeper.tpl:until bin/bookkeeper shell whatisinstanceid; do
pulsar-helm-chart/charts/pulsar/templates/broker-statefulset.yaml:            until bin/bookkeeper org.apache.zookeeper.ZooKeeperMain -server {{ template "pulsar.configurationStore.connect" . }} get {{ .Values.configurationStoreMetadataPrefix }}/admin/clusters/{{ template "pulsar.cluster.name" . }}; do
pulsar-helm-chart/charts/pulsar/templates/broker-statefulset.yaml:            until bin/bookkeeper org.apache.zookeeper.ZooKeeperMain -server {{ template "pulsar.zookeeper.connect" . }} get {{ .Values.metadataPrefix }}/admin/clusters/{{ template "pulsar.cluster.name" . }}; do
pulsar-helm-chart/charts/pulsar/templates/broker-statefulset.yaml:            until bin/bookkeeper shell whatisinstanceid; do

Searching further:

zangetsu@andromeda ~/proj/infrastructure/k8s-vagrant-multi-node_archenroot/k8s/apache-pulsar $ grep -R "pulsar.zookeeper.connect"
pulsar-helm-chart/charts/pulsar/templates/_zookeeper.tpl:{{- define "pulsar.zookeeper.connect" -}}

In the template file it is defined as:

{{/*
Define the pulsar zookeeper
*/}}
{{- define "pulsar.zookeeper.connect" -}}
{{$zk:=.Values.pulsar_metadata.userProvidedZookeepers}}
{{- if and (not .Values.components.zookeeper) $zk }}
{{- $zk -}}
{{ else }}
{{- if not (and .Values.tls.enabled .Values.tls.zookeeper.enabled) -}}
{{ template "pulsar.zookeeper.service" . }}:{{ .Values.zookeeper.ports.client }}
{{- end -}}
{{- if and .Values.tls.enabled .Values.tls.zookeeper.enabled -}}
{{ template "pulsar.zookeeper.service" . }}:{{ .Values.zookeeper.ports.clientTls }}
{{- end -}}
{{- end -}}
{{- end -}}

So I tried setting userProvidedZookeepers:

pulsar_metadata:
  configurationStore: pulsar-cs-zookeeper
  configurationStoreMetadataPrefix: "/configuration-store"
  userProvidedZookeepers: "pulsar-mini-zookeeper.pulsar.svc.cluster.local:2181"
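Reading the `pulsar.zookeeper.connect` template above, `userProvidedZookeepers` seems to be used only when `components.zookeeper` is disabled; with the bundled ZooKeeper enabled, the template falls through to the chart's own service name and port. The sketch below mirrors that branching in plain shell to make it easier to follow. The function name, its argument order, and the TLS port value in the example call are assumptions for illustration, not part of the chart.

```shell
# Sketch of the branching in the "pulsar.zookeeper.connect" template.
# Arguments (hypothetical, for illustration only):
#   $1  components.zookeeper                    (true/false)
#   $2  pulsar_metadata.userProvidedZookeepers  (may be empty)
#   $3  tls.enabled AND tls.zookeeper.enabled   (true/false)
#   $4  rendered ZooKeeper service name
#   $5  zookeeper.ports.client
#   $6  zookeeper.ports.clientTls
zk_connect() {
  if [ "$1" != "true" ] && [ -n "$2" ]; then
    # Bundled ZooKeeper disabled and an external address given: use it as-is.
    printf '%s\n' "$2"
  elif [ "$3" = "true" ]; then
    # Bundled ZooKeeper with TLS: service name plus TLS client port.
    printf '%s:%s\n' "$4" "$6"
  else
    # Bundled ZooKeeper without TLS: service name plus plain client port.
    printf '%s:%s\n' "$4" "$5"
  fi
}

# With the bundled ZooKeeper enabled, the user-provided value is ignored
# and this prints pulsar-mini-zookeeper:2181 (2281 is an assumed TLS port).
zk_connect true "pulsar-mini-zookeeper.pulsar.svc.cluster.local:2181" false \
  pulsar-mini-zookeeper 2181 2281
```

If this reading is right, setting `userProvidedZookeepers` alone would not change the rendered connect string while `components.zookeeper` stays enabled.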

archenroot commented 3 years ago

So I finally got this command to run on the ZooKeeper pod:

root@pulsar-mini-zookeeper-0:/pulsar# until bin/bookkeeper org.apache.zookeeper.ZooKeeperMain -server pulsar-mini-zookeeper.pulsar.svc.cluster.local:2181 get /admin/clusters/pulsar-mini; do echo "pulsar cluster pulsar-mini isn't initialized yet ... check in 3 seconds ..." && sleep 3; done;

It results in the following error:

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /admin/clusters/pulsar-mini

I would be really happy to see this error surfaced somewhere :-))) so that I would not need to dig so deep.
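Output from a blocked init container, including exceptions like this one, can usually be read with `kubectl logs` and the `-c` flag once the container name is known. A small sketch under that assumption follows. The wrapper names `init_containers` and `init_logs` are hypothetical, and the init container names depend on the chart version, so they are listed rather than hard-coded.

```shell
# List the init container names of a pod (space-separated).
# Usage: init_containers <pod> <namespace>
init_containers() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.initContainers[*].name}'
}

# Print the logs of one init container.
# Usage: init_logs <pod> <namespace> <init-container>
init_logs() {
  kubectl logs "$1" -n "$2" -c "$3"
}

# Example, commented out because it needs a live cluster:
# for c in $(init_containers pulsar-mini-broker-0 pulsar); do
#   echo "== $c"
#   init_logs pulsar-mini-broker-0 pulsar "$c"
# done
```

Running the loop against a pod stuck in `Init:0/2` should show which wait loop is still spinning and what the ZooKeeper client prints on each attempt.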

archenroot commented 3 years ago

Ref: https://github.com/apache/pulsar/issues/4480

I am now getting suspicious that the cluster metadata is what is not in good shape:

   - >
            {{- include "pulsar.toolset.zookeeper.tls.settings" . | nindent 12 }}
            bin/pulsar initialize-cluster-metadata \
              --cluster {{ template "pulsar.cluster.name" . }} \
              --zookeeper {{ template "pulsar.zookeeper.connect" . }}{{ .Values.metadataPrefix }} \
              {{- if .Values.pulsar_metadata.configurationStore }}
              --configuration-store {{ template "pulsar.configurationStore.connect" . }}{{ .Values.pulsar_metadata.configurationStoreMetadataPrefix }} \
              {{- end }}
              {{- if not .Values.pulsar_metadata.configurationStore }}
              --configuration-store {{ template "pulsar.zookeeper.connect" . }}{{ .Values.metadataPrefix }} \
              {{- end }}
              --web-service-url http://{{ template "pulsar.fullname" . }}-{{ .Values.broker.component }}.{{ template "pulsar.namespace" . }}.svc.{{ .Values.clusterDomain }}:{{ .Values.broker.ports.http }}/ \
              --web-service-url-tls https://{{ template "pulsar.fullname" . }}-{{ .Values.broker.component }}.{{ template "pulsar.namespace" . }}.svc.{{ .Values.clusterDomain }}:{{ .Values.broker.ports.https }}/ \
              --broker-service-url pulsar://{{ template "pulsar.fullname" . }}-{{ .Values.broker.component }}.{{ template "pulsar.namespace" . }}.svc.{{ .Values.clusterDomain }}:{{ .Values.broker.ports.pulsar }}/ \
              --broker-service-url-tls pulsar+ssl://{{ template "pulsar.fullname" . }}-{{ .Values.broker.component }}.{{ template "pulsar.namespace" . }}.svc.{{ .Values.clusterDomain }}:{{ .Values.broker.ports.pulsarssl }}/ || true;

This is from the pulsar-cluster-initialize.yaml file, where the metadata gets initialized.

The script above looks suspicious to me because it provides both non-TLS and TLS URL endpoints, but maybe Pulsar can handle this. I have TLS disabled in the values file, so I would not expect to see any https URLs.

archenroot commented 3 years ago

I have not been able to figure it out, but in the ZooKeeper logs at startup I see more suspicious messages:

pulsar-mini-zookeeper
21:54:01.902 [WorkerSender[myid=1]] WARN org.apache.zookeeper.server.quorum.QuorumPeer - Failed to resolve address: pulsar-mini-zookeeper-1.pulsar-mini-zookeeper.pulsar.svc.cluster.local
java.net.UnknownHostException: pulsar-mini-zookeeper-1.pulsar-mini-zookeeper.pulsar.svc.cluster.local
    at java.net.InetAddress.getAllByName0(InetAddress.java:1281) ~[?:1.8.0_282]
    at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_282]
    at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[?:1.8.0_282]

The message is only a WARN, but the address it is trying to connect to looks wrong to me: the extra pulsar-mini-zookeeper segment shouldn't be there.
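That said, the hostname in the warning appears to follow the usual StatefulSet pod DNS pattern, `<pod>.<governing-service>.<namespace>.svc.<cluster-domain>`, in which the service name legitimately follows the pod name. If so, the warning may simply mean that the `pulsar-mini-zookeeper-1` pod was not running or not yet resolvable at that moment. A minimal sketch of the pattern, where the helper name `sts_pod_fqdn` is hypothetical and `cluster.local` is an assumed default:

```shell
# Hypothetical helper: build the DNS name of a StatefulSet pod.
# Usage: sts_pod_fqdn <statefulset> <ordinal> <service> <namespace> [cluster-domain]
sts_pod_fqdn() {
  # Pod name is <statefulset>-<ordinal>; the governing service comes next.
  printf '%s-%s.%s.%s.svc.%s\n' "$1" "$2" "$3" "$4" "${5:-cluster.local}"
}

# Second ZooKeeper replica, assuming the StatefulSet and its governing
# service are both named pulsar-mini-zookeeper.
sts_pod_fqdn pulsar-mini-zookeeper 1 pulsar-mini-zookeeper pulsar
```

The result matches the address in the log line, so checking whether that pod exists (`kubectl get pods -n pulsar`) is probably a better next step than looking for a bug in the name.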

archenroot commented 3 years ago

So, it seems the cluster starts when I comment out the following configuration:

#metadataPrefix: "/cluster1"

pulsar_metadata:
#  configurationStore: pulsar-cs-zookeeper
#  pulsar-cs-zookeeper
#  configurationStoreMetadataPrefix: "/configuration-store"

I only enabled a limited set of components: [image]

I will continue uncommenting and redeploying to see which attribute causes the failures.

I also need to enable ingress NodePort for the services so that I can easily access them from localhost for testing.

archenroot commented 3 years ago

So with metadataPrefix: "/cluster1" enabled, the cluster won't start: [image]

archenroot commented 3 years ago

The same failed state is observed with:

pulsar_metadata:
  configurationStore: pulsar-cs-zookeeper

So enabling either metadataPrefix or configurationStore prevents the cluster from initializing.

codelipenghui commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.