Closed michaeljmarshall closed 9 months ago
Here is another failure: https://github.com/apache/pulsar-helm-chart/actions/runs/3280376295/jobs/5410958061
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] Exception in thread "main" com.google.common.util.concurrent.UncheckedExecutionException: org.apache.bookkeeper.bookie.BookieException$MetadataStoreException: Failed to get cluster instance id
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.tools.cli.commands.bookies.InstanceIdCommand.apply(InstanceIdCommand.java:61)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.bookie.BookieShell$WhatIsInstanceId.runCmd(BookieShell.java:1495)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.bookie.BookieShell$MyCommand.runCmd(BookieShell.java:238)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.bookie.BookieShell.run(BookieShell.java:2278)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.bookie.BookieShell.main(BookieShell.java:2369)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] Caused by: java.util.concurrent.ExecutionException: org.apache.bookkeeper.bookie.BookieException$MetadataStoreException: Failed to get cluster instance id
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.meta.MetadataDrivers.runFunctionWithMetadataBookieDriver(MetadataDrivers.java:355)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.meta.MetadataDrivers.runFunctionWithRegistrationManager(MetadataDrivers.java:375)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.tools.cli.commands.bookies.InstanceIdCommand.apply(InstanceIdCommand.java:49)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] ... 4 more
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] Caused by: org.apache.bookkeeper.bookie.BookieException$MetadataStoreException: Failed to get cluster instance id
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.discover.ZKRegistrationManager.getClusterInstanceId(ZKRegistrationManager.java:429)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.tools.cli.commands.bookies.InstanceIdCommand.lambda$apply$0(InstanceIdCommand.java:52)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.meta.MetadataDrivers.lambda$runFunctionWithRegistrationManager$1(MetadataDrivers.java:375)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.meta.MetadataDrivers.runFunctionWithMetadataBookieDriver(MetadataDrivers.java:350)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] ... 6 more
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for BookKeeper metadata
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] at org.apache.bookkeeper.discover.ZKRegistrationManager.getClusterInstanceId(ZKRegistrationManager.java:419)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] ... 9 more
This failure is surprising given that the zk pod is described as pod/pulsar-ci-zookeeper-0 1/1 Running 0 26m
. The ZK logs indicate a number expired sessions. Here is a copy of some o the logs:
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:08,204+0000 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701bb, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:09,976+0000 [epollEventLoopGroup-7-1] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:14,202+0000 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701bc, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:18,205+0000 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701be, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:19,273+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:20,851+0000 [epollEventLoopGroup-7-1] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:22,467+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-broker,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:23,782+0000 [epollEventLoopGroup-7-1] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:26,203+0000 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701c1, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:26,204+0000 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701c2, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:28,953+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-zookeeper,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:28,954+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.NettyServerCnxn - Processing ruok command from /127.0.0.1:44528
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:32,326+0000 [epollEventLoopGroup-7-1] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-zookeeper,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:32,342+0000 [epollEventLoopGroup-7-1] INFO org.apache.zookeeper.server.NettyServerCnxn - Processing ruok command from /127.0.0.1:41404
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:34,427+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:37,350+0000 [epollEventLoopGroup-7-1] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-broker,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:37,502+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:38,202+0000 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701c7, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:38,202+0000 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701c6, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:38,203+0000 [epollEventLoopGroup-7-1] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:46,204+0000 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701c9, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:49,930+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:50,871+0000 [epollEventLoopGroup-7-1] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-broker,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:53,015+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:53,699+0000 [epollEventLoopGroup-7-1] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:54,202+0000 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701cd, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:54,202+0000 [SessionTracker] INFO org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701cc, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:59,022+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-zookeeper,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:59,027+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.NettyServerCnxn - Processing ruok command from /127.0.0.1:57552
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:01,805+0000 [epollEventLoopGroup-7-1] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-broker,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:02,216+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-zookeeper,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:02,237+0000 [epollEventLoopGroup-7-2] INFO org.apache.zookeeper.server.NettyServerCnxn - Processing ruok command from /127.0.0.1:52924
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:04,027+0000 [epollEventLoopGroup-7-1] INFO org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:04,216+0000 [epollEventLoopGroup-7-2] ERROR org.apache.zookeeper.server.NettyServerCnxnFactory - Unsuccessful handshake with session 0x0
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:04,568+0000 [epollEventLoopGroup-7-1] ERROR org.apache.zookeeper.server.NettyServerCnxnFactory - Unsuccessful handshake with session 0x0
Just discovered that one of the errors in the bookie-init job are expected:
bin/apply-config-from-env.py conf/bookkeeper.conf; /pulsar/keytool/keytool.sh toolset ${HOSTNAME}.pulsar-ci-toolset.pulsar.svc.cluster.local true; if bin/bookkeeper shell whatisinstanceid; then
echo "bookkeeper cluster already initialized";
else
bin/bookkeeper shell initnewcluster;
fi
Resolved.
Describe the bug
Pulsar Helm Chart (ZK & BK TLS Only) / lint-test (pull_request)
andPrecommit - Pulsar Helm Chart (ZK TLS Only) / lint-test (pull_request)
fail frequently. The errors are often different.Examples https://github.com/apache/pulsar-helm-chart/actions/runs/3280376295/attempts/3
In that example, I can see this error in the GitHub workflow logs:
The error indicates the server isn't listening on the port. However, when I look at the exported logs, I can see the following:
It seems like something might be wrong with the environment because sometimes other pods fail for other reasons.
Additional context I'll continue to add examples to this issue as I find them.