apache / pulsar-helm-chart

Official Apache Pulsar Helm Chart
https://pulsar.apache.org/
Apache License 2.0
212 stars 224 forks source link

Flaky TLS Tests #311

Closed michaeljmarshall closed 9 months ago

michaeljmarshall commented 2 years ago

Describe the bug Pulsar Helm Chart (ZK & BK TLS Only) / lint-test (pull_request) and Precommit - Pulsar Helm Chart (ZK TLS Only) / lint-test (pull_request) fail frequently. The errors are often different.

Examples https://github.com/apache/pulsar-helm-chart/actions/runs/3280376295/attempts/3

In that example, I can see this error in the GitHub workflow logs:

2022-10-19T16:35:47.4105412Z pulsar               12m         Warning   Unhealthy                 pod/pulsar-ci-zookeeper-0                                              Liveness probe failed: 139980898686272:error:0200206F:system library:connect:Connection refused:../crypto/bio/b_sock2.c:110:
2022-10-19T16:35:47.4106017Z 139980898686272:error:2008A067:BIO routines:BIO_connect:connect error:../crypto/bio/b_sock2.c:111:
2022-10-19T16:35:47.4106458Z 139980898686272:error:0200206F:system library:connect:Connection refused:../crypto/bio/b_sock2.c:110:
2022-10-19T16:35:47.4106891Z 139980898686272:error:2008A067:BIO routines:BIO_connect:connect error:../crypto/bio/b_sock2.c:111:
2022-10-19T16:35:47.4107202Z connect:errno=111

The error indicates the server isn't listening on the port. However, when I look at the exported logs, I can see the following:

[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T16:23:16,817+0000 [main] INFO  org.apache.zookeeper.server.NettyServerCnxnFactory - binding to port 0.0.0.0/0.0.0.0:2281
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T16:23:16,823+0000 [main] INFO  org.apache.zookeeper.server.NettyServerCnxnFactory - bound to port 2281

It seems like something might be wrong with the environment because sometimes other pods fail for other reasons.

Additional context I'll continue to add examples to this issue as I find them.

michaeljmarshall commented 2 years ago

Here is another failure: https://github.com/apache/pulsar-helm-chart/actions/runs/3280376295/jobs/5410958061

[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] Exception in thread "main" com.google.common.util.concurrent.UncheckedExecutionException: org.apache.bookkeeper.bookie.BookieException$MetadataStoreException: Failed to get cluster instance id
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.tools.cli.commands.bookies.InstanceIdCommand.apply(InstanceIdCommand.java:61)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.bookie.BookieShell$WhatIsInstanceId.runCmd(BookieShell.java:1495)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.bookie.BookieShell$MyCommand.runCmd(BookieShell.java:238)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.bookie.BookieShell.run(BookieShell.java:2278)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.bookie.BookieShell.main(BookieShell.java:2369)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] Caused by: java.util.concurrent.ExecutionException: org.apache.bookkeeper.bookie.BookieException$MetadataStoreException: Failed to get cluster instance id
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.meta.MetadataDrivers.runFunctionWithMetadataBookieDriver(MetadataDrivers.java:355)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.meta.MetadataDrivers.runFunctionWithRegistrationManager(MetadataDrivers.java:375)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.tools.cli.commands.bookies.InstanceIdCommand.apply(InstanceIdCommand.java:49)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     ... 4 more
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] Caused by: org.apache.bookkeeper.bookie.BookieException$MetadataStoreException: Failed to get cluster instance id
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.discover.ZKRegistrationManager.getClusterInstanceId(ZKRegistrationManager.java:429)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.tools.cli.commands.bookies.InstanceIdCommand.lambda$apply$0(InstanceIdCommand.java:52)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.meta.MetadataDrivers.lambda$runFunctionWithRegistrationManager$1(MetadataDrivers.java:375)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.meta.MetadataDrivers.runFunctionWithMetadataBookieDriver(MetadataDrivers.java:350)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     ... 6 more
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid] Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for BookKeeper metadata
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     at org.apache.bookkeeper.discover.ZKRegistrationManager.getClusterInstanceId(ZKRegistrationManager.java:419)
[pod/pulsar-ci-bookie-1/pulsar-bookkeeper-verify-clusterid]     ... 9 more

This failure is surprising given that the zk pod is described as pod/pulsar-ci-zookeeper-0 1/1 Running 0 26m. The ZK logs indicate a number expired sessions. Here is a copy of some o the logs:

[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:08,204+0000 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701bb, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:09,976+0000 [epollEventLoopGroup-7-1] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:14,202+0000 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701bc, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:18,205+0000 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701be, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:19,273+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:20,851+0000 [epollEventLoopGroup-7-1] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:22,467+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-broker,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:23,782+0000 [epollEventLoopGroup-7-1] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:26,203+0000 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701c1, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:26,204+0000 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701c2, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:28,953+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-zookeeper,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:28,954+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.NettyServerCnxn - Processing ruok command from /127.0.0.1:44528
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:32,326+0000 [epollEventLoopGroup-7-1] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-zookeeper,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:32,342+0000 [epollEventLoopGroup-7-1] INFO  org.apache.zookeeper.server.NettyServerCnxn - Processing ruok command from /127.0.0.1:41404
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:34,427+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:37,350+0000 [epollEventLoopGroup-7-1] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-broker,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:37,502+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:38,202+0000 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701c7, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:38,202+0000 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701c6, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:38,203+0000 [epollEventLoopGroup-7-1] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:46,204+0000 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701c9, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:49,930+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:50,871+0000 [epollEventLoopGroup-7-1] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-broker,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:53,015+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:53,699+0000 [epollEventLoopGroup-7-1] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:54,202+0000 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701cd, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:54,202+0000 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer - Expiring session 0x1000005841701cc, timeout of 30000ms exceeded
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:59,022+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-zookeeper,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:56:59,027+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.NettyServerCnxn - Processing ruok command from /127.0.0.1:57552
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:01,805+0000 [epollEventLoopGroup-7-1] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-broker,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:02,216+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-zookeeper,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:02,237+0000 [epollEventLoopGroup-7-2] INFO  org.apache.zookeeper.server.NettyServerCnxn - Processing ruok command from /127.0.0.1:52924
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:04,027+0000 [epollEventLoopGroup-7-1] INFO  org.apache.zookeeper.server.auth.X509AuthenticationProvider - Authenticated Id 'CN=pulsar-ci-bookie,OU=IT Department,O=StreamNative,ST=San Francisco,C=US' for Scheme 'x509'
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:04,216+0000 [epollEventLoopGroup-7-2] ERROR org.apache.zookeeper.server.NettyServerCnxnFactory - Unsuccessful handshake with session 0x0
[pod/pulsar-ci-zookeeper-0/pulsar-ci-zookeeper] 2022-10-19T09:57:04,568+0000 [epollEventLoopGroup-7-1] ERROR org.apache.zookeeper.server.NettyServerCnxnFactory - Unsuccessful handshake with session 0x0
michaeljmarshall commented 2 years ago

Just discovered that one of the errors in the bookie-init job are expected:

            bin/apply-config-from-env.py conf/bookkeeper.conf; /pulsar/keytool/keytool.sh toolset ${HOSTNAME}.pulsar-ci-toolset.pulsar.svc.cluster.local true; if bin/bookkeeper shell whatisinstanceid; then
                echo "bookkeeper cluster already initialized";
            else
                bin/bookkeeper shell initnewcluster;
            fi
lhotari commented 9 months ago

Resolved.