apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
14.13k stars 3.57k forks source link

Flaky-test: PulsarSinkE2ETest.shutdown(org.apache.pulsar.io.PulsarSinkE2ETest) #13954

Open michaeljmarshall opened 2 years ago

michaeljmarshall commented 2 years ago

PulsarSinkE2ETest is flaky. The shutdown(org.apache.pulsar.io.PulsarSinkE2ETest) test method fails sporadically.

https://github.com/apache/pulsar/runs/4942620767?check_suite_focus=true

Error:  shutdown(org.apache.pulsar.io.PulsarSinkE2ETest)  Time elapsed: 10.1 s  <<< FAILURE!
java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /loadbalance/brokers/localhost:34747
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
    at java.base/java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:777)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
    at org.apache.pulsar.metadata.impl.ZKMetadataStore.handleDeleteResult(ZKMetadataStore.java:262)
    at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$batchOperation$6(ZKMetadataStore.java:174)
    at org.apache.pulsar.metadata.impl.PulsarZooKeeperClient$3$1.processResult(PulsarZooKeeperClient.java:490)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:722)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:563)
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /loadbalance/brokers/localhost:34747
    at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:420)
    ... 5 more
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /loadbalance/brokers/localhost:34747
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:122)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:416)
    ... 5 more
lhotari commented 2 years ago

Is it possible that this is a real production issue? #13892 has been reported with a similar stacktrace.

I also spotted a similar stacktrace in https://github.com/apache/pulsar/runs/4989586769?check_suite_focus=true#step:9:1975

Error:  shutdown(org.apache.pulsar.broker.loadbalance.SimpleLoadManagerImplTest)  Time elapsed: 0.313 s  <<< FAILURE!
java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /loadbalance/brokers/localhost:35469
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
    at java.base/java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:777)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
    at org.apache.pulsar.metadata.impl.ZKMetadataStore.handleDeleteResult(ZKMetadataStore.java:264)
    at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$batchOperation$6(ZKMetadataStore.java:176)
    at org.apache.pulsar.metadata.impl.PulsarZooKeeperClient$3$1.processResult(PulsarZooKeeperClient.java:490)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:722)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:563)
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /loadbalance/brokers/localhost:35469
    at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:422)
    ... 5 more
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /loadbalance/brokers/localhost:35469
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:122)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:418)
    ... 5 more
michaeljmarshall commented 2 years ago

@lhotari - yes, it is definitely possible. I didn't have time to inspect the stack trace when I created this issue, so I haven't looked into it yet. However, given the BadVersion exception, it is probably an issue that should take priority.

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

tisonkun commented 2 years ago

Could you verify whether this exception occurs while it suffers a ZK disconnection? If there is a period of disconnection, it's possible that the op success on server but fail to response, and since we retry on CONNECTIONLOSS then it can be this case. #13892 can be in such situation, cc @MichalKoziorowski-TomTom

// ZKMetadataStore#batchOperation 8914b84115bacaf38fe892d66533c1b70431acbf
                if (results == null) {
                    Code code = Code.get(rc);
                    if (code == Code.CONNECTIONLOSS) {
                        // There is the chance that we caused a connection reset by sending or requesting a batch
                        // that passed the max ZK limit. Retry with the individual operations
                        executor.schedule(() -> {
                            ops.forEach(o -> batchOperation(Collections.singletonList(o)));
                        }, 100, TimeUnit.MILLISECONDS);
                    } else {
                        MetadataStoreException e = getException(code, path);
                        ops.forEach(o -> o.getFuture().completeExceptionally(e));
                    }
                    return;
                }