michaeljmarshall opened 2 years ago
Is it possible that this is a real production issue? #13892 has been reported with a similar stacktrace.
I also spotted a similar stacktrace in https://github.com/apache/pulsar/runs/4989586769?check_suite_focus=true#step:9:1975
Error: shutdown(org.apache.pulsar.broker.loadbalance.SimpleLoadManagerImplTest) Time elapsed: 0.313 s <<< FAILURE!
java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /loadbalance/brokers/localhost:35469
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
at java.base/java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:777)
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
at org.apache.pulsar.metadata.impl.ZKMetadataStore.handleDeleteResult(ZKMetadataStore.java:264)
at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$batchOperation$6(ZKMetadataStore.java:176)
at org.apache.pulsar.metadata.impl.PulsarZooKeeperClient$3$1.processResult(PulsarZooKeeperClient.java:490)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:722)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:563)
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /loadbalance/brokers/localhost:35469
at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:422)
... 5 more
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /loadbalance/brokers/localhost:35469
at org.apache.zookeeper.KeeperException.create(KeeperException.java:122)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:418)
... 5 more
@lhotari - yes, it is definitely possible. I didn't have time to inspect the stack trace when I created this issue, so I haven't looked into it yet. However, given the BadVersion exception, it is probably an issue that should take priority.
The issue had no activity for 30 days, mark with Stale label.
Could you verify whether this exception occurs during a ZK disconnection? If there is a period of disconnection, it's possible that the operation succeeded on the server but the response never made it back to the client, and since we retry on CONNECTIONLOSS, the replayed op can then fail with BadVersion. #13892 could be the same situation. cc @MichalKoziorowski-TomTom
// ZKMetadataStore#batchOperation at commit 8914b84115bacaf38fe892d66533c1b70431acbf
if (results == null) {
    Code code = Code.get(rc);
    if (code == Code.CONNECTIONLOSS) {
        // There is the chance that we caused a connection reset by sending or requesting a batch
        // that passed the max ZK limit. Retry with the individual operations
        executor.schedule(() -> {
            ops.forEach(o -> batchOperation(Collections.singletonList(o)));
        }, 100, TimeUnit.MILLISECONDS);
    } else {
        MetadataStoreException e = getException(code, path);
        ops.forEach(o -> o.getFuture().completeExceptionally(e));
    }
    return;
}
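To make the hazard concrete: ZooKeeper applies a versioned write at most once per expected version, so if the original batch took effect on the server but the response was lost in the connection drop, resubmitting the same op with the same expected version is guaranteed to fail. Below is a minimal, self-contained sketch of that behavior against a plain ZooKeeper client. The class name RetryBadVersionSketch, the znode path /demo-badversion, and the localhost:2181 connect string are placeholders of my own, not anything from the Pulsar code base.

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class RetryBadVersionSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; any reachable ZK ensemble works.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });

        String path = "/demo-badversion";
        zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        Stat stat = zk.exists(path, false);
        int expectedVersion = stat.getVersion(); // 0 right after create

        // First attempt: imagine this reaches the server and is applied,
        // but the response is lost because the connection drops. To the
        // client this looks like CONNECTIONLOSS, not success.
        zk.setData(path, "v1".getBytes(), expectedVersion);

        // Replay with the same expected version, as a CONNECTIONLOSS retry
        // would do: the znode's version is now 1, so the server rejects it.
        try {
            zk.setData(path, "v1".getBytes(), expectedVersion);
        } catch (KeeperException.BadVersionException e) {
            System.out.println("Replay hit BadVersion: " + e.getMessage());
        }

        zk.delete(path, -1); // clean up
        zk.close();
    }
}

The delete in the stack trace above could fail through a similar race: if the /loadbalance/brokers/... znode is recreated (for example, the broker re-registers after a reconnect), the replayed versioned delete sees a version that no longer matches the one the original batch expected.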
PulsarSinkE2ETest is flaky: the shutdown(org.apache.pulsar.io.PulsarSinkE2ETest) test method fails sporadically.
https://github.com/apache/pulsar/runs/4942620767?check_suite_focus=true