Open zbentley opened 3 years ago
The following issues were all observed in response to similar testing:
https://github.com/apache/pulsar-client-cpp/issues/86
https://github.com/apache/pulsar/issues/12556
https://github.com/apache/pulsar/issues/12555
https://github.com/apache/pulsar/issues/12554
https://github.com/apache/pulsar/issues/12553
https://github.com/apache/pulsar/issues/12552
https://github.com/apache/pulsar/issues/12551
The condition that caused these issues to occur appears to be interaction with various Pulsar entities (e.g. creating/deleting things in the management API, or attempting to create consumers) immediately after those entities were created or immediately after entities with the same name were deleted.
I think the number of issues observed speaks to a defect in the management API functionality in general. Considering the severity of these issues (in many cases it is possible to force a topic/namespace into a permanently corrupted state), I hope a resolution can be found for the general/common root cause rather than fixing individual bug-inducing conditions.
I suspect that the common root cause is that many management API operations that should be synchronous are asynchronous.
Ideally, the resolution of all of these issues would be the same: a management API operation (any operation) should not return successfully until all observable side effects of that operation across a Pulsar cluster (including brokers, proxies, bookies, and ZK) have completed. All caches of metadata (e.g. on all brokers/proxies in the cluster) related to the operation should be cleared, and all persistent state (including ledger deletion, bookie cleanup, ZooKeeper metadata, etc.) should be updated during management API operations, and not afterwards.
If that means that management API operations take many seconds or minutes, that's still vastly preferable to not knowing when it is safe to interact with a cluster again after performing "DDL"-type changes.
The issue had no activity for 30 days, mark with Stale label.
This is reproducible on 2.9.1, though I'm not sure if the underlying cause is the same as it was before. It's possible that one or more issues preventing immediate consumer-connect were resolved, and a more garden-variety issue is now occurring but is being obscured by incorrect error reporting (https://github.com/apache/pulsar/issues/15078).
This issue is similar to apache/pulsar#15078, but it concerns the consumer side, so I think we need another fix like https://github.com/apache/pulsar/pull/15161, which only fixes the producer side.
Describe the bug
If I create a topic/subscription via the management API and then immediately attempt to connect a consumer to that subscription via the C++ client, the `subscribe` call sometimes (rarely; less than half of the time) fails with `UnknownError`. This error does not coincide with any obvious issues in the broker or proxy log; sorry for the light detail on this bug.
To Reproduce
Run the reproduction plan for https://github.com/apache/pulsar/issues/12551; sometimes no error will occur, sometimes the error described in that issue or others will occur, and sometimes this error will occur during the last step (connecting a consumer).
Expected behavior
Client logs
This error is surfaced in Python (client version 2.8.1, on EKS Amazon Linux) as `UnknownError`.
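Until the root cause is addressed, one client-side mitigation is to retry the failing `subscribe` call with a short backoff. The following is only a minimal sketch of that idea, not a fix and not something taken from the Pulsar codebase: `call_with_retry` and its parameters are hypothetical names introduced here, and it assumes the failure is transient, which may not hold if the topic or namespace has ended up in a permanently corrupted state.

```python
import time


def call_with_retry(operation, attempts=5, initial_delay=0.1, backoff=2.0,
                    sleep=time.sleep):
    """Call `operation()` until it succeeds or `attempts` calls have failed.

    Hypothetical helper for illustration only. It catches every Exception
    because the client reports this failure as a generic UnknownError, so
    there is nothing more specific to match on.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                # Out of attempts: re-raise the last error unchanged.
                raise
            sleep(delay)
            delay *= backoff  # Wait longer before each subsequent attempt.


# Assumed usage with the Python client (service URL, topic, and subscription
# names are placeholders):
#
#   import pulsar
#   client = pulsar.Client("pulsar://localhost:6650")
#   consumer = call_with_retry(
#       lambda: client.subscribe("my-topic", "my-subscription"))
```

Retrying like this only hides the symptom, and it cannot tell a transient failure apart from a permanently broken topic, so it is a stopgap at best.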