zbentley opened this issue 3 years ago
The following issues were all observed in response to similar testing:
https://github.com/apache/pulsar-client-cpp/issues/86
https://github.com/apache/pulsar/issues/12556
https://github.com/apache/pulsar/issues/12555
https://github.com/apache/pulsar/issues/12554
https://github.com/apache/pulsar/issues/12553
https://github.com/apache/pulsar/issues/12552
https://github.com/apache/pulsar/issues/12551
The condition that triggered these issues appears to be interacting with various Pulsar entities (e.g. creating/deleting things via the management API, or attempting to create consumers) immediately after those entities were created, or immediately after entities with the same name were deleted.
I think the number of issues observed points to a defect in the management API functionality in general. Considering the severity of these issues (in many cases it is possible to force a topic/namespace into a permanently corrupted state), I hope a resolution can be found for the common root cause rather than fixing the individual bug-inducing conditions one by one.
I suspect that the common root cause is that many management API operations are asynchronous when they should not be.
Ideally, the resolution of all of these issues would be the same: a management API operation (any operation) should not return successfully until all observable side effects of that operation across the Pulsar cluster (including brokers, proxies, bookies, and ZooKeeper) have completed. All metadata caches related to the operation (e.g. on all brokers/proxies in the cluster) should be cleared, and all persistent state (including ledger deletion, bookie cleanup, ZooKeeper metadata, etc.) should be updated during the management API operation, not afterwards.
If that means that management API operations take many seconds or minutes, that's still vastly preferable to not knowing when it is safe to interact with a cluster again after performing "DDL"-type changes.
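(To illustrate what "waiting for observable side effects" means in practice: today, the closest a client can get is polling the admin API itself until the change becomes visible, along the lines of the sketch below. The method name and timeout are my own placeholders; this is not an existing Pulsar feature.)

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.admin.PulsarAdmin;

public class AwaitConvergence {
    // Client-side polling until a deleted topic stops being listed in its
    // namespace, i.e. until the deletion's side effects become observable.
    // "topic" must be the full name, e.g. "persistent://tenant/ns/t1".
    static void awaitTopicGone(PulsarAdmin admin, String namespace, String topic)
            throws Exception {
        long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(2); // arbitrary timeout
        while (admin.namespaces().getTopics(namespace).contains(topic)) {
            if (System.nanoTime() > deadline) {
                throw new IllegalStateException("Topic still visible after deletion: " + topic);
            }
            TimeUnit.MILLISECONDS.sleep(500); // arbitrary poll interval
        }
    }
}
```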
> The condition that triggered these issues appears to be interacting with various Pulsar entities (e.g. creating/deleting things via the management API, or attempting to create consumers) immediately after those entities were created, or immediately after entities with the same name were deleted.
> I think the number of issues observed points to a defect in the management API functionality in general. Considering the severity of these issues (in many cases it is possible to force a topic/namespace into a permanently corrupted state), I hope a resolution can be found for the common root cause rather than fixing the individual bug-inducing conditions one by one.
> I suspect that the common root cause is that many management API operations are asynchronous when they should not be.
All operations return only after a successful write to the metadata service.
If that can induce a permanently inconsistent state, it's a bug that absolutely needs to be fixed.
> Ideally, the resolution of all of these issues would be the same: a management API operation (any operation) should not return successfully until all observable side effects of that operation across the Pulsar cluster (including brokers, proxies, bookies, and ZooKeeper) have completed. All metadata caches related to the operation (e.g. on all brokers/proxies in the cluster) should be cleared, and all persistent state (including ledger deletion, bookie cleanup, ZooKeeper metadata, etc.) should be updated during the management API operation, not afterwards.
> If that means that management API operations take many seconds or minutes, that's still vastly preferable to not knowing when it is safe to interact with a cluster again after performing "DDL"-type changes.
In reality, this is not an easy thing to do in any distributed system, short of completely eliminating the caches.
In a distributed system, there is no way to guarantee that all caches have been invalidated.
> If that can induce a permanently inconsistent state, it's a bug that absolutely needs to be fixed.
This specific issue is a permanent inconsistency. The others usually go away after waiting a few minutes.
> In reality, this is not an easy thing to do in any distributed system, short of completely eliminating the caches.
Is there some way instead to bypass cached metadata on a per-operation basis? Could that be exposed as a parameter for management API operations?
Alternatively, would it be possible to add behavior like "when the backend encounters an error that could be due to an out-of-date cache, dump the cache and try once more against authoritative state"?
I ask because the current workaround (adding what amounts to large numbers of sleeps and retries around common management API and create-producer/subscribe interactions) is pretty costly in both runtime and complexity.
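(For concreteness, that workaround amounts to something like the following Java admin-client sketch; the retry count and backoff are arbitrary placeholders, and the same pattern ends up wrapped around most admin and subscribe calls.)

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.admin.PulsarAdminException;

public class RetryingDelete {
    // Retries a namespace deletion with a fixed backoff, on the theory that
    // some failures are caused by stale metadata caches that eventually expire.
    static void deleteNamespaceWithRetry(PulsarAdmin admin, String namespace)
            throws PulsarAdminException, InterruptedException {
        final int maxAttempts = 10;    // arbitrary
        final long backoffSeconds = 5; // arbitrary
        for (int attempt = 1; ; attempt++) {
            try {
                admin.namespaces().deleteNamespace(namespace);
                return;
            } catch (PulsarAdminException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                TimeUnit.SECONDS.sleep(backoffSeconds);
            }
        }
    }
}
```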
In a system whose metadata store is not under heavy load, the local caches should be updated in a timely manner. If the system does not converge to a consistent view of the metadata, then we have bugs. We recently found a problem with the usage of Caffeine in the local metadata cache.
The alternative is to switch to a distributed cache that guarantees causal consistency for the near caches: that is, when I write to the cache and the write returns, I am sure that every local copy in each near cache has been invalidated (or updated).
Some years ago I worked on a project that solves this problem, https://github.com/diennea/blazingcache. It implements a very lightweight cache with near-caching support and strong guarantees about causal consistency, because we really needed such guarantees for distributed cache invalidation.
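(To make the causal-consistency guarantee concrete, here is a minimal single-process sketch, with hypothetical class names rather than the blazingcache API: a write does not return until every registered near cache has processed the invalidation, so after the write completes no reader can observe the stale value.)

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

public class CausalCacheSketch {
    // A "near cache": a local copy of some entries held by one node.
    static final class NearCache {
        private final Map<String, String> local = new ConcurrentHashMap<>();
        void put(String k, String v) { local.put(k, v); }
        String get(String k) { return local.get(k); }
        // Invalidation is synchronous here; over a network this would be an
        // RPC whose acknowledgement the coordinator waits for.
        void invalidate(String k) { local.remove(k); }
    }

    // The coordinator: a write returns only after every near cache has
    // processed the invalidation, which is what "causal consistency" buys us.
    static final class Coordinator {
        private final Map<String, String> authoritative = new ConcurrentHashMap<>();
        private final Set<NearCache> nearCaches = new CopyOnWriteArraySet<>();
        void register(NearCache c) { nearCaches.add(c); }
        void write(String k, String v) {
            authoritative.put(k, v);
            for (NearCache c : nearCaches) {
                c.invalidate(k); // blocks until acknowledged
            }
            // Only now does the write "return".
        }
    }
}
```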
@lhotari @michaeljmarshall @merlimat @codelipenghui
I can confirm that this issue seems to persist in Pulsar 3.0.0.
Describe the bug
If I create a namespace, put a persistent, partitioned topic in it, and then immediately thereafter delete the topic and attempt to delete the namespace, namespace deletion sometimes fails with a 500 error that corresponds to "MetadataNotFoundException: Managed ledger not found" in the broker's log.
Note that this error condition is persistent: once it occurs, it keeps occurring, and the namespace can never be deleted.
Note that I was connecting to this broker through a Pulsar proxy; my client is programmed to raise an exception on 307 redirect codes, so the 307 response served by the broker in the log snippet below must have been handled by the proxy, not by my client.
To Reproduce
Run the reproduction plan for https://github.com/apache/pulsar/issues/12551; sometimes no error will occur, sometimes the error described in that issue or others will occur, and sometimes this error will occur.
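(For reference, the sequence described above amounts to roughly this Java admin-client sketch; the service URL and the namespace/topic names are placeholders.)

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class ReproduceNamespaceDeleteFailure {
    public static void main(String[] args) throws Exception {
        // Assumes a broker/proxy HTTP endpoint at this placeholder URL.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();
        String ns = "public/repro-ns";
        String topic = "persistent://" + ns + "/repro-topic";

        admin.namespaces().createNamespace(ns);
        admin.topics().createPartitionedTopic(topic, 4);

        // Delete the topic and immediately attempt to delete the namespace;
        // this is the step that intermittently fails with a 500
        // ("Managed ledger not found") and then keeps failing.
        admin.topics().deletePartitionedTopic(topic);
        admin.namespaces().deleteNamespace(ns);

        admin.close();
    }
}
```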
Expected behavior
Both deletions succeed: the topic and then the namespace can be deleted without error.
Environment
Same environment as https://github.com/apache/pulsar/issues/12551
What my client sees
Broker Stacktrace