apache / pulsar-client-cpp

Apache Pulsar C++ client library
https://pulsar.apache.org/
Apache License 2.0

Attempting to connect a consumer to a very recently created topic/subscription sometimes fails with UnknownError on the client #86

Open zbentley opened 3 years ago

zbentley commented 3 years ago

Describe the bug

If I create a topic/subscription via the management API and then immediately attempt to connect a consumer to that subscription via the C++ client, the subscribe call sometimes (rarely; less than half of the time) fails with UnknownError.

This error does not coincide with any obvious issues in the broker or proxy log; sorry for the light detail on this bug.

To Reproduce

Run the reproduction plan for https://github.com/apache/pulsar/issues/12551; sometimes no error will occur, sometimes the error described in that issue or others will occur, and sometimes this error will occur during the last step (connecting a consumer).

Expected behavior

  1. Consumer connection either succeeds or fails with an informative error indicating what action is needed to correct this condition.
  2. This is probably more important: interactions with Pulsar (e.g. creating consumers) that happen a very short time after the entities being interacted with (topics/subscriptions in this case) are created should succeed. This bug and the other similar ones I filed (see GitHub links below) all seem to arise from CRUD operations with the management API being asynchronous: i.e. when I create/delete a tenant/topic/namespace, the actual side effects of that creation or deletion (e.g. adding/removing ledgers in BookKeeper, updating metadata in ZK) occur later, not during the API POST. Not only is that bound to cause bugs like this, but it's also not what users expect; I would be happy to wait seconds or minutes for management API operations to complete in exchange for knowing that, when they complete successfully, the thing I requested has actually been done.

Client logs

[persistent://blt6/chariot_ns_test/chariot_topic_test-partition-1, blt, 1] Failed to create consumer: UnknownError
Closing the consumer failed for partition - 1
Unable to create Consumer for partition - 1 Error - UnknownError
Closing the consumer failed for partition - 0
[persistent://blt6/chariot_ns_test/chariot_topic_test-partition-3, blt, 3] Failed to close consumer: ConnectError

This error is surfaced in Python (client version 2.8.1, on EKS amazon linux) as UnknownError.
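Until the root cause is addressed, one possible stopgap is to retry the subscribe call with a short backoff. The following is only a minimal sketch under stated assumptions, not a fix: `subscribe_with_retry` and its `connect` argument are hypothetical names, and `connect` stands in for whatever the caller uses to create the consumer (for example, a lambda wrapping `client.subscribe(topic, subscription_name)`). It also assumes that any failure immediately after creation may be transient, which may hide genuine errors.

```python
import time


def subscribe_with_retry(connect, attempts=5, initial_delay=0.1,
                         max_delay=2.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    connect is a hypothetical zero-argument callable supplied by the
    caller; it should create and return the consumer.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception:
            # Assumption: a failure right after topic creation may be
            # transient, so every exception type is retried here.
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

The `sleep` parameter exists only so the delay can be replaced in tests; callers would normally leave it at its default.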

zbentley commented 3 years ago

The following issues were all observed in response to similar testing:

https://github.com/apache/pulsar-client-cpp/issues/86
https://github.com/apache/pulsar/issues/12556
https://github.com/apache/pulsar/issues/12555
https://github.com/apache/pulsar/issues/12554
https://github.com/apache/pulsar/issues/12553
https://github.com/apache/pulsar/issues/12552
https://github.com/apache/pulsar/issues/12551

The condition that caused these issues appears to be interaction with various Pulsar entities (e.g. creating/deleting things in the management API, or attempting to create consumers) immediately after those entities were created, or immediately after entities with the same name were deleted.

I think the number of issues observed speaks to a defect in the management API functionality in general. Considering the severity of these issues (in many cases it is possible to force a topic/namespace into a permanently corrupted state), I hope a resolution can be found for the general/common root cause rather than fixing individual bug-inducing conditions.

I suspect that the common root cause is that many management API operations that should be synchronous are asynchronous.

Ideally, the resolution of all of these issues would be the same: a management API operation--any operation--should not return successfully until all observable side effects of that operation across a Pulsar cluster (including brokers, proxies, bookies, and ZK) are complete. All caches of metadata (e.g. on all brokers/proxies in the cluster) related to the operation should be cleared, and all persistent state (including ledger deletion, bookie cleanup, ZooKeeper metadata, etc.) should be updated during the management API operation, not afterwards.

If that means that management API operations take many seconds or minutes, that's still vastly preferable to not knowing when it is safe to interact with a cluster again after performing "DDL"-type changes.
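From the caller's side, the closest approximation available today is to block after a management call until some readiness check passes. The sketch below illustrates that pattern only; `wait_until_ready` and `is_ready` are hypothetical names, and what `is_ready` should actually check (and whether any client-visible check is sufficient to prove the cluster has converged) is an open assumption, not something this issue establishes.

```python
import time


def wait_until_ready(is_ready, timeout=30.0, interval=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or the timeout elapses.

    is_ready is a hypothetical zero-argument callable supplied by the
    caller. Returns True if it passed in time, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False  # gave up: the caller decides how to report this
        sleep(interval)
```

A synchronous management API would make this loop unnecessary, since the create/delete call itself would not return until the check would pass.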

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

zbentley commented 2 years ago

This is reproducible on 2.9.1, though I'm not sure if the underlying cause is the same as it was before. It's possible that one or more issues preventing immediate consumer-connect were resolved, and a more garden-variety issue is now occurring but is being obscured by incorrect error reporting (https://github.com/apache/pulsar/issues/15078).

BewareMyPower commented 2 years ago

This issue is similar to apache/pulsar#15078, but it's related to the consumer. https://github.com/apache/pulsar/pull/15161 only fixes the producer side, so I think we need another fix like it for the consumer.