apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

[Bug] Broker sees topic fenced #20526

Open KannarFr opened 1 year ago

KannarFr commented 1 year ago

Version

2.11.1

Minimal reproduce step

I have a cluster with thousands of topics, and one of them became fenced. See the broker's logs:

Jun 07 11:20:48 yo-pulsar-broker-c3-n4 pulsar[336]: 2023-06-07T11:20:48,089+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] WARN  org.apache.pulsar.broker.service.AbstractTopic - [persistent://tenant/ns/topic-partition-0] Attempting to add producer to a fenced topic
Jun 07 11:20:48 yo-pulsar-broker-c3-n4 pulsar[336]: 2023-06-07T11:20:48,089+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] WARN  org.apache.pulsar.broker.service.ServerCnx - [/192.168.1.3:46348] Failed to add producer to topic persistent://tenant/ns/topic-partition-0: producerId=360, Topic is temporarily unavailable
Jun 07 11:20:48 yo-pulsar-broker-c3-n4 pulsar[336]: 2023-06-07T11:20:48,102+0000 [BookKeeperClientWorker-OrderedExecutor-10-0] WARN  org.apache.pulsar.broker.service.AbstractTopic - [persistent://tenant/ns/topic-partition-0] Attempting to add producer to a fenced topic
Jun 07 11:20:48 yo-pulsar-broker-c3-n4 pulsar[336]: 2023-06-07T11:20:48,102+0000 [BookKeeperClientWorker-OrderedExecutor-10-0] WARN  org.apache.pulsar.broker.service.ServerCnx - [/192.168.1.3:46348] Failed to add producer to topic persistent://tenant/ns/topic-partition-0: producerId=361, Topic is temporarily unavailable
Jun 07 11:20:48 yo-pulsar-broker-c3-n4 pulsar[336]: 2023-06-07T11:20:48,112+0000 [BookKeeperClientWorker-OrderedExecutor-10-0] WARN  org.apache.pulsar.broker.service.AbstractTopic - [persistent://tenant/ns/topic-partition-0] Attempting to add producer to a fenced topic
Jun 07 11:20:48 yo-pulsar-broker-c3-n4 pulsar[336]: 2023-06-07T11:20:48,112+0000 [BookKeeperClientWorker-OrderedExecutor-10-0] WARN  org.apache.pulsar.broker.service.ServerCnx - [/192.168.1.3:46348] Failed to add producer to topic persistent://tenant/ns/topic-partition-0: producerId=362, Topic is temporarily unavailable

The topic stayed fenced for 30 minutes. After I restarted the broker, everything looked fine again. Any idea what could cause this? Could the fenced state be wrongly cached, or something similar?
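
To measure how long a topic stayed fenced from logs like the ones above, here is a minimal Python sketch. It assumes the exact log layout shown in this report (timestamp such as 2023-06-07T11:20:48,089+0000, then the topic name in square brackets before the warning text). `fenced_windows` is a hypothetical helper written for illustration, not part of Pulsar.

```python
import re
from datetime import datetime

# Matches the AbstractTopic warning seen in the broker logs above.
# Assumption: the timestamp and "[topic] message" layout match this report.
FENCED_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}[+-]\d{4})"
    r".*?\[(?P<topic>(?:non-)?persistent://[^\]]+)\] "
    r"Attempting to add producer to a fenced topic"
)


def fenced_windows(lines):
    """Return {topic: (first_seen, last_seen, count)} for fenced-topic warnings."""
    windows = {}
    for line in lines:
        match = FENCED_RE.search(line)
        if match is None:
            continue  # not a fenced-topic warning
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S,%f%z")
        topic = match.group("topic")
        first, last, count = windows.get(topic, (ts, ts, 0))
        windows[topic] = (min(first, ts), max(last, ts), count + 1)
    return windows
```

Passing an open log file (or any iterable of lines) to `fenced_windows` gives, per topic, the first and last time the warning appeared and how many times a producer was rejected; `last - first` approximates the length of the fenced period.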

What did you expect to see?

Not fenced

What did you see instead?

Fenced

Anything else?

No response

lhotari commented 1 year ago

Thanks for reporting this. This is an issue which has stayed unresolved for many years. Some previous reports: #5284 and #14941.

There's an ugly workaround for the problem: setting topicFencingTimeoutSeconds=5 on the brokers releases the "fencing" after 5 seconds. However, this may cause other problems, such as data inconsistency; if metadata gets overwritten, it could lead to data loss.
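
As a sketch, the workaround would look like the fragment below. It assumes the brokers read their settings from conf/broker.conf; adjust for however your deployment passes broker configuration. Only the setting name and value come from this comment.

```properties
# conf/broker.conf (assumed location of the broker settings)
# Workaround only: release a stuck "fenced" state after 5 seconds.
# Carries the data consistency risk described above.
topicFencingTimeoutSeconds=5
```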

The recently merged fixes #18688 and #20527 could help improve the situation. I happened to investigate problems in this area yesterday.

I have created #20540 to address some issues that I have observed in the current solution. One of the remaining challenges in the PR is adding proper test coverage. I'm also waiting for feedback from other code contributors on the PR before finishing it. I'd appreciate feedback on the PR #20540.

poorbarcode commented 1 year ago

@KannarFr

Jun 07 11:20:48 yo-pulsar-broker-c3-n4 pulsar[336]: 2023-06-07T11:20:48,112+0000 [BookKeeperClientWorker-OrderedExecutor-10-0] WARN org.apache.pulsar.broker.service.AbstractTopic - [persistent://tenant/ns/topic-partition-0] Attempting to add producer to a fenced topic

Do you know whether a bundle unload or a namespace unload was executed at that time?

You can check the HTTP request log to confirm it.

KannarFr commented 1 year ago

Unfortunately, I don't retain such logs. I should have taken a dump, my bad.

github-actions[bot] commented 1 year ago

The issue had no activity for 30 days, mark with Stale label.

StevenLeRoux commented 10 months ago

We're still impacted by this issue. Restarting brokers all day long isn't a sustainable solution. I wonder whether any production deployment is unaffected by this issue and, if so, how?

lhotari commented 10 months ago

@StevenLeRoux Which Pulsar version are you using? Do you have a chance to test #20540 with a custom build?

StevenLeRoux commented 10 months ago

@lhotari Thanks for pointing out #20540.

We're currently using v3.1.1, but we'll get a chance to test with #20540 in a few days (cc @KannarFr).

lhotari commented 10 months ago

@StevenLeRoux @KannarFr FYI, there's a new bug report, #21860, in this area, with a promising bug fix for the BookKeeper client in the works.

KannarFr commented 10 months ago

Thanks for pinging us @lhotari.