apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

[Bug] Broker loses bookie rack information in new Pulsar version #23282

Open TakaHiR07 opened 2 months ago

TakaHiR07 commented 2 months ago

Search before asking

Read release policy

Version

pulsar-3.0.6

Minimal reproduce step

  1. Stop a bookie in the cluster.
  2. Restart the broker.
  3. Restart the stopped bookie.
  4. Observe that the bookie's rack information is lost and becomes /defaultRegion/defaultRack.

What did you expect to see?

..

What did you see instead?

After upgrading to pulsar-3.0.6, we observed that when a bookie restarts, the rack information of some bookies becomes /defaultRegion/defaultRack, which is incorrect.

After digging into the code and the error log, I believe this issue is caused by this PR: https://github.com/apache/pulsar/pull/22846. That PR made BookieRackAffinityMapping#watchAvailableBookies asynchronous. However, I think this operation cannot be asynchronous.

Let's look at what happens when the bookieClient is constructed in Pulsar. The code is at https://github.com/apache/bookkeeper/blob/1f1df813b9b4efd410925caadfa45cfb17b811ba/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/BookKeeper.java#L409-L548

When the broker receives a bookie-creation notification from the metadataStore, execution enters this code block, which runs the first listener and then the second listener. https://github.com/apache/pulsar/blob/a8ae3e4d191c75f291ccb29577c181926a5f4e5d/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/bookkeeper/PulsarRegistrationClient.java#L221-L233

When the second listener runs placementPolicy.onClusterChanged(), it eventually reaches this code and executes resolver.resolve(names). The resolver implementation is BookieRackAffinityMapping#resolve. https://github.com/apache/bookkeeper/blob/1f1df813b9b4efd410925caadfa45cfb17b811ba/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/TopologyAwareEnsemblePlacementPolicy.java#L554-L585

Therefore, the second listener actually depends on the first listener. They must be executed synchronously and in order.

But the first listener now runs asynchronously. So when a bookie restarts, the broker can permanently lose the rack information of that bookie, which causes serious problems.
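To make the ordering dependency concrete, here is a minimal Java sketch of the race described above. It is not the Pulsar or BookKeeper code: `ListenerOrderingDemo`, `racksByHost`, and the rack `/region-a/rack-1` are hypothetical stand-ins. The first listener plays the role of the rack-mapping update and the second plays the role of the resolve step. The latch forces the worst case, where the asynchronous update runs only after the second listener has resolved the bookie.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for illustration; not the real Pulsar/BookKeeper classes.
class ListenerOrderingDemo {
    static final String DEFAULT_RACK = "/default-region/default-rack";

    /**
     * Simulates one "bookie created" notification delivered to two listeners in order
     * and returns the rack that the second listener resolves for the bookie.
     */
    static String resolveAfterNotification(boolean asyncFirstListener) throws Exception {
        Map<String, String> racksByHost = new ConcurrentHashMap<>();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch secondListenerDone = new CountDownLatch(1);
        String bookie = "ip1:port1";

        // First listener: refreshes the rack mapping (assumed rack value).
        Runnable updateRacks = () -> racksByHost.put(bookie, "/region-a/rack-1");
        try {
            if (asyncFirstListener) {
                // Worst case: the async update is scheduled but runs only
                // after the second listener has already resolved the bookie.
                executor.submit(() -> {
                    try {
                        secondListenerDone.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    updateRacks.run();
                });
            } else {
                // Synchronous: the mapping is updated before the second listener runs.
                updateRacks.run();
            }

            // Second listener: resolves the bookie's rack, falling back to the default.
            String resolved = racksByHost.getOrDefault(bookie, DEFAULT_RACK);
            secondListenerDone.countDown();
            return resolved;
        } finally {
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("sync:  " + resolveAfterNotification(false));
        System.out.println("async: " + resolveAfterNotification(true));
    }
}
```

In the synchronous case the second listener sees the updated mapping; in the asynchronous case it falls back to the default rack. The sketch only models the ordering, not why the wrong rack then stays in place in the broker.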

We added a log statement in BookieRackAffinityMapping#updateRacksWithHost and confirmed that the problem occurs whenever the asynchronous code runs after the placement policy has already tried to resolve the bookie's network location.

14:38:18.628 [metadata-store-38-1] INFO  org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient - Bookie ip1:port1 created. path: /ledgers/available/ip1:port1
14:38:18.629 [metadata-store-38-1] INFO  org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient - Bookie ip1:port1 created. path: /ledgers/available/ip1:port1
14:38:18.635 [metadata-store-38-1] INFO  org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient - Update BookieInfoCache (writable bookie) ip1:port1 -> BookieServiceInfo{properties={}, endpoints=[EndpointInfo{id=bookie, port=port1, host=ip1, protocol=bookie-rpc, auth=[], extensions=[]}]}
14:38:18.636 [metadata-store-38-1] INFO  org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient - Update BookieInfoCache (writable bookie) ip1:port1 -> BookieServiceInfo{properties={}, endpoints=[EndpointInfo{id=bookie, port=port1, host=ip1, protocol=bookie-rpc, auth=[], extensions=[]}]}
14:38:18.637 [pulsar-registration-client-46-1] WARN  org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy - Failed to resolve network location for ip1, using default rack for it : /default-region/default-rack.
14:38:18.637 [pulsar-registration-client-63-1] WARN  org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy - Failed to resolve network location for ip1, using default rack for it : /default-region/default-rack.
14:38:18.637 [pulsar-registration-client-63-1] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Adding a new node: /default-region/default-rack/ip1:port1
14:38:18.637 [pulsar-registration-client-46-1] INFO  org.apache.bookkeeper.net.NetworkTopologyImpl - Adding a new node: /default-region/default-rack/ip1:port1
14:38:18.638 [pulsar-registration-client-63-1] WARN  org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy - Failed to resolve network location for ip1, using default rack for it : /default-region/default-rack.
14:38:18.638 [pulsar-registration-client-46-1] WARN  org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy - Failed to resolve network location for ip1, using default rack for it : /default-region/default-rack.
14:38:18.640 [metadata-store-38-1] INFO  org.apache.pulsar.bookie.rackawareness.BookieRackAffinityMapping - trigger updateRacksWithHost

Anything else?

pulsar-2.9 does not have this issue.

Are you willing to submit a PR?