[X] I searched in the issues and found nothing similar.
Read release policy
[X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.
Version
pulsar-3.0.6
Minimal reproduce step
Stop a bookie in the cluster.
Restart the broker.
Restart the stopped bookie.
The bookie's rack information is now lost: it has become /defaultRegion/defaultRack.
What did you expect to see?
..
What did you see instead?
After upgrading to pulsar-3.0.6, we observed that when a bookie restarts, the rack information of some bookies becomes /defaultRegion/defaultRack, which is incorrect.
After digging into the code and the error logs, this issue appears to be caused by this PR: https://github.com/apache/pulsar/pull/22846. That PR made BookieRackAffinityMapping#watchAvailableBookies asynchronous. However, I think this operation cannot be asynchronous.
Let's look at what happens when the BookKeeper client is constructed in Pulsar. See the code at https://github.com/apache/bookkeeper/blob/1f1df813b9b4efd410925caadfa45cfb17b811ba/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/BookKeeper.java#L409-L548
When the metadataStore delivers a notification that a bookie was created, execution enters this code block, which runs the first listener and then the second listener: https://github.com/apache/pulsar/blob/a8ae3e4d191c75f291ccb29577c181926a5f4e5d/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/bookkeeper/PulsarRegistrationClient.java#L221-L233
When the second listener runs placementPolicy.onClusterChanged(), it eventually reaches the code below and calls resolver.resolve(names). The resolver implementation here is BookieRackAffinityMapping#resolve. https://github.com/apache/bookkeeper/blob/1f1df813b9b4efd410925caadfa45cfb17b811ba/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/TopologyAwareEnsemblePlacementPolicy.java#L554-L585
Therefore, the second listener actually depends on the first one: resolve can only return the correct rack after the first listener has updated the rack mapping for the bookie. The two listeners must be executed synchronously, in order.
But the first listener now runs asynchronously. So when a bookie restarts, the broker can permanently lose the rack information of this bookie, which causes serious problems.
We added a log statement in BookieRackAffinityMapping#updateRacksWithHost and confirmed that the problem occurs whenever the asynchronous code runs after the second listener.
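To make the ordering problem easier to see, here is a minimal, self-contained sketch of the race. It is not the Pulsar or BookKeeper code: `rackCache`, `resolved`, `firstListener`, `secondListener`, `notifyBookieCreated`, the bookie address and the rack `/region-a/rack-1` are all hypothetical stand-ins. The only assumption is that the first listener fills the rack mapping and the second listener resolves the rack once and keeps the result. The "async" case is simulated by deferring the first listener until after the second one has run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

class ListenerOrderSketch {
    static final String DEFAULT_RACK = "/defaultRegion/defaultRack";

    // Hypothetical stand-in for the rack mapping kept by the resolver.
    static final Map<String, String> rackCache = new ConcurrentHashMap<>();
    // Hypothetical stand-in for the rack remembered by the placement policy.
    static final Map<String, String> resolved = new ConcurrentHashMap<>();

    // First listener: learns the rack of the newly registered bookie.
    static void firstListener(String bookie) {
        rackCache.put(bookie, "/region-a/rack-1");
    }

    // Second listener: resolves the rack once and keeps the answer,
    // falling back to the default rack if the mapping is not there yet.
    static void secondListener(String bookie) {
        resolved.put(bookie, rackCache.getOrDefault(bookie, DEFAULT_RACK));
    }

    // Simulates one "bookie created" notification and returns the rack
    // that the second listener ended up with.
    static String notifyBookieCreated(String bookie, boolean async) {
        rackCache.clear();
        resolved.clear();
        List<Runnable> deferred = new ArrayList<>();
        // Sync: run the task inline. Async: queue it so it runs later.
        Executor executor = async ? deferred::add : Runnable::run;

        executor.execute(() -> firstListener(bookie));
        secondListener(bookie);
        // The deferred task runs only now, too late for the second listener.
        deferred.forEach(Runnable::run);
        return resolved.get(bookie);
    }

    public static void main(String[] args) {
        System.out.println("sync:  " + notifyBookieCreated("bookie-1:3181", false));
        System.out.println("async: " + notifyBookieCreated("bookie-1:3181", true));
    }
}
```

In this sketch the synchronous run resolves `/region-a/rack-1`, while the deferred run resolves `/defaultRegion/defaultRack` and never corrects it, because nothing triggers the second listener again after the mapping is finally updated. That matches the symptom described above, under the stated assumptions.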
Anything else?
pulsar-2.9 does not have this issue.
Are you willing to submit a PR?