apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
13.94k stars 3.53k forks source link

[fix][broker] Ensure that PulsarService is ready for serving incoming requests #22977

Closed lhotari closed 6 days ago

lhotari commented 1 week ago

Fixes #22975

Motivation

Pulsar broker will start serving requests while the broker is starting. This can cause issues and bugs which are hard to reproduce. Serving requests should be delayed until the broker is ready. The brokerId bug #22975 is just one example of problems that this PR will address.

Modifications

Documentation

dao-jun commented 1 week ago

Is this possible for us to start WebService/WebSocketService/BrokerService/ProtocolHander after the other Pulsar Services start?

lhotari commented 1 week ago

Is this possible for us to start WebService/WebSocketService/BrokerService/ProtocolHander after the other Pulsar Services start?

Yes, it makes sense to cover all cases in a way so that requests aren't handled before PulsarService has been started.

lhotari commented 1 week ago

Is this possible for us to start WebService/WebSocketService/BrokerService/ProtocolHander after the other Pulsar Services start?

Yes, it makes sense to cover all cases in a way so that requests aren't handled before PulsarService has been started.

@dao-jun

I renamed the concept to mean that the broker is ready to serve request instead of being fully started. I believe that this covers what is really needed and the original intention of this PR.

I checked the current solution and this is sufficiently covered already for the Pulsar broker with various configurations. There's no need to separately handle websockets since websocket servlets get added to the same Jetty server which already contains the filter to wait until PulsarService is ready for incoming requests.

ProtocolHandlers don't need any special handling since they get started as the last step in PulsarService.

dao-jun commented 1 week ago

@lhotari Are we able to move brokerService.start/webService.start/prototolHandler.start to the last of the method? It should be more elegant.

lhotari commented 1 week ago

@lhotari Are we able to move brokerService.start/webService.start/prototolHandler.start to the last of the method? It should be more elegant.

@dao-jun Elegant? I guess you mean more simple? I don't think that it is possible. For example, brokerId contains the actual port where the server socket gets bound to. For tests we use the port 0 to bind to a dynamic port and that's only available after the server socket has been binded. It's not only the brokerId, but also the service addresses of the broker that are resolved after the server socket has been binded. The alternative would be to restructure the code to just do the binding and delay the remaining parts of the Netty server initialization to the last step so that incoming requests aren't served before the broker is ready to accept requests. That wouldn't be a simple change.

The approach in this PR is fairly simple and I think that it's a minimal change to address the problem so that requests aren't served before the broker is ready to serve them. What's not elegant? :)

dao-jun commented 1 week ago

The alternative would be to restructure the code to just do the binding and delay the remaining parts of the Netty server initialization to the last step so that incoming requests aren't served before the broker is ready to accept requests. That wouldn't be a simple change.

Yes, I do prefer this way, but since it's not easy to do, this PR is the most simple way.

lhotari commented 1 week ago

@heesung-sn It looks like ExtensibleLoadManagerImpl accesses the broker itself before it is ready to serve requests. One possibility would be to refactor the logic to be asynchronous. I had to add a workaround for ExtensibleLoadManager in this commit: 18c97693488980ba8217fc78ef2910ad4800983b. There's a high chance to bugs unless this is addressed in ExtensibleLoadManagerImpl. I don't think that the broker should serve any incoming requests before the broker is in certain state where it's "ready". @heesung-sn would it be possible to make ExtensibleLoadManager initialization asynchronous so that it doesn't block the the startup sequence. with the changes in this PR (when the workaround is removed). It's org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl#start what is the problem.

lhotari commented 1 week ago

Since this PR might be blocked by other problems. There's #22981 to address #22975 alone without fixing the root cause.

lhotari commented 1 week ago

@heesung-sn I pushed an attempt to make this solution work with ExtensibleLoadManagerImpl.

codecov-commenter commented 6 days ago

Codecov Report

Attention: Patch coverage is 77.38095% with 19 lines in your changes missing coverage. Please review.

Project coverage is 73.39%. Comparing base (bbc6224) to head (fc593e7). Report is 424 commits behind head on master.

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/apache/pulsar/pull/22977/graphs/tree.svg?width=650&height=150&src=pr&token=acYqCpsK9J&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)](https://app.codecov.io/gh/apache/pulsar/pull/22977?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) ```diff @@ Coverage Diff @@ ## master #22977 +/- ## ============================================ - Coverage 73.57% 73.39% -0.18% - Complexity 32624 32915 +291 ============================================ Files 1877 1903 +26 Lines 139502 142711 +3209 Branches 15299 15571 +272 ============================================ + Hits 102638 104749 +2111 - Misses 28908 29941 +1033 - Partials 7956 8021 +65 ``` | [Flag](https://app.codecov.io/gh/apache/pulsar/pull/22977/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | | |---|---|---| | [inttests](https://app.codecov.io/gh/apache/pulsar/pull/22977/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `27.73% <72.61%> (+3.14%)` | :arrow_up: | | [systests](https://app.codecov.io/gh/apache/pulsar/pull/22977/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `24.73% <27.38%> (+0.40%)` | :arrow_up: | | [unittests](https://app.codecov.io/gh/apache/pulsar/pull/22977/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `72.43% <77.38%> (-0.42%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files](https://app.codecov.io/gh/apache/pulsar/pull/22977?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | | |---|---|---| | [...n/java/org/apache/pulsar/broker/PulsarService.java](https://app.codecov.io/gh/apache/pulsar/pull/22977?src=pr&el=tree&filepath=pulsar-broker%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fpulsar%2Fbroker%2FPulsarService.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cHVsc2FyLWJyb2tlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcHVsc2FyL2Jyb2tlci9QdWxzYXJTZXJ2aWNlLmphdmE=) | `83.10% <100.00%> (+0.73%)` | :arrow_up: | | [...ache/pulsar/broker/namespace/NamespaceService.java](https://app.codecov.io/gh/apache/pulsar/pull/22977?src=pr&el=tree&filepath=pulsar-broker%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fpulsar%2Fbroker%2Fnamespace%2FNamespaceService.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cHVsc2FyLWJyb2tlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcHVsc2FyL2Jyb2tlci9uYW1lc3BhY2UvTmFtZXNwYWNlU2VydmljZS5qYXZh) | `73.21% <100.00%> (+0.97%)` | :arrow_up: | | [...ulsar/broker/service/PulsarChannelInitializer.java](https://app.codecov.io/gh/apache/pulsar/pull/22977?src=pr&el=tree&filepath=pulsar-broker%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fpulsar%2Fbroker%2Fservice%2FPulsarChannelInitializer.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cHVsc2FyLWJyb2tlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcHVsc2FyL2Jyb2tlci9zZXJ2aWNlL1B1bHNhckNoYW5uZWxJbml0aWFsaXplci5qYXZh) | `95.08% <100.00%> (+0.08%)` | :arrow_up: | | [...va/org/apache/pulsar/broker/service/ServerCnx.java](https://app.codecov.io/gh/apache/pulsar/pull/22977?src=pr&el=tree&filepath=pulsar-broker%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fpulsar%2Fbroker%2Fservice%2FServerCnx.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cHVsc2FyLWJyb2tlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcHVsc2FyL2Jyb2tlci9zZXJ2aWNlL1NlcnZlckNueC5qYXZh) | `72.18% <100.00%> (+0.03%)` | :arrow_up: | | [.../java/org/apache/pulsar/broker/web/WebService.java](https://app.codecov.io/gh/apache/pulsar/pull/22977?src=pr&el=tree&filepath=pulsar-broker%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fpulsar%2Fbroker%2Fweb%2FWebService.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cHVsc2FyLWJyb2tlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcHVsc2FyL2Jyb2tlci93ZWIvV2ViU2VydmljZS5qYXZh) | `90.78% <62.50%> (-3.36%)` | :arrow_down: | | [...dbalance/extensions/ExtensibleLoadManagerImpl.java](https://app.codecov.io/gh/apache/pulsar/pull/22977?src=pr&el=tree&filepath=pulsar-broker%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fpulsar%2Fbroker%2Floadbalance%2Fextensions%2FExtensibleLoadManagerImpl.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-cHVsc2FyLWJyb2tlci9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcHVsc2FyL2Jyb2tlci9sb2FkYmFsYW5jZS9leHRlbnNpb25zL0V4dGVuc2libGVMb2FkTWFuYWdlckltcGwuamF2YQ==) | `80.52% <74.50%> (+0.44%)` | :arrow_up: | ... and [464 files with indirect coverage changes](https://app.codecov.io/gh/apache/pulsar/pull/22977/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)