Closed Shawyeok closed 9 months ago
Hello, I have also been learning Pulsar recently. Have you found a solution to this problem?
Sorry for the late reply.
The current publish rate limiting is based on setting the `autoRead` flag of the Netty channel to false. However, `autoRead` is just a hint in Netty, not a guarantee (if anyone is familiar with this, a detailed explanation would be great). As a result, topics that have already been rate-limited may call the `isTopicPublishRateExceeded` method multiple times before the next rate-limiting period arrives, which is unnecessary. So an optimization I'm thinking of is to set a flag after a topic triggers rate limiting, to avoid checking through the `isTopicPublishRateExceeded` method each time whether the topic has already been rate-limited.
Of course, we can also optimize the implementation of `org.apache.pulsar.common.util.RateLimiter`. Perhaps an implementation based on compare-and-swap (CAS) would perform better than `synchronized`. All of these ideas need to be validated through experimentation.
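The CAS idea could be sketched roughly as follows. This is illustrative only, not Pulsar's actual `RateLimiter`: the class and method names are made up, and refill is simplified to a periodic reset so the non-blocking acquire path stays clear.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical lock-free token bucket: tryAcquire uses a CAS retry loop
// instead of a synchronized block, so producer threads never block each
// other while consuming permits.
class CasTokenBucket {
    private final long capacity;
    private final AtomicLong tokens;

    CasTokenBucket(long capacity) {
        this.capacity = capacity;
        this.tokens = new AtomicLong(capacity);
    }

    /** Try to consume {@code permits} tokens without blocking or locking. */
    boolean tryAcquire(long permits) {
        while (true) {
            long current = tokens.get();
            if (current < permits) {
                return false;                 // rate limit exceeded
            }
            if (tokens.compareAndSet(current, current - permits)) {
                return true;                  // consumed atomically
            }
            // CAS lost a race with another thread; re-read and retry
        }
    }

    /** Would be called by a scheduled task once per rate-limit period. */
    void refill() {
        tokens.set(capacity);
    }
}
```

Whether this actually beats `synchronized` under the broker's thread counts is exactly the kind of thing that needs benchmarking.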
There is some related discussion on the mailing list:
Sorry, I am a beginner with Pulsar and there is something I don't understand; can you help me? The reporter said that the write efficiency of other topics was affected, yet `preciseTopicPublishRateLimiterEnable` applies to the current topic. Why can it affect other topics? I can't find the answer in your reply; could you explain?
Internally, Pulsar has multiple topics that depend on the same threads for message writing (such as Netty's EventLoop and the OrderedExecutor). If these threads hit a performance bottleneck for any reason, such as the lock contention observed in this case, then even topics that are not directly causing the contention will be affected, as long as they are bound to the same threads.
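The shared-thread effect can be shown with a toy example (plain Java, not Pulsar code): a single-threaded executor stands in for a Netty EventLoop, and a slow task (the contended topic) delays an unrelated fast task (another topic bound to the same thread).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy demonstration of head-of-line blocking on a shared write thread.
class SharedThreadDemo {
    /** Returns how long "topic B"'s trivial task waited, in milliseconds. */
    static long delayedByMillis() {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        try {
            // "Topic A" holds the thread, e.g. stuck on a contended lock.
            eventLoop.submit(() -> {
                try { Thread.sleep(200); } catch (InterruptedException ignored) { }
            });
            long submitted = System.nanoTime();
            // "Topic B" just wants to write one message, but must wait.
            eventLoop.submit(() -> { }).get();
            return (System.nanoTime() - submitted) / 1_000_000;
        } catch (Exception e) {
            return -1;
        } finally {
            eventLoop.shutdownNow();
        }
    }
}
```

Topic B's latency here is roughly topic A's 200 ms stall, even though topic B did nothing wrong.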
@Shawyeok I'm working on a non-blocking implementation that uses CAS. Please check https://github.com/lhotari/async-tokenbucket for the PoC and performance test.
In addition, I'll be fixing the way the `autoRead` flag is toggled. (WIP at https://github.com/lhotari/pulsar/blob/lh-rate-limiter-improvement/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ThrottleTracker.java )
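The throttle-tracking idea can be sketched as follows. This is a hypothetical reduction, not the actual `ThrottleTracker` code: it counts independent throttle reasons and flips `autoRead` only on 0 ↔ 1 transitions, so repeated rate-limit checks never re-toggle the channel. The `Consumer<Boolean>` stands in for Netty's `channel.config().setAutoRead(...)`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch: toggle autoRead only when the number of active
// throttle reasons crosses zero, instead of on every rate-limit check.
class ThrottleTrackerSketch {
    private final AtomicInteger throttleCount = new AtomicInteger();
    private final Consumer<Boolean> setAutoRead;

    ThrottleTrackerSketch(Consumer<Boolean> setAutoRead) {
        this.setAutoRead = setAutoRead;
    }

    void incrementThrottleCount() {
        if (throttleCount.incrementAndGet() == 1) {
            setAutoRead.accept(false);   // first throttler pauses reads
        }
    }

    void decrementThrottleCount() {
        if (throttleCount.decrementAndGet() == 0) {
            setAutoRead.accept(true);    // last throttler resumes reads
        }
    }
}
```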
@1625567290 Just curious to learn about your motivation to start learning Pulsar here. This might not be the easiest way to learn Pulsar. :)
@lhotari Cool! I will study and verify your solution as soon as possible after my vacation, which may be a few days.
Sounds good. Please check the blog post https://codingthestreams.com/pulsar/2023/11/22/pulsar-slos-and-rate-limiting.html for the full context.
@Shawyeok There's now "PIP-322: Pulsar Rate Limiting Refactoring" (#21680) to address this issue. The changes are in the draft PR #21681 . Please review! Thanks
Search before asking
After enabling the `preciseTopicPublishRateLimiterEnable` configuration in `broker.conf`, if one topic on the broker exceeds the publish rate limit (for example, 1 MB/s), it causes lock contention, leading to an increase in the 99th-percentile write latency for other topics that have not reached the rate limit. The profile shows that there is lock contention inside the method `org.apache.pulsar.broker.service.PrecisPublishLimiter#tryAcquire`.
pulsar-3.0.1.lock.profile.008.html.zip
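As a toy model of the reported bottleneck (not Pulsar's actual limiter code), every producer thread funneling through one `synchronized` acquire method serializes unrelated topics behind a single monitor:

```java
// Minimal illustration of the contention pattern: all callers, regardless
// of which topic they serve, must take the same intrinsic lock here.
class SynchronizedLimiter {
    private long permits;

    SynchronizedLimiter(long permits) {
        this.permits = permits;
    }

    synchronized boolean tryAcquire(long n) {
        if (permits < n) {
            return false;   // limit exceeded; caller will throttle
        }
        permits -= n;
        return true;
    }
}
```

Under many producer threads, time spent waiting on this monitor shows up as tail latency even for topics that never exhaust their permits.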
Minimal reproduce step
Noticed in a fork of 2.8.1, reproduced in 3.0.1 on Linux CentOS 7.9 (kernel 3.10.0-1160.99.1.el7.x86_64).
What did you expect to see?
A topic that exceeds the publish rate limit should not affect the write latency of other topics.
What did you see instead?
The write latency of other topics also increased.
Anything else?
No response
Are you willing to submit a PR?