apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

[Bug] A large backlog of Key_Shared subscription messages will result in fullgc and OOM #21045

Open jdfrozen opened 1 year ago

jdfrozen commented 1 year ago

Search before asking

Version

2.7.x

Minimal reproduce step

1. A large backlog of Key_Shared subscription messages
2. The subscription has multiple consumers

What did you expect to see?

The broker keeps functioning normally.

What did you see instead?

1. The broker runs GC frequently
2. The broker runs full GC
3. The broker hits OOM

This is the broker's GC monitoring image.

We added the JVM startup option "-XX:+HeapDumpOnOutOfMemoryError". When the full GC occurred, we analyzed the heap dump with MAT (Eclipse Memory Analyzer).

image

Anything else?

Root cause: redeliveryMessages contains a large number of messages

PersistentStickyKeyDispatcherMultipleConsumers.java

@Override
protected synchronized Set<PositionImpl> getMessagesToReplayNow(int maxMessagesToRead) {
    if (isDispatcherStuckOnReplays) {
        // If we're stuck on replay, we want to move forward reading on the topic (until the overall max-unacked
        // messages kicks in), instead of keep replaying the same old messages, since the consumer that these
        // messages are routing to might be busy at the moment
        this.isDispatcherStuckOnReplays = false;
        return Collections.emptySet();
    } else {
        return super.getMessagesToReplayNow(maxMessagesToRead);
    }
}

Are you willing to submit a PR?

jdfrozen commented 1 year ago

So when this Key_Shared subscription has a lot of consumers and some of them are slow, the dispatcher keeps reading messages and finds that the stickyKeyHash of some of them belongs to a slow consumer. These messages cannot be delivered, so they are added to MessagetoReplay, and with a large backlog this causes the problem.
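The growth described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Pulsar's dispatcher code: the class `ReplayGrowthSketch`, the method `simulateReplaySize`, the modulo-based key routing, and the assumption that fast consumers ack instantly while the slow consumer never acks are all hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy model of a Key_Shared dispatcher. Hypothetical names; NOT Pulsar's real code. */
final class ReplayGrowthSketch {

    /**
     * Reads `backlog` messages and returns how many positions end up parked for replay.
     * Assumption: fast consumers ack immediately; the slow consumer never acks.
     */
    static int simulateReplaySize(int backlog, int consumerCount, int slowConsumer, int permits) {
        int[] availablePermits = new int[consumerCount];
        Arrays.fill(availablePermits, permits);
        TreeSet<Long> replaySet = new TreeSet<>();

        for (long position = 0; position < backlog; position++) {
            // Assumption: the sticky key hash is derived from the position, and the
            // owning consumer is simply hash modulo the number of consumers.
            int stickyKeyHash = (int) (position % 65_536);
            int target = stickyKeyHash % consumerCount;

            if (target != slowConsumer) {
                continue; // delivered and acked right away; nothing is retained
            }
            if (availablePermits[target] > 0) {
                availablePermits[target]--; // delivered to the slow consumer, never acked
            } else {
                replaySet.add(position); // cannot be delivered: parked for replay
            }
        }
        return replaySet.size();
    }

    public static void main(String[] args) {
        for (int backlog : new int[] {10_000, 100_000, 1_000_000}) {
            int parked = simulateReplaySize(backlog, 4, 0, 1_000);
            System.out.println(backlog + " messages read, " + parked + " positions parked for replay");
        }
    }
}
```

In this model the parked set grows linearly with the backlog once the slow consumer's permits are exhausted, which matches the shape of the problem reported here; the real broker's behaviour depends on details this sketch leaves out.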

jdfrozen commented 1 year ago

Add the JVM startup option "-XX:+HeapDumpBeforeFullGC" to get a heap dump before each full GC.

mattisonchao commented 1 year ago

Key_Shared is a fairly strict subscription type. It is very sensitive to the consumption (acknowledgement) rate, because it has to preserve message order. When consumers are added to the subscription, the key hash assignment has to be recalculated, and the broker has to keep the positions of some new messages in memory to avoid breaking delivery order (a key is delivered to only one consumer at a time).

Therefore, this is expected behaviour. You can check why some of your consumers can't catch up, or consider whether you can use another subscription mode such as Shared.

But anyway, you are right. We should have a limit on this container's memory usage so that one topic cannot affect the whole broker. :)
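One way to picture such a limit is a replay container that refuses new entries once it reaches a cap, so the caller can stop reading ahead until entries are removed. This is a minimal sketch of the idea only: `BoundedReplaySet`, `tryAdd`, and the use of a plain `long` as the position are hypothetical and do not reflect how Pulsar implements or later fixed this.

```java
import java.util.TreeSet;

/** Hypothetical capped replay container; an illustration, not Pulsar's implementation. */
final class BoundedReplaySet {
    private final int maxEntries;
    private final TreeSet<Long> positions = new TreeSet<>();

    BoundedReplaySet(int maxEntries) {
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        this.maxEntries = maxEntries;
    }

    /** Returns false when the cap is reached; the caller should then pause reading. */
    synchronized boolean tryAdd(long position) {
        if (positions.contains(position)) {
            return true; // already tracked, no extra memory needed
        }
        if (positions.size() >= maxEntries) {
            return false; // full: signal back-pressure instead of growing
        }
        positions.add(position);
        return true;
    }

    /** Called once a position has been delivered and acknowledged. */
    synchronized boolean remove(long position) {
        return positions.remove(position);
    }

    synchronized int size() {
        return positions.size();
    }

    public static void main(String[] args) {
        BoundedReplaySet set = new BoundedReplaySet(2);
        System.out.println(set.tryAdd(10L)); // accepted
        System.out.println(set.tryAdd(11L)); // accepted
        System.out.println(set.tryAdd(12L)); // rejected, cap reached
    }
}
```

The trade-off in this design is that a full container stalls reading for the whole subscription, so the cap bounds broker memory at the cost of throughput for the affected topic.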

jdfrozen commented 1 year ago

I tested setting set-max-unacked-messages-per-subscription to a value as small as 1000, and this avoids the full GC. For the test I used the namespace-level policy and checked it with "pulsar-admin namespaces get-max-unacked-messages-per-subscription". We would like to set this as a topic-level policy instead. We are using version 2.7.4. Is the topic-level policy stable enough?

mattisonchao commented 1 year ago

Hi, @jdfrozen
2.7.x is a kinda old version, I am unsure if it can work properly. But you can give it a try. :)

github-actions[bot] commented 1 year ago

The issue had no activity for 30 days, mark with Stale label.

lhotari commented 2 months ago

One of the root causes behind this issue is described in #23200. It's addressed by #23231 and #23226. I believe that the OOM issue got mitigated already by #17804.