Closed lhotari closed 1 month ago
@lhotari
Add a new metric pulsar_replicated_subscriptions_snapshot_timeouts which is a counter (that only resets when the broker restarts).
Agree with you
@lhotari I have opened a PR in my forked repo. Could you take a look?
@nikam14 looks good. Please go ahead and create an apache/pulsar PR. Please fill in the details in the description and name the PR properly too. The contribution guide contains advice for anything the PR template doesn't explain. For metrics, documentation will also need to be added to the pulsar-site repository. You can usually also get help with anything related to contributions on the #dev channel of the Apache Pulsar Slack.
Search before asking
Motivation
Snapshot creation for geo-replication replicated subscriptions (PIP-33) might time out. The code contains a debug log message when this happens: https://github.com/apache/pulsar/blob/465fac523da946553b09298e13dc7dfcecfb6c78/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/ReplicatedSubscriptionsController.java#L256 When a snapshot times out, the subscription state won't be reflected on the remote side and a backlog will build up. There's no metric to detect this situation.
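To illustrate how small the change could be, here is a minimal sketch of a counter that the timeout path could bump next to the existing debug log. The class and method names (SnapshotTimeoutStats, recordSnapshotTimeout, getSnapshotTimeouts) are hypothetical and not taken from the Pulsar code base; wiring the value into the broker's metrics output is left out.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper for illustration only; not part of the Pulsar code base.
final class SnapshotTimeoutStats {
    // Monotonically increasing counter. It lives in memory only,
    // so it starts again from zero when the broker process restarts.
    private final LongAdder snapshotTimeouts = new LongAdder();

    // Assumed to be called from the code path that currently only
    // writes the debug log message when a snapshot times out.
    void recordSnapshotTimeout() {
        snapshotTimeouts.increment();
    }

    // Current total, i.e. the value a metric such as
    // pulsar_replicated_subscriptions_snapshot_timeouts would report.
    long getSnapshotTimeouts() {
        return snapshotTimeouts.sum();
    }
}
```

LongAdder is used here because increments may come from several threads and the value is read far less often than it is written; an AtomicLong would work as well.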
Solution
Add a new metric
pulsar_replicated_subscriptions_snapshot_timeouts
which is a counter (that only resets when the broker restarts).
Alternatives
No response
Anything else?
Increasing the timeout threshold from
replicatedSubscriptionsSnapshotTimeoutSeconds=30
to
replicatedSubscriptionsSnapshotTimeoutSeconds=60
could help resolve the situation. This metric would help detect when that is necessary.
Are you willing to submit a PR?