apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

Size of replication backlog becomes very large #6438

Closed massakam closed 1 year ago

massakam commented 4 years ago

Recently, the number of messages in the replication backlog for a particular topic has become very large.

[Image: replication_backlog]

This topic is replicated across two clusters, and all producers and consumers are connected to only one of them. The strange thing is that the replication backlog is larger in the cluster to which no producers or consumers are connected. The following are the stats of the topic in that cluster.

{
  "msgRateIn" : 1410.798423526815,
  "msgThroughputIn" : 556605.2280307647,
  "msgRateOut" : 0.0,
  "msgThroughputOut" : 0.0,
  "averageMsgSize" : 394.5320740005671,
  "storageSize" : 2455313235,
  "publishers" : [ ],
  "subscriptions" : { },
  "replication" : {
    "jp-west" : {
      "msgRateIn" : 1410.798423526815,
      "msgThroughputIn" : 556605.2280307647,
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateExpired" : 0.0,
      "replicationBacklog" : 6258001,
      "connected" : false,
      "replicationDelayInSeconds" : 0,
      "inboundConnection" : "/xxx.xxx.xxx.xxx:40710",
      "inboundConnectedSince" : "2020-01-08T01:38:23.565+09:00"
    }
  },
  "deduplicationStatus" : "Disabled"
}

Notable is the "connected": false part. Since this topic is not active (no producer or consumer) in this cluster, it seems that the replicator has been closed by topic GC.

I think the cause of this issue is that the replicator throttles reading entries while the producer for geo-replication is closed. If the publish rate is high, the replicator's reads cannot keep up with incoming messages and the replication backlog keeps growing. https://github.com/apache/pulsar/blob/v2.3.2/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentReplicator.java#L155-L162

It is reasonable to throttle reading of messages published to the local cluster while the producer for geo-replication is closed. However, there is no need to throttle reading messages replicated from other clusters. The replicator discards these messages and does not send them using the producer. https://github.com/apache/pulsar/blob/v2.3.2/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentReplicator.java#L226-L232
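
To make the reasoning above concrete, here is a toy model of the backlog, not Pulsar code. The class `ReplicatorBacklogModel`, the method `backlogAfter`, and all of the rates are hypothetical. It assumes that every incoming entry was replicated from the other cluster, and that a replicator with a closed producer may read only a fixed, small number of entries per tick.

```java
// Toy model only: none of these names exist in Pulsar.
class ReplicatorBacklogModel {

    /**
     * Returns the backlog (in entries) after the given number of ticks.
     *
     * @param ticks          number of simulated intervals
     * @param inPerTick      entries arriving per tick, all replicated from the remote cluster
     * @param throttledReads entries the replicator may read per tick while its producer is closed
     * @param throttleRemote true models the behaviour described in this issue (every read is throttled);
     *                       false models the suggestion (remote-origin entries are skipped unthrottled)
     */
    static long backlogAfter(int ticks, long inPerTick, long throttledReads, boolean throttleRemote) {
        long backlog = 0;
        for (int i = 0; i < ticks; i++) {
            backlog += inPerTick;
            // Remote-origin entries are only discarded, never sent,
            // so skipping them does not require a connected producer.
            long read = throttleRemote ? Math.min(backlog, throttledReads) : backlog;
            backlog -= read;
        }
        return backlog;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 1400 entries per tick in, 100 entries per tick read while throttled.
        System.out.println("throttled:   " + backlogAfter(60, 1400, 100, true));
        System.out.println("unthrottled: " + backlogAfter(60, 1400, 100, false));
    }
}
```

In this model the throttled case grows by the difference between the incoming rate and the throttled read rate on every tick, while the unthrottled case stays at zero. A real replicator cannot know an entry's origin before reading it, so this only illustrates why the backlog grows, not how a fix would be implemented.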

tisonkun commented 1 year ago

Closed as stale. Please create a new issue if it's still relevant to the maintained versions.

VijayRohra commented 1 year ago

I have a two-cluster setup with bi-directional geo-replication enabled, and I am facing the same issue described above. Is there any update on this?