apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

Handle already-connected replication producer with same name #5104

Closed: codelipenghui closed this 10 months ago

codelipenghui commented 5 years ago

Describe the bug: The broker continuously prints the log shown in the attached image.

To Reproduce: Well, it's hard to reproduce.

Expected behavior: Stop creating a replication producer when a replication producer with the same name is already connected.

f-ld commented 5 years ago

I am facing the same issue.

I have 5 datacenters, each with a Pulsar cluster. That makes 5 Pulsar clusters (named 'region1' ... 'region5'), with geo-replication for some namespaces.
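For context, a geo-replicated setup like this is wired together by registering each remote cluster and then whitelisting the clusters on the namespace. A minimal sketch with hypothetical hostnames (the real service URLs are deployment-specific):

# Register each remote cluster (repeat for region2..region5; hostnames are hypothetical):
pulsar-admin clusters create region2 \
  --url http://broker.region2.example:8080 \
  --broker-url pulsar://broker.region2.example:6650
# Then enable geo-replication for the namespace across all clusters:
pulsar-admin namespaces set-clusters tenant/namespace \
  --clusters region1,region2,region3,region4,region5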

Right now, replication is having issues for that namespace:

Here is what I can see looking at pulsar-admin topics partitioned-stats --per-partition tenant/namespace/topic, for example, for one partition in region1:

{
   ...
  "partitions": {
    "persistent://tenant/namespace/topic-partition-1" : {
      "replication" : {
        "region2" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.11.1.120:42292",
          "inboundConnectedSince" : "2019-10-15T08:56:40.286Z",
          "outboundConnection" : "[id: 0x680b115b, L:/10.10.3.46:39946 - R:10.11.1.120/10.11.1.120:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:03.798Z"
        },
        "region3" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.01666736530872926,
          "replicationBacklog" : 0,
          "connected" : false,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.12.3.213:43300",
          "inboundConnectedSince" : "2019-10-15T08:47:29.728Z"
        },
        "region4" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.13.3.30:45422",
          "inboundConnectedSince" : "2019-10-15T08:56:41.039Z",
          "outboundConnection" : "[id: 0x27db1c8d, L:/10.10.3.46:56448 - R:10.13.3.30/10.13.3.30:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:04.282Z"
        },
        "region5" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.14.1.64:36956",
          "inboundConnectedSince" : "2019-10-15T08:17:24.712Z",
          "outboundConnection" : "[id: 0x674e53e5, L:/10.10.3.46:54772 - R:10.14.0.208/10.14.0.208:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:03.626Z"
        }
      },
      "deduplicationStatus" : "Disabled"
    }
  }
}

As we can see, there is no outbound connection for region3, hence no replication from region1 to region3; this is the connection that is being attempted but keeps failing with the logs above. There is, however, an inbound connection from all regions, which is why region3 and region5 cannot reopen theirs (region3 and region5 have no outbound connections for that topic).

Still digging, but I think there is a bug around here.
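One quick way to spot this kind of asymmetry is to ask each region's admin endpoint for the same topic's stats and grep the connection fields, rather than logging into each cluster. A rough sketch, with hypothetical per-region admin URLs:

#!/bin/bash
# Compare replication connection state for one topic across all clusters.
# The per-region admin URLs are hypothetical placeholders.
TOPIC="persistent://tenant/namespace/topic"
for region in region1 region2 region3 region4 region5; do
  echo "== $region =="
  pulsar-admin --admin-url "http://broker.$region.example:8080" \
    topics partitioned-stats --per-partition "$TOPIC" \
    | grep -E '"connected"|Connection|ConnectedSince'
done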

f-ld commented 5 years ago

Additional information to understand the above logs and explanations.

IPs per region:

We have 5 brokers per region and, for that topic, 12 partitions.

Regarding the partitioned-stats from the previous message and the inbound connection from region3 to region1, I can see it on the broker in region1:

tcp        0      0 10.10.3.46:6650         10.12.3.213:43300       ESTABLISHED 11/java             

and on the broker in region3:

tcp        0      0 10.12.3.213:43300       10.10.3.46:6650         ESTABLISHED 11/java             

So that inbound connection is actually real.

And for that specific partition, on that specific broker in region3, I have these logs:

10:21:58.694 [pulsar-io-22-13] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-1] [pulsar.repl.region3] Creating producer on cnx [id: 0x581a0460, L:/10.12.3.213:43308 - R:10.10.3.46/10.10.3.46:6650]
10:21:58.893 [pulsar-io-22-2] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x581a0460, L:/10.12.3.213:43308 - R:10.10.3.46/10.10.3.46:6650] Received error from server: Producer with name 'pulsar.repl.region3' is already connected to topic
10:21:58.893 [pulsar-io-22-2] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-1] [pulsar.repl.region3] Failed to create producer: Producer with name 'pulsar.repl.region3' is already connected to topic

And in region1 I have those logs:

10:21:58.785 [pulsar-io-22-14] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308][persistent://tenant/namespace/topic-partition-1] Creating producer. producerId=22291
10:21:58.785 [ForkJoinPool.commonPool-worker-4] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308]-22291 persistent://tenant/namespace/topic-partition-1 configured with schema false
10:21:58.785 [ForkJoinPool.commonPool-worker-4] ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308] Failed to add producer to topic persistent://tenant/namespace/topic-partition-1: Producer with name 'pulsar.repl.region3' is already connected to topic

(I picked those log lines as an example because they involve the same brokers as the existing replication connection reported by partitioned-stats, but similar logs are available for all 12 partitions on all brokers of both regions.)

Checking partitioned stats on region3 for that partition:

      "replication" : {
        "region1" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 4.553733596940979E-5,
          "replicationBacklog" : 0,
          "connected" : false,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.10.0.23:57932",
          "inboundConnectedSince" : "2019-10-15T08:48:49.659Z"
        },
        "region2" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.11.0.182:45338",
          "inboundConnectedSince" : "2019-10-15T08:28:14.815Z",
          "outboundConnection" : "[id: 0xba518cf3, L:/10.12.1.15:37322 - R:10.11.1.120/10.11.1.120:6650]",
          "outboundConnectedSince" : "2019-10-15T09:19:45.524Z"
        },
        "region4" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.13.3.30:33686",
          "inboundConnectedSince" : "2019-10-15T08:56:40.66Z",
          "outboundConnection" : "[id: 0x0e8f64e2, L:/10.12.1.15:41864 - R:10.13.3.30/10.13.3.30:6650]",
          "outboundConnectedSince" : "2019-10-15T09:19:45.355Z"
        },
        "region5" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.14.1.64:56096",
          "inboundConnectedSince" : "2019-10-15T08:18:27.448Z",
          "outboundConnection" : "[id: 0x885ed221, L:/10.12.1.15:53210 - R:10.14.0.208/10.14.0.208:6650]",
          "outboundConnectedSince" : "2019-10-15T09:19:45.754Z"
        }
      },
      "deduplicationStatus" : "Disabled"
    }

Indeed, there is no outbound connection from region3 to region1.

So it looks like the broker in region3 has lost track of the existing connection to the broker in region1 (indeed, it does not appear in region3's partitioned stats), tries to open it again, but fails because the broker in region1 still has it.

Unfortunately, I do not have historical logs to check whether at some point the broker in region3 tried to drop the connection to region1 but failed, keeping the TCP connection open while losing its record of that outbound connection.
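For anyone wanting to run the same check: the mismatch can be confirmed by comparing the broker's view (partitioned-stats) with the kernel's view of the sockets. A sketch, using the IPs from the stats above:

# On the region3 broker host: list established connections to region1's broker.
ss -tn state established '( dport = :6650 )' | grep 10.10.3.46
# A socket showing up here while region3's partitioned-stats report
# "connected" : false for region1 means the broker lost track of its
# replicator producer on a still-live connection, matching the hypothesis.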

f-ld commented 5 years ago

In my previous messages I mixed a few things up (because I am discovering things as I go).

This:

10:21:58.694 [pulsar-io-22-13] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-1] [pulsar.repl.region3] Creating producer on cnx [id: 0x581a0460, L:/10.12.3.213:43308 - R:10.10.3.46/10.10.3.46:6650]

does not mean that the broker cannot see the connection; it means that the broker is creating the producer on a connection taken from its pool. So the connection exists (and this is what we see on both brokers).

I also mixed up two connections, on ports 43300 and 43308. Indeed, there is another connection between the two brokers above:

So now I am trying to find out how I can see/reset producers on the broker side, what exactly happened to get into that inconsistent state, and whether there is a configuration that would help avoid it.
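For anyone hitting the same state: one thing that should reset producers on the broker side is unloading the topic (or the namespace), since an unload closes the topic on its owning broker, replicators included, and they get recreated on the next load. A sketch; note that this briefly disconnects clients, so whether it is acceptable depends on your traffic:

# Unload one partition to force its owning broker (region1 here) to close
# and re-create its replicator producers:
pulsar-admin topics unload persistent://tenant/namespace/topic-partition-1
# Or unload the whole namespace to cycle every topic in it:
pulsar-admin namespaces unload tenant/namespace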

f-ld commented 5 years ago

Could it be linked to the fact that, after creating the clusters and brokers, when I created the tenant/namespace namespace, I already had some clients trying to produce to / consume from persistent://tenant/namespace/topic? The topic would then have been auto-created as non-partitioned, and when I created it as a partitioned topic a few (milli)seconds later, that could have left some bad state inside the brokers. (A way to check for this is sketched after the commands below.)

This is basically what I did from region1 after creating all my brokers (with consumers, for sure, and maybe producers, already trying to connect to that topic):

pulsar-admin namespaces create tenant/namespace --bundles 12
pulsar-admin namespaces set-clusters tenant/namespace --clusters region1,region2,region3,region4,region5
pulsar-admin namespaces set-message-ttl tenant/namespace --messageTTL 60
pulsar-admin namespaces set-retention tenant/namespace --time 0 --size -1
pulsar-admin topics create-partitioned-topic persistent://tenant/namespace/topic -p 12
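If that race happened, a stray auto-created non-partitioned topic could be coexisting with the partitioned-topic metadata. A way to check for it, plus the broker.conf knob that would rule the race out on a fresh setup:

# A bare persistent://tenant/namespace/topic in the first list, alongside the
# same name in the partitioned list, would point at an auto-created
# non-partitioned topic:
pulsar-admin topics list tenant/namespace
pulsar-admin topics list-partitioned-topics tenant/namespace

# Broker-side setting (broker.conf) to prevent topics from being auto-created
# before the partitioned topic exists:
#   allowAutoTopicCreation=false
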
Technoboy- commented 10 months ago

https://github.com/apache/pulsar/pull/21155 and https://github.com/apache/pulsar/pull/21144 have fixed this issue.