apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

Handle already-connected replication producer with same name #5104

Closed: codelipenghui closed this 10 months ago

codelipenghui commented 5 years ago

Describe the bug: The broker continuously prints the log shown in the attached image.

To Reproduce: Well, it's hard to reproduce.

Expected behavior: Stop creating a replication producer when a replication producer with the same name is already connected.

f-ld commented 5 years ago

I am facing the same issue.

I have 5 datacenters, each with a Pulsar cluster. That makes 5 Pulsar clusters (named 'region1' ... 'region5'), with geo-replication for some namespaces.
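For context, a geo-replicated setup like this is wired together by registering each remote cluster and then whitelisting the clusters on the namespace. A minimal sketch with hypothetical hostnames (the real service URLs are deployment-specific):

# Register each remote cluster (repeat for region2..region5; hostnames are hypothetical):
pulsar-admin clusters create region2 \
  --url http://broker.region2.example:8080 \
  --broker-url pulsar://broker.region2.example:6650
# Then enable geo-replication for the namespace across all clusters:
pulsar-admin namespaces set-clusters tenant/namespace \
  --clusters region1,region2,region3,region4,region5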

Right now, replication is having issues for that namespace:

Here is what I can see looking at pulsar-admin topics partitioned-stats --per-partition tenant/namespace/topic, for example, for one partition in region1:

{
   ...
  "partitions": {
    "persistent://tenant/namespace/topic-partition-1" : {
      "replication" : {
        "region2" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.11.1.120:42292",
          "inboundConnectedSince" : "2019-10-15T08:56:40.286Z",
          "outboundConnection" : "[id: 0x680b115b, L:/10.10.3.46:39946 - R:10.11.1.120/10.11.1.120:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:03.798Z"
        },
        "region3" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.01666736530872926,
          "replicationBacklog" : 0,
          "connected" : false,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.12.3.213:43300",
          "inboundConnectedSince" : "2019-10-15T08:47:29.728Z"
        },
        "region4" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.13.3.30:45422",
          "inboundConnectedSince" : "2019-10-15T08:56:41.039Z",
          "outboundConnection" : "[id: 0x27db1c8d, L:/10.10.3.46:56448 - R:10.13.3.30/10.13.3.30:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:04.282Z"
        },
        "region5" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.14.1.64:36956",
          "inboundConnectedSince" : "2019-10-15T08:17:24.712Z",
          "outboundConnection" : "[id: 0x674e53e5, L:/10.10.3.46:54772 - R:10.14.0.208/10.14.0.208:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:03.626Z"
        }
      },
      "deduplicationStatus" : "Disabled"
    }
  }
}

As we can see, there is no outbound connection for region3, hence no replication from region1 to region3; this is the connection that is being attempted but keeps failing with the logs above. There is, however, an inbound connection from all regions, which is why region3 and region5 cannot reopen theirs (region3 and region5 have no outbound connections for that topic).

Still digging, but I think there is a bug around here.
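One quick way to spot this kind of asymmetry is to ask each region's admin endpoint for the same topic's stats and grep the connection fields, rather than logging into each cluster. A rough sketch, with hypothetical per-region admin URLs:

#!/bin/bash
# Compare replication connection state for one topic across all clusters.
# The per-region admin URLs are hypothetical placeholders.
TOPIC="persistent://tenant/namespace/topic"
for region in region1 region2 region3 region4 region5; do
  echo "== $region =="
  pulsar-admin --admin-url "http://broker.$region.example:8080" \
    topics partitioned-stats --per-partition "$TOPIC" \
    | grep -E '"connected"|Connection|ConnectedSince'
done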

f-ld commented 5 years ago

Additional information to understand the above logs and explanations.

IPs per region:

We have 5 brokers per region and, for that topic, 12 partitions.

Regarding the partitioned-stats from the previous message and the inbound connection from region3 to region1, I can see it on the broker in region1:

tcp        0      0 10.10.3.46:6650         10.12.3.213:43300       ESTABLISHED 11/java             

and on the broker in region3:

tcp        0      0 10.12.3.213:43300       10.10.3.46:6650         ESTABLISHED 11/java             

So that inbound connection is actually real.

And for that specific partition, on that specific broker in region3, I have these logs:

10:21:58.694 [pulsar-io-22-13] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-1] [pulsar.repl.region3] Creating producer on cnx [id: 0x581a0460, L:/10.12.3.213:43308 - R:10.10.3.46/10.10.3.46:6650]
10:21:58.893 [pulsar-io-22-2] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x581a0460, L:/10.12.3.213:43308 - R:10.10.3.46/10.10.3.46:6650] Received error from server: Producer with name 'pulsar.repl.region3' is already connected to topic
10:21:58.893 [pulsar-io-22-2] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-1] [pulsar.repl.region3] Failed to create producer: Producer with name 'pulsar.repl.region3' is already connected to topic

And in region1 I have those logs:

10:21:58.785 [pulsar-io-22-14] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308][persistent://tenant/namespace/topic-partition-1] Creating producer. producerId=22291
10:21:58.785 [ForkJoinPool.commonPool-worker-4] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308]-22291 persistent://tenant/namespace/topic-partition-1 configured with schema false
10:21:58.785 [ForkJoinPool.commonPool-worker-4] ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308] Failed to add producer to topic persistent://tenant/namespace/topic-partition-1: Producer with name 'pulsar.repl.region3' is already connected to topic

(I picked those log lines as an example because they involve the same brokers as the existing replication connection reported by partitioned-stats, but similar logs are available for all 12 partitions on all brokers of both regions.)

Checking partitioned stats on region3 for that partition:

      "replication" : {
        "region1" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 4.553733596940979E-5,
          "replicationBacklog" : 0,
          "connected" : false,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.10.0.23:57932",
          "inboundConnectedSince" : "2019-10-15T08:48:49.659Z"
        },
        "region2" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.11.0.182:45338",
          "inboundConnectedSince" : "2019-10-15T08:28:14.815Z",
          "outboundConnection" : "[id: 0xba518cf3, L:/10.12.1.15:37322 - R:10.11.1.120/10.11.1.120:6650]",
          "outboundConnectedSince" : "2019-10-15T09:19:45.524Z"
        },
        "region4" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.13.3.30:33686",
          "inboundConnectedSince" : "2019-10-15T08:56:40.66Z",
          "outboundConnection" : "[id: 0x0e8f64e2, L:/10.12.1.15:41864 - R:10.13.3.30/10.13.3.30:6650]",
          "outboundConnectedSince" : "2019-10-15T09:19:45.355Z"
        },
        "region5" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.14.1.64:56096",
          "inboundConnectedSince" : "2019-10-15T08:18:27.448Z",
          "outboundConnection" : "[id: 0x885ed221, L:/10.12.1.15:53210 - R:10.14.0.208/10.14.0.208:6650]",
          "outboundConnectedSince" : "2019-10-15T09:19:45.754Z"
        }
      },
      "deduplicationStatus" : "Disabled"
    }

Indeed, there is no outbound connection from region3 to region1.

So it looks like the broker in region3 has lost track of the existing connection to the broker in region1 (indeed, it does not appear in region3's partitioned stats), tries to open it again, but fails because the broker in region1 still has it.

Unfortunately, I do not have historical logs to check whether at some point the broker in region3 tried to drop the connection to region1 but failed, keeping the TCP connection open while losing its record of that outbound connection.
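For anyone wanting to run the same check: the mismatch can be confirmed by comparing the broker's view (partitioned-stats) with the kernel's view of the sockets. A sketch, using the IPs from the stats above:

# On the region3 broker host: list established connections to region1's broker.
ss -tn state established '( dport = :6650 )' | grep 10.10.3.46
# A socket showing up here while region3's partitioned-stats report
# "connected" : false for region1 means the broker lost track of its
# replicator producer on a still-live connection, matching the hypothesis.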

f-ld commented 5 years ago

In my previous messages I mixed a few things up (because I am discovering things as I go).

This:

10:21:58.694 [pulsar-io-22-13] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-1] [pulsar.repl.region3] Creating producer on cnx [id: 0x581a0460, L:/10.12.3.213:43308 - R:10.10.3.46/10.10.3.46:6650]

does not mean that the broker cannot see the connection; it means that the broker is creating the producer on a connection taken from its pool. So the connection exists (and this is what we see on both brokers).

I also mixed up two connections, on ports 43300 and 43308. Indeed, there is another connection between the two brokers above:

So now I am trying to find out how I can see/reset producers on the broker side, what exactly happened to get into that inconsistent state, and whether there is a configuration that would help avoid it.
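For anyone hitting the same state: one thing that should reset producers on the broker side is unloading the topic (or the namespace), since an unload closes the topic on its owning broker, replicators included, and they get recreated on the next load. A sketch; note that this briefly disconnects clients, so whether it is acceptable depends on your traffic:

# Unload one partition to force its owning broker (region1 here) to close
# and re-create its replicator producers:
pulsar-admin topics unload persistent://tenant/namespace/topic-partition-1
# Or unload the whole namespace to cycle every topic in it:
pulsar-admin namespaces unload tenant/namespace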

f-ld commented 5 years ago

Could it be linked to the fact that, after creating the clusters and brokers, when I created the tenant/namespace namespace, I already had some clients trying to produce to / consume from persistent://tenant/namespace/topic? The topic would then have been auto-created as non-partitioned, and when I created it as a partitioned topic a few (milli)seconds later, that could have left some bad state inside the brokers. (A way to check for this is sketched after the commands below.)

This is basically what I did from region1 after creating all my brokers (with consumers, for sure, and maybe producers, already trying to connect to that topic):

pulsar-admin namespaces create tenant/namespace --bundles 12
pulsar-admin namespaces set-clusters tenant/namespace --clusters region1,region2,region3,region4,region5
pulsar-admin namespaces set-message-ttl tenant/namespace --messageTTL 60
pulsar-admin namespaces set-retention tenant/namespace --time 0 --size -1
pulsar-admin topics create-partitioned-topic persistent://tenant/namespace/topic -p 12
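If that race happened, a stray auto-created non-partitioned topic could be coexisting with the partitioned-topic metadata. A way to check for it, plus the broker.conf knob that would rule the race out on a fresh setup:

# A bare persistent://tenant/namespace/topic in the first list, alongside the
# same name in the partitioned list, would point at an auto-created
# non-partitioned topic:
pulsar-admin topics list tenant/namespace
pulsar-admin topics list-partitioned-topics tenant/namespace

# Broker-side setting (broker.conf) to prevent topics from being auto-created
# before the partitioned topic exists:
#   allowAutoTopicCreation=false
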
Technoboy- commented 10 months ago

https://github.com/apache/pulsar/pull/21155 and https://github.com/apache/pulsar/pull/21144 have fixed this issue.