I am facing the same issue.
I have 5 datacenters, each with a Pulsar cluster. That makes 5 Pulsar clusters (named 'region1' ... 'region5'), with geo-replication for some namespaces. In the logs of the broker receiving the replication connection, I see:
09:16:10.222 [pulsar-io-22-9] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.11.0.7:51418][persistent://tenant/namespace/topic-partition-10] Creating producer. producerId=11331
09:16:10.223 [ForkJoinPool.commonPool-worker-7] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.11.0.7:51418]-11331 persistent://tenant/namespace/topic-partition-10 configured with schema false
09:16:10.224 [ForkJoinPool.commonPool-worker-7] ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.11.0.7:51418] Failed to add producer to topic persistent://tenant/namespace/topic-partition-10: Producer with name 'pulsar.repl.region2' is already connected to topic
And on the broker on the other side, in region2:
09:16:10.164 [pulsar-io-22-4] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-10] [pulsar.repl.region2] Creating producer on cnx [id: 0x7cd1d935, L:/10.11.0.7:51418 - R:10.12.0.172/10.12.0.172:6650]
09:16:10.285 [pulsar-io-22-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x7cd1d935, L:/10.11.0.7:51418 - R:10.12.0.172/10.12.0.172:6650] Received error from server: Producer with name 'pulsar.repl.region2' is already connected to topic
09:16:10.285 [pulsar-io-22-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-10] [pulsar.repl.region2] Failed to create producer: Producer with name 'pulsar.repl.region2' is already connected to topic
Right now, replication is having issues for that namespace and its replicated topics.
Here is what I can see with pulsar-admin topics partitioned-stats --per-partition tenant/namespace/topic, for example for one partition in region1:
{
  ...
  "partitions": {
    "persistent://tenant/namespace/topic-partition-1" : {
      "replication" : {
        "region2" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.11.1.120:42292",
          "inboundConnectedSince" : "2019-10-15T08:56:40.286Z",
          "outboundConnection" : "[id: 0x680b115b, L:/10.10.3.46:39946 - R:10.11.1.120/10.11.1.120:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:03.798Z"
        },
        "region3" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.01666736530872926,
          "replicationBacklog" : 0,
          "connected" : false,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.12.3.213:43300",
          "inboundConnectedSince" : "2019-10-15T08:47:29.728Z"
        },
        "region4" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.13.3.30:45422",
          "inboundConnectedSince" : "2019-10-15T08:56:41.039Z",
          "outboundConnection" : "[id: 0x27db1c8d, L:/10.10.3.46:56448 - R:10.13.3.30/10.13.3.30:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:04.282Z"
        },
        "region5" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.14.1.64:36956",
          "inboundConnectedSince" : "2019-10-15T08:17:24.712Z",
          "outboundConnection" : "[id: 0x674e53e5, L:/10.10.3.46:54772 - R:10.14.0.208/10.14.0.208:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:03.626Z"
        }
      },
      "deduplicationStatus" : "Disabled"
    }
  }
}
As we can see, there is no outbound connection for region3, hence no replication from region1 to region3. That is the connection being attempted, and it fails with the logs above. There is, however, an inbound connection for every region, which is why region3 and region5 cannot reopen theirs (region3 and region5 have no outbound connections for that topic).
Still digging, but I think there is a bug around here.
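To quickly see which partitions are affected for a given remote cluster, a small filter on the partitioned stats can help. This is just a sketch: it assumes jq is available and relies on the JSON layout shown above; replace region3 with the remote cluster you are checking.
pulsar-admin topics partitioned-stats --per-partition persistent://tenant/namespace/topic \
  | jq '.partitions | to_entries[] | select(.value.replication.region3.connected == false) | .key'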
Some additional information to help understand the logs and explanations above.
IPs per region: region1 uses 10.10.x.x, region2 10.11.x.x, region3 10.12.x.x, region4 10.13.x.x and region5 10.14.x.x (as can be seen in the logs and stats above).
We have 5 brokers per region and, for that topic, 12 partitions.
Regarding the partitioned-stats from the previous message and the inbound connection from region3 to region1, I can see it on the broker in region1:
tcp 0 0 10.10.3.46:6650 10.12.3.213:43300 ESTABLISHED 11/java
and on the broker in region3:
tcp 0 0 10.12.3.213:43300 10.10.3.46:6650 ESTABLISHED 11/java
So that inbound connection is actually real.
And for that specific partition, on that specific broker in region3, I have these logs:
10:21:58.694 [pulsar-io-22-13] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-1] [pulsar.repl.region3] Creating producer on cnx [id: 0x581a0460, L:/10.12.3.213:43308 - R:10.10.3.46/10.10.3.46:6650]
10:21:58.893 [pulsar-io-22-2] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x581a0460, L:/10.12.3.213:43308 - R:10.10.3.46/10.10.3.46:6650] Received error from server: Producer with name 'pulsar.repl.region3' is already connected to topic
10:21:58.893 [pulsar-io-22-2] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-1] [pulsar.repl.region3] Failed to create producer: Producer with name 'pulsar.repl.region3' is already connected to topic
And in region1 I have these logs:
10:21:58.785 [pulsar-io-22-14] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308][persistent://tenant/namespace/topic-partition-1] Creating producer. producerId=22291
10:21:58.785 [ForkJoinPool.commonPool-worker-4] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308]-22291 persistent://tenant/namespace/topic-partition-1 configured with schema false
10:21:58.785 [ForkJoinPool.commonPool-worker-4] ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308] Failed to add producer to topic persistent://tenant/namespace/topic-partition-1: Producer with name 'pulsar.repl.region3' is already connected to topic
(I picked that example because the logs involve the same brokers as the existing replication connection reported by partitioned-stats, but similar logs show up for all 12 partitions on all brokers of both regions.)
Checking partitioned stats on region3 for that partition:
"replication" : {
"region1" : {
"msgRateIn" : 0.0,
"msgThroughputIn" : 0.0,
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"msgRateExpired" : 4.553733596940979E-5,
"replicationBacklog" : 0,
"connected" : false,
"replicationDelayInSeconds" : 0,
"inboundConnection" : "/10.10.0.23:57932",
"inboundConnectedSince" : "2019-10-15T08:48:49.659Z"
},
"region2" : {
"msgRateIn" : 0.0,
"msgThroughputIn" : 0.0,
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"msgRateExpired" : 0.0,
"replicationBacklog" : 0,
"connected" : true,
"replicationDelayInSeconds" : 0,
"inboundConnection" : "/10.11.0.182:45338",
"inboundConnectedSince" : "2019-10-15T08:28:14.815Z",
"outboundConnection" : "[id: 0xba518cf3, L:/10.12.1.15:37322 - R:10.11.1.120/10.11.1.120:6650]",
"outboundConnectedSince" : "2019-10-15T09:19:45.524Z"
},
"region4" : {
"msgRateIn" : 0.0,
"msgThroughputIn" : 0.0,
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"msgRateExpired" : 0.0,
"replicationBacklog" : 0,
"connected" : true,
"replicationDelayInSeconds" : 0,
"inboundConnection" : "/10.13.3.30:33686",
"inboundConnectedSince" : "2019-10-15T08:56:40.66Z",
"outboundConnection" : "[id: 0x0e8f64e2, L:/10.12.1.15:41864 - R:10.13.3.30/10.13.3.30:6650]",
"outboundConnectedSince" : "2019-10-15T09:19:45.355Z"
},
"region5" : {
"msgRateIn" : 0.0,
"msgThroughputIn" : 0.0,
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"msgRateExpired" : 0.0,
"replicationBacklog" : 0,
"connected" : true,
"replicationDelayInSeconds" : 0,
"inboundConnection" : "/10.14.1.64:56096",
"inboundConnectedSince" : "2019-10-15T08:18:27.448Z",
"outboundConnection" : "[id: 0x885ed221, L:/10.12.1.15:53210 - R:10.14.0.208/10.14.0.208:6650]",
"outboundConnectedSince" : "2019-10-15T09:19:45.754Z"
},
},
"deduplicationStatus" : "Disabled"
}
Indeed, there is no outbound connection from region3 to region1.
So it looks like the broker in region3 has lost track of the existing replication producer towards the broker in region1 (indeed, it does not appear in the partitioned stats on region3), tries to open it again, and fails because the broker in region1 still has it registered.
Unfortunately, I do not have historical logs to check whether, at some point, the broker in region3 tried to drop the connection to region1 but failed (keeping the TCP connection but losing the record of that outbound connection).
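To double-check that the broker in region1 still holds the stale replication producer, listing the publishers of that partition against the region1 cluster should show it. A sketch (assuming jq, and assuming the replication producer appears like any other publisher in the topic stats):
pulsar-admin topics stats persistent://tenant/namespace/topic-partition-1 \
  | jq '.publishers[].producerName'
# expected to list "pulsar.repl.region3" even though region3 itself has lost track of it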
In my previous messages I mixed things up a bit (because I am discovering things as I go).
This:
10:21:58.694 [pulsar-io-22-13] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-1] [pulsar.repl.region3] Creating producer on cnx [id: 0x581a0460, L:/10.12.3.213:43308 - R:10.10.3.46/10.10.3.46:6650]
does not mean that the broker is not seeing the connection; it means that the broker is creating the producer on a connection taken from its pool. So the connection exists (and that is what we see on both brokers).
I also mixed up two connections, on ports 43300 and 43308. Indeed, there is another connection between the two brokers above:
tcp 0 0 10.10.3.46:6650 10.12.3.213:43308 ESTABLISHED 11/java
tcp 0 0 10.12.3.213:43308 10.10.3.46:6650 ESTABLISHED 11/java
So now I am trying to find out how I can inspect/reset producers on the broker side, what exactly happened to end up in that inconsistent state, and whether there is a configuration that would help avoid it.
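One thing I may try as a workaround (a sketch, assuming the stale producer entry only lives in the memory of the broker that currently owns the partition): unload the affected partition on the cluster that still reports the old replication producer, so that the owning broker drops its state and the replicator can reconnect.
# run against the cluster that still holds the stale 'pulsar.repl.regionX' producer
pulsar-admin topics unload persistent://tenant/namespace/topic-partition-1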
Could it be linked to the fact that, after creating the clusters and brokers, when I created the tenant/namespace namespace, I already had some clients trying to produce to / consume from persistent://tenant/namespace/topic?
The topic would then have been auto-created as non-partitioned, and a few (milli)seconds later I created it as partitioned, causing some bad state inside the brokers?
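If that hypothesis is right, disabling topic auto-creation on the brokers before running the namespace/topic setup might avoid the race. A sketch: allowAutoTopicCreation is the broker.conf setting for this, but whether it is available depends on the Pulsar version in use.
# broker.conf
allowAutoTopicCreation=false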
This is basically what I did from region1 after creating all my brokers (with consumers, for sure, and maybe producers, already trying to connect to that topic):
pulsar-admin namespaces create tenant/namespace --bundles 12
pulsar-admin namespaces set-clusters tenant/namespace --clusters region1,region2,region3,region4,region5
pulsar-admin namespaces set-message-ttl tenant/namespace --messageTTL 60
pulsar-admin namespaces set-retention tenant/namespace --time 0 --size -1
pulsar-admin topics create-partitioned-topic persistent://tenant/namespace/topic -p 12
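To check whether a non-partitioned topic had already been auto-created before the create-partitioned-topic call, comparing the plain topic list with the partitioned-topic list of the namespace should reveal it (a sketch):
pulsar-admin topics list tenant/namespace
pulsar-admin topics list-partitioned-topics tenant/namespace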
https://github.com/apache/pulsar/pull/21155 and https://github.com/apache/pulsar/pull/21144 have fixed this issue.
Describe the bug: The broker prints the following log continuously.
To Reproduce: Well, it's hard to reproduce.
Expected behavior: Stop creating a replication producer when a replication producer with the same name is already connected.