confluentinc / kafka-connect-storage-cloud

Kafka Connect suite of connectors for Cloud storage (Amazon S3)

Kafka Connect - Dead workers going in cyclic mode #75

Open ismail261 opened 7 years ago

ismail261 commented 7 years ago

Scenario:

1) The statuses topic used by the Connect group contained entries referencing old worker IDs that no longer exist.
2) Deleted all the metadata topics, recreated them, and started the connector again.
3) All the data on the status topic gets repopulated, and I don't know where it comes from.

That might be why I see these cyclic errors, where the worker continuously tries to connect to a leader worker IP that no longer exists and is unable to start the REST server.

Changing the group id fixes the problem, but I was wondering whether there is another way to approach this issue.
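For context, the group id in question is the group.id property in the distributed worker configuration. A minimal sketch, assuming a connect-distributed.properties file; the file name, the bootstrap host, and the new group name below are illustrative, not taken from this deployment:

```properties
# connect-distributed.properties (illustrative values)
bootstrap.servers=iekfk003l.load.appia.com:9092

# All workers that share this group.id form one Connect cluster.
# Pointing it at a fresh value starts a new group that no longer
# sees the stale leader/worker state recorded for the old group.
group.id=carrier-mysql-euwest-qacarrier-connect-cluster-v2
```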

[2017-07-25 21:57:05,266] INFO Herder started (org.apache.kafka.connect.runtime.distributed.DistributedHerder:195)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:05,366] INFO Discovered coordinator iekfk003l.load.appia.com:9092 (id: 2147477466 rack: null) for group carrier-mysql-euwest-qacarrier-connect-cluster. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:589)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:05,467] INFO (Re-)joining group carrier-mysql-euwest-qacarrier-connect-cluster (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:423)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:14,328] INFO Successfully joined group carrier-mysql-euwest-qacarrier-connect-cluster with generation 390 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:391)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:14,368] INFO Joined group and got assignment: Assignment{error=0, leader='connect-1-5ef00613-d69a-4d02-91c4-51545d52a0f6', leaderUrl='http://10.41.162.34:8083/', offset=260, connectorIds=[mysql-kafka-processes], taskIds=[mysql-kafka-mccmnc-0]} (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1151)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:14,466] WARN Catching up to assignment's config offset. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:740)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:14,867] INFO Current config state offset -1 is behind group assignment 260, reading to end of config log (org.apache.kafka.connect.runtime.distributed.DistributedHerder:784)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:15,368] INFO Finished reading to end of log and updated config snapshot, new config log offset: -1 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:788)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:15,369] INFO Current config state offset -1 does not match group assignment 260. Forcing rebalance. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:764)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:15,369] INFO Rebalance started (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1172)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:15,369] INFO Wasn't unable to resume work after last rebalance, can skip stopping connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1204)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:15,466] INFO (Re-)joining group carrier-mysql-euwest-qacarrier-connect-cluster (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:423)
[mysql-kafka-app-687132680-b56mk] Elapsed Time: 00:01:05
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:15,766] INFO Successfully joined group carrier-mysql-euwest-qacarrier-connect-cluster with generation 390 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:391)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:15,766] INFO Joined group and got assignment: Assignment{error=0, leader='connect-1-5ef00613-d69a-4d02-91c4-51545d52a0f6', leaderUrl='http://10.41.162.34:8083/', offset=260, connectorIds=[mysql-kafka-processes], taskIds=[mysql-kafka-mccmnc-0]} (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1151)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:15,767] WARN Catching up to assignment's config offset. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:740)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:15,767] INFO Current config state offset -1 is behind group assignment 260, reading to end of config log (org.apache.kafka.connect.runtime.distributed.DistributedHerder:784)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:16,066] INFO Finished reading to end of log and updated config snapshot, new config log offset: -1 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:788)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:16,066] INFO Current config state offset -1 does not match group assignment 260. Forcing rebalance. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:764)
[mysql-kafka-app-687132680-b56mk] [2017-07-25 21:57:16,067] INFO Rebalance started (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1172)

kkonstantine commented 7 years ago

This doesn't seem specific to this connector.

Did you delete all 3 internal topics that Connect uses to track metadata?

If that's what you intended to do, these topics are defined by the values of the following properties in the worker's config: offset.storage.topic, config.storage.topic, and status.storage.topic.
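A minimal sketch of how those properties typically look in a distributed worker config; the topic names below are the common documentation defaults, not necessarily the ones used in this cluster:

```properties
# Internal metadata topics used by Kafka Connect in distributed mode.
# They should be deleted (or kept) together; recreating only some of
# them leaves the group with inconsistent metadata.
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
```

Note that any cluster pointed at the same topics and group.id keeps writing to them, so old status entries can reappear even after the topics are recreated, which matches what the reporter observed below.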

ismail261 commented 7 years ago

I think I found what I was doing wrong on my side. All the status values that got repopulated came from the same group id, which was also being used by a different cluster whose worker IDs were not reachable from the current cluster.

Changing the group id registered all the connectors in a new group with the current worker IDs, which let them start without any issues.

I just think it would be better if there were a way to identify these issues and print an error on the console explaining what might be happening.

sheu commented 3 years ago

> I think I found what I was doing wrong on my side. All the status values that got repopulated came from the same group id, which was also being used by a different cluster whose worker IDs were not reachable from the current cluster.
>
> Changing the group id registered all the connectors in a new group with the current worker IDs, which let them start without any issues.
>
> I just think it would be better if there were a way to identify these issues and print an error on the console explaining what might be happening.

How did you change the groupId?