apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Leadership Election Not Happening #15491

Open humutkazan opened 9 months ago

humutkazan commented 9 months ago

We recently encountered an issue that we think is related to leadership election not happening. The coordinator log file is attached.

The log first shows:

2023-12-01T05:06:05,659 INFO [LeaderSelector[/druid/coordinator/_COORDINATOR]] org.apache.druid.server.coordinator.DruidCoordinator - I am the leader of the coordinators, all must bow! Starting coordination in [PT10S].

but later the same day it reports:

15:23,973 INFO [LeaderSelector[/druid/coordinator/_COORDINATOR]] org.apache.druid.server.coordinator.DruidCoordinator - I am no longer the leader...

and from this point on the log contains errors such as:

2023-12-01T05:15:57,065 ERROR [KafkaSupervisor-enriched-analog-Reporting-0] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Failed to get task runner because I'm not the leader!

along with connection timeout exceptions and similar errors.
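To see when leadership was gained and lost, the coordinator log can be scanned for the two messages quoted above. The following is a minimal sketch, assuming the log format shown in this report; the function names `leadership_events` and `ends_without_leader` are hypothetical helpers, not part of Druid.

```python
import re

# Message fragments copied from the log lines quoted in this report.
GAINED = "I am the leader of the coordinators"
LOST = "I am no longer the leader"

# Full timestamp at the start of a log line, e.g. 2023-12-01T05:06:05,659.
# Lines whose timestamp is truncated will simply yield None.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\s")


def leadership_events(lines):
    """Return a list of (timestamp or None, 'gained' | 'lost') in log order."""
    events = []
    for line in lines:
        if GAINED in line:
            kind = "gained"
        elif LOST in line:
            kind = "lost"
        else:
            continue
        match = TS.match(line)
        events.append((match.group(1) if match else None, kind))
    return events


def ends_without_leader(events):
    """True if the last transition seen is a loss with no later re-election."""
    return bool(events) and events[-1][1] == "lost"
```

Calling `leadership_events(open("0112.log", encoding="utf-8"))` on the attached file and then `ends_without_leader(...)` on the result would show whether the log ends with the coordinator still out of leadership.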

Druid version: 27.0.0

0112.log

rasgele commented 9 months ago

ZK goes online, then offline, and then finally online again. After the last successful connection, no progress was observed. I suspect that the initial request to ZK is somehow hung and, since no timeout is set on the ZK client, it waits indefinitely. I am guessing this based on this log line:

2023-12-01T05:15:53,975 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=

Could that be the reason?
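The hypothesis above can be illustrated with a toy model. This is not ZooKeeper's client code; it only assumes, as the comment suggests, that a request timeout of 0 means "no upper bound on the wait". The function name `wait_for_reply` and the use of a queue to stand in for server replies are illustrative choices.

```python
import queue


def wait_for_reply(replies, timeout_ms):
    """Wait for one reply from the queue.

    Assumption for this sketch: timeout_ms == 0 disables the timeout,
    so the caller blocks until a reply arrives, however long that takes.
    Returns the reply, or None if a positive timeout expires first.
    """
    timeout_s = None if timeout_ms == 0 else timeout_ms / 1000.0
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        return None
```

With an empty queue, `wait_for_reply(q, 50)` gives up after about 50 ms and returns `None`, whereas `wait_for_reply(q, 0)` would never return unless something is eventually put on the queue. If the real client behaves the same way, a lost reply during the ZK restart could leave the caller stuck.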

rasgele commented 9 months ago

Could it be related to this? https://stackoverflow.com/questions/66519657/java-curator-zookeeper-client-hanging-indefinitely-when-zookeeper-is-not-availab

acherla commented 9 months ago

What version of ZooKeeper are you using? And did you check that your ZooKeeper quorum is actually configured correctly and stable?

rasgele commented 8 months ago

It is 3.8.1 ("zookeeper.version" : "3.8.1-74db005175a4ec545697012f9069cb9dcc8cdda7, built on 2023-01-25 16:31 UTC"). In this case we're running a single instance of ZooKeeper.

The problem is reproducible by just restarting ZooKeeper. Even though the coordinator detects that ZooKeeper is back, it is not elected leader again.

I observe the same with multiple coordinator instances as well.
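One way to confirm the reproduction is to ask each coordinator whether it currently considers itself the leader, before and after restarting ZooKeeper. The sketch below assumes the coordinator's `/druid/coordinator/v1/isLeader` endpoint, which returns HTTP 200 with a JSON body such as `{"leader": true}` on the leader and a non-200 status otherwise. The base URL in the usage note is a placeholder, and the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def parse_is_leader(status, body):
    """Interpret an isLeader response: leader only on HTTP 200 with leader=true."""
    if status != 200:
        return False
    try:
        return bool(json.loads(body).get("leader"))
    except (ValueError, AttributeError):
        # Body was not JSON, or not a JSON object.
        return False


def is_leader(base_url, timeout_s=5.0):
    """Query one coordinator; unreachable or erroring nodes count as non-leaders."""
    url = base_url.rstrip("/") + "/druid/coordinator/v1/isLeader"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return parse_is_leader(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive here; still inspect status and body.
        return parse_is_leader(err.code, err.read())
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return False
```

For example, `is_leader("http://coordinator-host:8081")` (placeholder host, default coordinator port assumed) could be called for every coordinator in the cluster. If all of them keep returning `False` well after ZooKeeper is back, that matches the behavior described here.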

acherla commented 8 months ago

Try downgrading to 3.6.x

rasgele commented 8 months ago

Try downgrading to 3.6.x

I tried with 3.6.0 but nothing seemed to change.

lejinghu commented 2 months ago

Seeing the same issue with Druid 27 + ZooKeeper 3.7.2 and Druid 29.0.1 + ZooKeeper 3.8.4.