apache / helix

Mirror of Apache Helix
Apache License 2.0

Election client may fail to re-create participant ZNode when session expires. #2815

Closed xyuanlu closed 2 days ago

xyuanlu commented 1 month ago

Describe the bug

A user reports that when they take a heap dump of an app that uses the leader election client for leadership orchestration, the participant ZNode is gone. The node is still the leader.

To Reproduce

Run the unit test TestLeaderElection.testSessionExpire() and set a debug breakpoint in ClientCnxn.run() at the line

ClientCnxn.this.eventThread.queueEvent(new WatchedEvent(EventType.None, KeeperState.Closed, (String)null));

When execution reaches this line, keep it blocked for a while and then continue.

The unit test will fail because the participant nodes are not re-created after reconnect.

Expected behavior

Participant nodes should be recreated after reconnect.

Additional context

This is caused by a race condition between the Helix ZkClient reconnect logic and the native ZooKeeper client life cycle. The native ZooKeeper client closes itself when its session expires: CONNECTED->EXPIRED->CLOSED. In Helix, when the old client's session has expired, ZkClient creates a new native ZooKeeper client and closes the old one. (code link https://github.com/apache/helix/blob/master/zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkConnection.java#L116)

In most cases, the previous ZooKeeper client is closed by Helix ZkClient after Helix ZkClient has switched to a new client, so Helix ZkClient never receives the "Closed" state event and sends out the state change events EXPIRED->CONNECTED. However, sometimes the old client closes itself before Helix ZkClient has switched to the new client. In that case Helix ZkClient receives the "Closed" state event and sends out the state change events EXPIRED->CLOSED->CONNECTED.
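
To make the two event orderings concrete, here is a minimal, illustrative sketch of a listener-side tracker that requests participant node re-creation on the first CONNECTED after EXPIRED, whether or not a CLOSED event arrives in between. The class `ParticipantSessionTracker`, its `State` enum, and `onStateChanged` are hypothetical names made up for this example. They are not Helix or ZooKeeper APIs, and this is not necessarily how the issue was fixed.

```java
// Hypothetical sketch: none of these names come from Helix or ZooKeeper.
final class ParticipantSessionTracker {

    // Simplified stand-in for the session states discussed in this issue.
    enum State { CONNECTED, DISCONNECTED, EXPIRED, CLOSED }

    // True from the moment the session expires until the next CONNECTED event.
    private boolean pendingRecreate = false;

    /**
     * Feeds one state change event into the tracker.
     * Returns true when the caller should re-create the participant ZNode.
     */
    synchronized boolean onStateChanged(State state) {
        switch (state) {
            case EXPIRED:
                // The ephemeral participant node is gone once the session expires.
                pendingRecreate = true;
                return false;
            case CONNECTED:
                // Re-create only if an expiry happened since the last connect.
                boolean recreate = pendingRecreate;
                pendingRecreate = false;
                return recreate;
            default:
                // CLOSED and DISCONNECTED must not clear the pending flag;
                // otherwise EXPIRED->CLOSED->CONNECTED would skip re-creation.
                return false;
        }
    }
}
```

With this tracker, both EXPIRED->CONNECTED and EXPIRED->CLOSED->CONNECTED end with `onStateChanged(State.CONNECTED)` returning true, while a plain reconnect without an expiry returns false.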