apache / helix

Mirror of Apache Helix
Apache License 2.0

Fix race condition when reconnect #2814

Closed xyuanlu closed 1 month ago

xyuanlu commented 1 month ago

Issues

2815

Description

A user reported that after taking a heap dump of an app that uses the leader election client for leadership orchestration, the participant ZNode was gone even though the node is still the leader.

This is caused by a race condition between the Helix ZkClient reconnect logic and the native ZooKeeper client life cycle. The native ZooKeeper client closes itself when its session expires: CONNECTED->EXPIRED->CLOSED. In Helix, when the old client's session has expired, ZkClient creates a new native ZooKeeper client and closes the old one. (code link https://github.com/apache/helix/blob/master/zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkConnection.java#L116)

In most cases, the previous ZooKeeper client is closed by Helix ZkClient after Helix ZkClient has switched to the new client, so Helix ZkClient never receives the "Closed" state event and sends out the state change events EXPIRED->CONNECTED. Sometimes, however, the old client closes itself before Helix ZkClient has switched to the new client. Helix ZkClient then receives the "Closed" state event and sends out the state change events EXPIRED->CLOSED->CONNECTED.
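The two event orderings described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR and does not use the Helix or ZooKeeper APIs: `ReconnectSketch`, `State`, `SessionTracker`, and the `recreate-participant-node` action are all hypothetical names chosen for illustration. The assumption is that a listener should treat both EXPIRED and CLOSED as "session lost" and restore its ephemeral participant node on the next CONNECTED, so that an extra CLOSED event in between does not change the outcome.

```java
import java.util.ArrayList;
import java.util.List;

public class ReconnectSketch {
    // Hypothetical stand-in for the client states named in the description.
    enum State { CONNECTED, EXPIRED, CLOSED }

    // Hypothetical listener: remembers a lost session until the next CONNECTED.
    static class SessionTracker {
        private boolean sessionLost = false;
        private final List<String> actions = new ArrayList<>();

        void onStateChanged(State state) {
            switch (state) {
                case EXPIRED:
                case CLOSED:
                    // Either event means the old session, and its ephemeral nodes, are gone.
                    sessionLost = true;
                    break;
                case CONNECTED:
                    if (sessionLost) {
                        // Placeholder for re-creating the participant ZNode.
                        actions.add("recreate-participant-node");
                        sessionLost = false;
                    }
                    break;
            }
        }

        List<String> actions() {
            return actions;
        }
    }

    // Feeds a sequence of state events to a fresh tracker and returns the actions it took.
    static List<String> replay(State... events) {
        SessionTracker tracker = new SessionTracker();
        for (State s : events) {
            tracker.onStateChanged(s);
        }
        return tracker.actions();
    }

    public static void main(String[] args) {
        // Common ordering: the old client is closed after the switch.
        System.out.println(replay(State.EXPIRED, State.CONNECTED));
        // Racy ordering: the old client closes itself before the switch.
        System.out.println(replay(State.EXPIRED, State.CLOSED, State.CONNECTED));
    }
}
```

In this sketch both orderings produce exactly one re-creation of the participant node, which is the property a listener needs in order to be insensitive to the race.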

Tests

Reproduced this issue by adding a breakpoint in the ZK code. Verified the behavior before and after the fix.

Commits

Code Quality

xyuanlu commented 1 month ago

This PR is ready to be merged. Approved by @junkaixue

Commit message "[Fix race condition when reconnect in leader election client.]"