[X] My PR addresses the following Helix issues and references them in the PR description:
2815
Description
[X] Here are some details about my PR, including screenshots of any UI changes:
A user reported that when they take a heap dump of an app that uses the leader election client for leadership orchestration, the participant ZNode is gone, yet the node is still the leader.
In most cases, the previous ZooKeeper client is closed by Helix ZkClient after Helix ZkClient has switched to a new client, so Helix ZkClient never receives the "Closed" state event. Helix ZkClient then sends out the state change events EXPIRED->CONNECTED
However, sometimes the client closes itself before Helix ZkClient has switched to a new client, so Helix ZkClient does receive the "Closed" state event. Helix ZkClient then sends out the state change events EXPIRED->CLOSED->CONNECTED
Tests
[X] The following tests are written for this issue:
Reproduced this issue by adding a breakpoint in the ZK code. Verified the before/after behavior.
The following is the result of the "mvn test" command on the appropriate module:
(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)
Changes that Break Backward Compatibility (Optional)
My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:
(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)
Documentation (Optional)
In case of new functionality, my PR adds documentation in the following wiki page:
(Link the GitHub wiki you added)
Commits
My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
Subject is separated from body by a blank line
Subject is limited to 50 characters (not including Jira issue reference)
Subject does not end with a period
Subject uses the imperative mood ("add", not "adding")
Body wraps at 72 characters
Body explains "what" and "why", not "how"
Code Quality
My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)
Issues
2815
Description
A user reported that when they take a heap dump of an app that uses the leader election client for leadership orchestration, the participant ZNode is gone, yet the node is still the leader.
This is caused by a race condition between the Helix ZkClient reconnect and the native ZooKeeper client life cycle. The ZooKeeper client closes itself when its session expires: CONNECTED->EXPIRED->CLOSED. In Helix, when the old client's session has expired, ZkClient creates a new native ZooKeeper client and closes the old one. (code link https://github.com/apache/helix/blob/master/zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkConnection.java#L116)
In most cases, the previous ZooKeeper client is closed by Helix ZkClient after Helix ZkClient has switched to a new client, so Helix ZkClient never receives the "Closed" state event. Helix ZkClient then sends out the state change events
EXPIRED->CONNECTED
However, sometimes the client closes itself before Helix ZkClient has switched to a new client, so Helix ZkClient does receive the "Closed" state event. Helix ZkClient then sends out the state change events
EXPIRED->CLOSED->CONNECTED
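To make the two orderings concrete, here is a minimal, self-contained sketch of the race. It does not use Helix or ZooKeeper classes: `SessionEventOrderSketch`, `Wrapper`, `State`, and the client ids are hypothetical names, and the assumption that the wrapper drops events from a client it has already replaced is only a simplified model of the behavior described above.

```java
import java.util.ArrayList;
import java.util.List;

public class SessionEventOrderSketch {
    // Simplified session states; not the real ZooKeeper KeeperState enum.
    public enum State { CONNECTED, EXPIRED, CLOSED }

    // Hypothetical stand-in for the wrapper client: it records only events
    // that come from the native client it currently holds.
    static final class Wrapper {
        private int currentClientId = 1;
        private final List<State> observed = new ArrayList<>();

        void onEvent(int clientId, State state) {
            if (clientId != currentClientId) {
                return; // assumption: events from a replaced client are not seen
            }
            observed.add(state);
        }

        void switchTo(int newClientId) {
            currentClientId = newClientId;
        }

        List<State> observed() {
            return observed;
        }
    }

    // Replays one session expiry. The flag decides which side wins the race.
    public static List<State> run(boolean oldClientClosesBeforeSwitch) {
        Wrapper wrapper = new Wrapper();
        wrapper.onEvent(1, State.EXPIRED);
        if (oldClientClosesBeforeSwitch) {
            wrapper.onEvent(1, State.CLOSED); // old client closed itself first
            wrapper.switchTo(2);
        } else {
            wrapper.switchTo(2);
            wrapper.onEvent(1, State.CLOSED); // closed after the switch: dropped
        }
        wrapper.onEvent(2, State.CONNECTED);
        return wrapper.observed();
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints [EXPIRED, CONNECTED]
        System.out.println(run(true));  // prints [EXPIRED, CLOSED, CONNECTED]
    }
}
```

The only difference between the two runs is whether the old client's close happens before or after the switch, which is why a listener that only expects EXPIRED->CONNECTED can be surprised by the extra CLOSED event.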
Tests
Reproduced this issue by adding a breakpoint in the ZK code. Verified the before/after behavior.