Closed GrantPSpencer closed 1 week ago
Currently triggering manual runs on my fork's CI to confirm no flaky tests
@GrantPSpencer ready to checkin?
Pull request approved by: @junkaixue
Commit message: Add functionality to forcefully kill an instance
Issues
N/A - New feature to forcefully kill an instance
Description
This PR adds a new HelixAdmin and helix-rest API command to forcefully kill an instance. It does so by marking the instance's operation as UNKNOWN and then deleting the instance's LIVEINSTANCE znode. The feature is intended for the scenario where a participant is in an unrecoverable state but still holds an active connection to ZK. Marking the node as UNKNOWN removes it from calculations, and subsequently deleting the LIVEINSTANCE znode causes the controller to consider it OFFLINE. This skips the requirement that the node must process the downward state transition before topstate handoff can occur.
My current findings indicate that the LIVEINSTANCE znode will only be recreated on ZK session establishment, which occurs on initial connection and after session expiration.
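The two-step sequence described above can be illustrated with a small in-memory sketch. This is not the ZKHelixAdmin implementation: the `ForceKillSketch` class, its `Cluster` model, and the `join`, `forceKill`, and `isAssignable` methods are hypothetical names invented for illustration, and a plain `Set` stands in for the LIVEINSTANCE znodes. Returning `false` for an unknown instance is a choice made for this sketch only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model; none of these names come from Helix itself.
public class ForceKillSketch {
    enum Operation { ENABLE, UNKNOWN }

    static class Cluster {
        // Stands in for the persisted per-instance operation.
        final Map<String, Operation> instanceOperation = new HashMap<>();
        // Stands in for the set of LIVEINSTANCE znodes.
        final Set<String> liveInstances = new HashSet<>();

        // Models a participant establishing a session and registering itself.
        void join(String instance) {
            instanceOperation.putIfAbsent(instance, Operation.ENABLE);
            liveInstances.add(instance);
        }

        // Step 1: mark the operation UNKNOWN. Step 2: drop the live-instance entry.
        // The instance itself is never asked to process a state transition.
        boolean forceKill(String instance) {
            if (!instanceOperation.containsKey(instance)) {
                return false; // sketch-only behavior for an unknown instance
            }
            instanceOperation.put(instance, Operation.UNKNOWN);
            liveInstances.remove(instance);
            return true;
        }

        // An instance is considered usable only if it is live and still ENABLE.
        boolean isAssignable(String instance) {
            return liveInstances.contains(instance)
                && instanceOperation.get(instance) == Operation.ENABLE;
        }
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        cluster.join("host_1");
        cluster.join("host_2");
        cluster.forceKill("host_1");
        System.out.println("host_1 assignable: " + cluster.isAssignable("host_1"));
        System.out.println("host_2 assignable: " + cluster.isAssignable("host_2"));
    }
}
```

In this model the UNKNOWN marker is kept after the live-instance entry is removed, so a killed instance that re-registers (for example after a session expiration) would still not be treated as assignable.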
The following code changes were made:
helix-core/src/main/java/org/apache/helix/HelixAdmin.java: Added forceKillInstance method to the HelixAdmin interface.
helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java: Implemented the forceKillInstance method in the ZKHelixAdmin class.
helix-rest/src/main/java/org/apache/helix/rest/server/resources/helix/PerInstanceAccessor.java: Added forceKillInstance command to the REST API updateInstance endpoint. Called via:
Also includes miscellaneous changes:
helix-core/src/test/java/org/apache/helix/integration/rebalancer/TestInstanceOperation.java: Corrected the logger class reference.
helix-rest/src/test/java/org/apache/helix/rest/server/TestPartitionAssignmentAPI.java: Corrected the logger class reference.
helix-rest/src/test/java/org/apache/helix/rest/server/AbstractTestClass.java: Refactored resource creation logic. Added addParticipant and dropParticipant methods. Also added another test cluster to isolate testPerInstanceAccessor and testInstancesAccessor.
helix-rest/src/test/java/org/apache/helix/rest/server/TestInstancesAccessor.java: Now using isolated test cluster.
Tests
[ ] The following tests are written for this issue:
helix-core/src/test/java/org/apache/helix/integration/TestForceKillInstance.java for HelixAdmin API
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 41.027 s - in org.apache.helix.integration.TestForceKillInstance
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:16 min
[INFO] Finished at: 2024-09-03T11:37:35-07:00
[INFO] ------------------------------------------------------------------------
$ mvn test -o -Dtest=TestInstancesAccessor -pl=helix-rest
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 86.395 s - in org.apache.helix.rest.server.TestInstancesAccessor
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:40 min
[INFO] Finished at: 2024-09-03T11:35:04-07:00
[INFO] ------------------------------------------------------------------------