apache / helix

Mirror of Apache Helix
Apache License 2.0

Add functionality to forcefully kill an instance #2898

Closed GrantPSpencer closed 1 week ago

GrantPSpencer commented 1 week ago

Issues

N/A - New feature to forcefully kill an instance

Description

This PR adds a new HelixAdmin and Helix-rest API command to forcefully kill an instance. The command marks the instance's operation as UNKNOWN and then deletes its LIVEINSTANCE znode. It is intended for cases where a participant is in an unrecoverable state but still holds an active connection to ZK. Marking the instance as UNKNOWN removes it from calculations, and deleting the LIVEINSTANCE znode then causes the controller to treat it as OFFLINE. This skips the requirement that the node process the downward state transition before topstate handoff can occur.

My current findings indicate that the LIVEINSTANCE znode will only be recreated on ZK session establishment, which occurs on initial connection and after session expiration.
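To make the ordering concrete, here is a minimal in-memory sketch of the sequence described above. It is a toy model, not Helix code: `ForceKillSketch`, `ClusterState`, `establishSession`, `forceKill`, and `isAssignable` are hypothetical names invented for illustration, and the maps only stand in for the znodes. It also assumes the UNKNOWN marking stays in place until something explicitly changes it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy in-memory model of the force-kill sequence; all names are hypothetical. */
public class ForceKillSketch {
    enum Operation { ENABLE, UNKNOWN }

    /** Stand-in for the cluster metadata that Helix keeps in ZooKeeper. */
    static class ClusterState {
        final Map<String, Operation> instanceOperation = new HashMap<>();
        final Set<String> liveInstances = new HashSet<>();

        /** Models ZK session establishment: the participant (re)creates its live entry. */
        void establishSession(String instance) {
            instanceOperation.putIfAbsent(instance, Operation.ENABLE);
            liveInstances.add(instance);
        }

        /** Step 1: mark the operation UNKNOWN. Step 2: delete the live entry. */
        boolean forceKill(String instance) {
            if (!liveInstances.contains(instance)) {
                return false; // nothing to kill
            }
            instanceOperation.put(instance, Operation.UNKNOWN);
            return liveInstances.remove(instance);
        }

        /** Model assumption: an instance counts only if it is live and not UNKNOWN. */
        boolean isAssignable(String instance) {
            return liveInstances.contains(instance)
                && instanceOperation.get(instance) != Operation.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        ClusterState cluster = new ClusterState();
        cluster.establishSession("host_1");
        System.out.println("assignable before kill: " + cluster.isAssignable("host_1"));

        cluster.forceKill("host_1");
        System.out.println("live after kill: " + cluster.liveInstances.contains("host_1"));

        // The live entry only comes back on a new session, e.g. after session expiration.
        cluster.establishSession("host_1");
        System.out.println("live after new session: " + cluster.liveInstances.contains("host_1"));
        System.out.println("assignable after new session: " + cluster.isAssignable("host_1"));
    }
}
```

In this model, marking UNKNOWN before the delete means a participant that later establishes a new session and recreates its live entry is still left out of `isAssignable`. Whether the real controller behaves the same way after a reconnect is not confirmed here.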

The following code changes were made:

Also includes miscellaneous changes:

Tests

[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 41.027 s - in org.apache.helix.integration.TestForceKillInstance
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:16 min
[INFO] Finished at: 2024-09-03T11:37:35-07:00
[INFO] ------------------------------------------------------------------------


* `testForceKillInstance` in `helix-rest/src/test/java/org/apache/helix/rest/server/TestPerInstanceAccessor.java` for Helix-Rest API

$ mvn test -o -Dtest=TestInstancesAccessor -pl=helix-rest

[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 86.395 s - in org.apache.helix.rest.server.TestInstancesAccessor
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:40 min
[INFO] Finished at: 2024-09-03T11:35:04-07:00
[INFO] ------------------------------------------------------------------------



### Changes that Break Backward Compatibility (Optional)

- My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:
N/A

### Commits

- My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Code Quality

- My diff has been formatted using helix-style.xml (helix-style-intellij.xml if IntelliJ IDE is used)

GrantPSpencer commented 1 week ago

Currently triggering manual runs on my fork's CI to confirm no flaky tests

junkaixue commented 1 week ago

@GrantPSpencer ready to checkin?

GrantPSpencer commented 1 week ago

Pull request approved by: @junkaixue
Commit message: Add functionality to forcefully kill an instance