apache / helix

Mirror of Apache Helix
Apache License 2.0

[apache/helix] -- Add SetPartitionToError for participants to self annotate a node to ERROR state #2792

Closed csudharsanan closed 2 months ago

csudharsanan commented 3 months ago

Issues

Fixes #2791

Description

What: An API endpoint that validates the incoming request and sends a state transition message that sets one or more partitions from any current state to the ERROR state.

Why: Participants currently cannot explicitly set a partition to the ERROR state when a replica appears to be stuck in its current state. The only way a replica can reach ERROR is from within a state model. With this endpoint, clients can move a stuck replica to ERROR and then call the resetPartition endpoint to set it back to the INIT state and recover the replica. resetPartition works only on partitions in the ERROR state.
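
To make the intended recovery flow concrete, here is a minimal in-memory sketch in Java. It is not the Helix implementation: the class `PartitionErrorFlowSketch`, its method names, and the idea of passing the initial state name to the constructor are all assumptions made for illustration. It only models the two rules described above: after validation, a partition in any current state may be moved to ERROR, and a reset is accepted only for partitions that are already in ERROR.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of per-partition current state. Hypothetical; not Helix code.
class PartitionErrorFlowSketch {
    static final String ERROR = "ERROR";

    // Name of the state a reset returns to (assumed to be supplied by the caller).
    private final String initialState;
    private final Map<String, String> currentStates = new HashMap<>();

    PartitionErrorFlowSketch(String initialState, Map<String, String> states) {
        this.initialState = initialState;
        this.currentStates.putAll(states);
    }

    // Validate the whole request first, then move every named partition
    // from whatever state it is in to ERROR.
    void setPartitionsToError(List<String> partitions) {
        for (String partition : partitions) {
            if (!currentStates.containsKey(partition)) {
                throw new IllegalArgumentException("Unknown partition: " + partition);
            }
        }
        for (String partition : partitions) {
            currentStates.put(partition, ERROR);
        }
    }

    // A reset is only accepted for partitions that are currently in ERROR.
    void resetPartitions(List<String> partitions) {
        for (String partition : partitions) {
            if (!ERROR.equals(currentStates.get(partition))) {
                throw new IllegalStateException(partition + " is not in ERROR state");
            }
        }
        for (String partition : partitions) {
            currentStates.put(partition, initialState);
        }
    }

    String stateOf(String partition) {
        return currentStates.get(partition);
    }
}
```

In this sketch, calling `resetPartitions` on a replica that is stuck in a non-ERROR state is rejected, while calling `setPartitionsToError` first and then `resetPartitions` brings the replica back to the configured initial state.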

Tests


  [INFO] ------------------------------------------------------------------------
  [INFO] Reactor Summary for Apache Helix 1.3.2-SNAPSHOT:
  [INFO] 
  [INFO] Apache Helix ....................................... SUCCESS [  1.504 s]
  [INFO] Apache Helix :: Metrics Common ..................... SUCCESS [  0.244 s]
  [INFO] Apache Helix :: Metadata Store Directory Common .... SUCCESS [  0.363 s]
  [INFO] Apache Helix :: ZooKeeper API ...................... SUCCESS [  0.380 s]
  [INFO] Apache Helix :: Helix Common ....................... SUCCESS [  0.291 s]
  [INFO] Apache Helix :: Core ............................... SUCCESS [  0.306 s]
  [INFO] Apache Helix :: Admin Webapp ....................... SUCCESS [  0.606 s]
  [INFO] Apache Helix :: Restful Interface .................. SUCCESS [  0.941 s]
  [INFO] Apache Helix :: Distributed Lock ................... SUCCESS [  0.228 s]
  [INFO] Apache Helix :: HelixAgent ......................... SUCCESS [  0.187 s]
  [INFO] Apache Helix :: Recipes ............................ SUCCESS [  0.033 s]
  [INFO] Apache Helix :: Recipes :: Rabbitmq Consumer Group . SUCCESS [  0.205 s]
  [INFO] Apache Helix :: Recipes :: Rsync Replicated File Store SUCCESS [  0.248 s]
  [INFO] Apache Helix :: Recipes :: distributed lock manager  SUCCESS [  0.169 s]
  [INFO] Apache Helix :: Recipes :: distributed task execution SUCCESS [  0.246 s]
  [INFO] Apache Helix :: Recipes :: service discovery ....... SUCCESS [  0.186 s]
  [INFO] Apache Helix :: View Aggregator .................... SUCCESS [  0.167 s]
  [INFO] Apache Helix :: Meta Client ........................ SUCCESS [  0.146 s]
  [INFO] ------------------------------------------------------------------------
  [INFO] BUILD SUCCESS
  [INFO] ------------------------------------------------------------------------
  [INFO] Total time:  9.219 s
  [INFO] Finished at: 2024-04-16T13:13:02-07:00
  [INFO] ------------------------------------------------------------------------


csudharsanan commented 2 months ago

Fixed the MBean issue in HelixTask. It now supports *->ERROR transitions. Since the issue was not failing any tests, I am adding logs from before and after the fix.

Before:

Start zookeeper at localhost:2183 in thread main
START TestSetPartitionsToErrorState_testSetPartitionsToErrorState at Tue May 07 12:28:02 PDT 2024
true: wait 332ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
javax.management.RuntimeOperationsException
        at java.management/com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:298)
        at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1848)
        at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:945)
        at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:880)
        at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:315)
        at java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:523)
        at org.apache.helix.monitoring.mbeans.MBeanRegistrar.register(MBeanRegistrar.java:60)
        at org.apache.helix.monitoring.mbeans.dynamicMBeans.DynamicMBeanProvider.doRegister(DynamicMBeanProvider.java:89)
        at org.apache.helix.monitoring.mbeans.dynamicMBeans.DynamicMBeanProvider.doRegister(DynamicMBeanProvider.java:95)
        at org.apache.helix.monitoring.mbeans.StateTransitionStatMonitor.register(StateTransitionStatMonitor.java:83)
        at org.apache.helix.monitoring.mbeans.ParticipantStatusMonitor.reportTransitionStat(ParticipantStatusMonitor.java:113)
        at org.apache.helix.messaging.handling.HelixTask.reportMessageStat(HelixTask.java:335)
        at org.apache.helix.messaging.handling.HelixTask.finalCleanup(HelixTask.java:386)
        at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:185)
        at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.IllegalArgumentException: Repository: cannot add mbean for pattern name CLMParticipantReport:Cluster=TestSetPartitionsToErrorState_testSetPartitionsToErrorState,Transition=*--ERROR
        ... 19 more
true: wait 233ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
true: wait 216ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
16468 [ZkClient-EventThread-162-localhost:2183] ERROR org.apache.helix.messaging.handling.HelixTaskExecutor [] - Message xyz cannot be processed: ***, {CREATE_TIMESTAMP=1715110092791, FROM_STATE=*, MSG_ID=***, MSG_STATE=new, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=TestDB0_7, RESOURCE_NAME=TestDB0, SRC_NAME=*****, STATE_MODEL_DEF=MasterSlave, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=localhost_12918, TGT_SESSION_ID=***, TO_STATE=ERROR}{}{}Partition TestDB0_7 current state is same as toState (*->ERROR) from message.
true: wait 53ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
END TestSetPartitionsToErrorState_testSetPartitionsToErrorState at Tue May 07 12:28:15 PDT 2024

After:


Start zookeeper at localhost:2183 in thread main
START TestSetPartitionsToErrorState_testSetPartitionsToErrorState at Tue May 07 12:23:24 PDT 2024
true: wait 302ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
true: wait 202ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
true: wait 185ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
16489 [ZkClient-EventThread-162-localhost:2183] ERROR org.apache.helix.messaging.handling.HelixTaskExecutor [] - Message xyz cannot be processed: ***, {CREATE_TIMESTAMP=1715109814097, FROM_STATE=*, MSG_ID=***, MSG_STATE=new, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=TestDB0_7, RESOURCE_NAME=TestDB0, SRC_NAME=*****, STATE_MODEL_DEF=MasterSlave, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=localhost_12918, TGT_SESSION_ID=***, TO_STATE=ERROR}{}{}Partition TestDB0_7 current state is same as toState (*->ERROR) from message.
true: wait 51ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
END TestSetPartitionsToErrorState_testSetPartitionsToErrorState at Tue May 07 12:23:36 PDT 2024
AfterClass: TestSetPartitionsToErrorState called.
Shut down zookeeper at port 2183 in thread main
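
The `RuntimeOperationsException` in the "Before" log comes from JMX refusing to register an MBean under a pattern name: an unquoted `*` in a key value (here `Transition=*--ERROR`) turns the `ObjectName` into a pattern. The sketch below reproduces that check with the standard `javax.management.ObjectName` API only. The class name, the helper methods, and the `ANY` placeholder are hypothetical and are not taken from the Helix fix; replacing the wildcard is just one possible way to obtain a registrable name.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper; shows why "*--ERROR" cannot be used as an MBean name.
class TransitionMBeanNameSketch {
    // Assumed placeholder for a wildcard from-state; not the name Helix uses.
    static final String ANY_STATE = "ANY";

    // Replace the wildcard from-state with a literal token.
    static String sanitizeState(String state) {
        return "*".equals(state) ? ANY_STATE : state;
    }

    // Build a name shaped like the one in the log above.
    static ObjectName transitionName(String cluster, String fromState, String toState)
            throws MalformedObjectNameException {
        return new ObjectName("CLMParticipantReport:Cluster=" + cluster
            + ",Transition=" + fromState + "--" + toState);
    }

    // A name is only usable for registration when it is not a pattern.
    static boolean isRegistrable(ObjectName name) {
        return !name.isPattern();
    }
}
```

With these helpers, `isRegistrable(transitionName(cluster, "*", "ERROR"))` is false, while the same call with `sanitizeState("*")` as the from-state yields a non-pattern name.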

csudharsanan commented 2 months ago

This PR is ready to be merged. It adds the SetPartitionToError endpoint so that participants can self-annotate a node to the ERROR state.