MOTIVATION:
A C* node can fail for reasons such as misconfiguration or memory exhaustion. To mitigate
such failures and handle them automatically, let's introduce a failover delay,
a max delay and max tries.
PROPOSED CHANGE:
When a node fails, the DSE Mesos scheduler assumes that the failure is recoverable. The scheduler tries
to restart the node after waiting for the failover delay (e.g. 30s, 2m). The initial delay equals the failover-delay setting.
After each consecutive failure the delay is doubled until it reaches the failover-max-delay value.
If failover-max-tries is defined and the consecutive failure count exceeds it, the node is deactivated.
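The delay rule above can be sketched as follows. This is a minimal illustration, not the scheduler's code: the function names are hypothetical, delays are assumed to be in seconds, and "exceeds" is read as a strict greater-than comparison.

```python
from typing import Optional


def current_delay(failures: int, delay_s: float, max_delay_s: float) -> float:
    """Delay to wait before the next restart after `failures` consecutive failures."""
    if failures <= 0:
        return 0.0
    delay = delay_s  # the first failure waits exactly failover-delay
    for _ in range(failures - 1):
        if delay >= max_delay_s:
            break
        delay *= 2  # doubled after each further consecutive failure
    return min(delay, max_delay_s)  # never above failover-max-delay


def max_tries_exceeded(failures: int, max_tries: Optional[int]) -> bool:
    """True only when a limit is defined and the failure count has gone past it."""
    return max_tries is not None and failures > max_tries
```

With failover-delay 30s and failover-max-delay 2m, consecutive failures would wait 30s, 60s, 120s, 120s and so on under these assumptions.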
The following failover settings exist:
--failover-delay - initial delay to wait after a failure (option value is required)
--failover-max-delay - upper bound on the failover delay (option value is required)
--failover-max-tries - max number of failover tries before the node is deactivated (to reset to unbounded, pass --failover-max-tries "")
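To illustrate how such option values might be interpreted, here is a hedged sketch. The helper names are hypothetical, and the accepted units are assumed to be only the `s` and `m` suffixes seen in the examples (30s, 2m); the real parser may accept more.

```python
import re
from typing import Optional

# Assumption: only second and minute suffixes, as in the examples "30s" and "2m".
_UNIT_SECONDS = {"s": 1, "m": 60}


def parse_delay(value: str) -> int:
    """Convert a value such as "30s" or "2m" into seconds."""
    match = re.fullmatch(r"(\d+)([sm])", value.strip())
    if match is None:
        raise ValueError(f"invalid delay: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def parse_max_tries(value: str) -> Optional[int]:
    """An empty string resets the limit to unbounded (None)."""
    if value == "":
        return None
    tries = int(value)
    if tries < 0:
        raise ValueError(f"invalid max tries: {value!r}")
    return tries
```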
CLI changes:
node add and node update accept --failover-delay, --failover-max-delay and --failover-max-tries
HTTP server changes:
/api/node/add and /api/node/update accept failoverDelay, failoverMaxDelay and failoverMaxTries
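As a rough sketch of how a client might address these endpoints, the snippet below only builds a URL and sends nothing. The base URL, the `node` parameter name and the use of query-string parameters are assumptions made for illustration; only the endpoint path and the three failover parameter names come from this change.

```python
from urllib.parse import urlencode


def node_update_url(base_url: str, node_id: str, **failover: str) -> str:
    """Build an /api/node/update URL carrying the failover parameters."""
    # "node" is a hypothetical parameter name used to identify the node.
    params = {"node": node_id, **failover}
    return f"{base_url}/api/node/update?{urlencode(params)}"


url = node_update_url(
    "http://localhost:7000",  # hypothetical scheduler address
    "0",
    failoverDelay="30s",
    failoverMaxDelay="2m",
    failoverMaxTries="5",
)
```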
Scheduler changes:
when deciding whether to start or stop a node, take into account whether failover is still waiting out its delay
reset failures:
  when the node has been started successfully (on the status update reporting it as started)
  when the node has been stopped
register a failure:
  when a task update arrives with task status failed, lost or error
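The bookkeeping described in the scheduler changes could look roughly like this. It is a sketch under assumptions: the class, field and function names are hypothetical, times are plain seconds passed in by the caller, and the status names are the standard Mesos task states corresponding to "failed, lost, error" and "started".

```python
from dataclasses import dataclass
from typing import Optional

# Mesos task states treated as failures.
FAILED_STATES = {"TASK_FAILED", "TASK_LOST", "TASK_ERROR"}


@dataclass
class Failover:
    delay_s: float                       # failover-delay
    max_delay_s: float                   # failover-max-delay
    max_tries: Optional[int] = None      # failover-max-tries, None = unbounded
    failures: int = 0                    # consecutive failure count
    failure_time: Optional[float] = None  # time of the latest failure

    def current_delay(self) -> float:
        if self.failures == 0:
            return 0.0
        delay = self.delay_s
        for _ in range(self.failures - 1):
            if delay >= self.max_delay_s:
                break
            delay *= 2
        return min(delay, self.max_delay_s)

    def register_failure(self, now: float) -> None:
        self.failures += 1
        self.failure_time = now

    def reset_failures(self) -> None:
        self.failures = 0
        self.failure_time = None

    def is_waiting_delay(self, now: float) -> bool:
        # While this is True the scheduler should not try to start the node.
        return (self.failure_time is not None
                and now < self.failure_time + self.current_delay())


def on_status_update(failover: Failover, state: str, now: float) -> None:
    """Register or reset failures based on a task status update."""
    if state in FAILED_STATES:
        failover.register_failure(now)
    elif state == "TASK_RUNNING":
        failover.reset_failures()
```

Stopping a node would call `reset_failures()` as well, so that a later start begins again from the initial delay.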
C* storage changes:
add ability to store failover, introduce columns:
RESULT: failover with increased delay and ability to stop node after max tries (fixes #28)