Node failover

olegkovalenko commented 8 years ago

MOTIVATION:

A C* node can fail for reasons such as misconfiguration or memory problems. To mitigate such failures and handle them in an automated way, let's introduce failover delay, max delay and max tries.

PROPOSED CHANGE:

When a node fails, the DSE Mesos scheduler assumes that the failure is recoverable. The scheduler will try to restart the node after waiting for failover-delay (e.g. 30s, 2m). The initial delay is equal to the failover-delay setting. After each consecutive failure the delay is doubled until it reaches the failover-max-delay value.

If failover-max-tries is defined and the consecutive failure count exceeds it, the node will be deactivated.
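
The doubling rule and the max-tries check above can be sketched as follows. This is a minimal illustration, not the scheduler's actual code: the function names and the use of Python are assumptions, and "exceeds" is taken literally as a strict comparison.

```python
from datetime import timedelta
from typing import Optional


def current_delay(failures: int, delay: timedelta, max_delay: timedelta) -> timedelta:
    """Delay to wait after `failures` consecutive failures (hypothetical helper)."""
    if failures < 1:
        return timedelta(0)  # no failure registered, nothing to wait for
    result = delay  # the first failure waits exactly failover-delay
    for _ in range(failures - 1):
        if result >= max_delay:
            break  # already at the cap, no need to keep doubling
        result *= 2
    return min(result, max_delay)


def is_max_tries_exceeded(failures: int, max_tries: Optional[int]) -> bool:
    """True when max tries is defined and the failure count exceeds it."""
    return max_tries is not None and failures > max_tries
```

With failover-delay set to 30s and failover-max-delay set to 2m, consecutive failures would wait 30s, 60s, 120s, and then stay at 120s.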

The following failover settings exist:

--failover-delay     - initial failover delay to wait after failure (option value is required)
--failover-max-delay - max failover delay (option value is required)
--failover-max-tries - max failover tries before the node is deactivated (to reset to unbounded, pass --failover-max-tries "")
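
As an illustration of how such option values might be interpreted, here is a small sketch. The period grammar is an assumption: only values like 30s and 2m appear in this description, so the `h` unit and the helper names `parse_period` and `parse_max_tries` are hypothetical.

```python
import re
from datetime import timedelta
from typing import Optional

# Assumed unit suffixes; only "s" and "m" are shown in the description.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_period(value: str) -> timedelta:
    """Parse a period such as '30s' or '2m' (hypothetical grammar)."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"invalid period: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(seconds=amount * _UNIT_SECONDS[unit])


def parse_max_tries(value: str) -> Optional[int]:
    """An empty string resets max tries to unbounded, represented as None."""
    if value == "":
        return None
    tries = int(value)
    if tries < 0:
        raise ValueError(f"max tries must be non-negative: {value!r}")
    return tries
```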

CLI changes: node add and node update will allow configuring --failover-delay, --failover-max-delay, and --failover-max-tries.

HTTP server changes: /api/node/add and /api/node/update will allow configuring failoverDelay, failoverMaxDelay, and failoverMaxTries.

Scheduler changes:

when deciding whether to start or stop a node, take into account whether failover is still waiting out its delay
reset failures
- when the node has been successfully started (on status update when started)
- when the node has been stopped
register failure
- when a task update is received with task status failed, lost, or error
- stop the node when max tries is exceeded
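
The bookkeeping described in this list can be sketched as a small state holder plus a status handler. Everything here is an assumption made for illustration: the class and method names, the default values, the lowercase status strings, and the returned action labels are hypothetical and not taken from the scheduler's code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Failover:
    delay: timedelta = timedelta(seconds=30)     # hypothetical default
    max_delay: timedelta = timedelta(minutes=2)  # hypothetical default
    max_tries: Optional[int] = None              # None means unbounded
    failures: int = 0
    failure_time: Optional[datetime] = None

    def current_delay(self) -> timedelta:
        """Initial delay doubled per consecutive failure, capped at max_delay."""
        if self.failures < 1:
            return timedelta(0)
        result = self.delay
        for _ in range(self.failures - 1):
            if result >= self.max_delay:
                break
            result *= 2
        return min(result, self.max_delay)

    def is_waiting_delay(self, now: datetime) -> bool:
        """True while the node must not be restarted yet."""
        if self.failure_time is None:
            return False
        return now < self.failure_time + self.current_delay()

    def register_failure(self, now: datetime) -> None:
        self.failures += 1
        self.failure_time = now

    def reset_failures(self) -> None:
        self.failures = 0
        self.failure_time = None

    def is_max_tries_exceeded(self) -> bool:
        return self.max_tries is not None and self.failures > self.max_tries


def on_task_status(failover: Failover, status: str, now: datetime) -> str:
    """Return the action to take: 'none', 'retry', or 'stop'."""
    if status in ("started", "stopped"):
        failover.reset_failures()
        return "none"
    if status in ("failed", "lost", "error"):
        failover.register_failure(now)
        return "stop" if failover.is_max_tries_exceeded() else "retry"
    return "none"
```

A scheduler loop built on this sketch would skip launching a node while `is_waiting_delay(now)` is true, and would deactivate the node once `on_task_status` returns `stop`.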

C* storage changes:

add the ability to store failover state by introducing the following columns:

  node_failover_delay text,
  node_failover_max_delay text,
  node_failover_max_tries int,
  node_failover_failures int,
  node_failover_failure_time timestamp

RESULT: failover with an increasing delay and the ability to stop a node after max tries (fixes #28)

dmitrypekar commented 8 years ago

Everything looks good, except for one point: IMHO, it would be much better to split the following test methods:

NodeCliTest.handleAddUpdate - split into groups by option group;
SchedulerTest.onTaskStopped - extract failover-related scenarios into a separate method like onTaskStopped_failover;

I will also do some testing now.

dmitrypekar commented 8 years ago

Thanks for the update! Merged.

olegkovalenko commented 8 years ago

Thanks!

elodina / datastax-enterprise-mesos

Node failover #76