[5.9] Prevent further scale in/out process when a previous one failed, independently of the lock/unlock state of the config

mathieucarbou commented 1 year ago

Introduced validations that prevent a scale in/out operation from being performed if a previous one failed
Introduced a repair mode that lets the user allow the operation to proceed, with config-tool repair -force allow_scaling
Refactored the lock/unlock logic in attach and detach to catch exceptions correctly and to add markers that deny any further scale in/out after an error
Introduced flags in the diagnostic output that show whether a scale in/out process is allowed
Refactored the attach and detach inheritance between OSS and EE to consolidate the lock/unlock/marker logic in one place, so that EE only overrides the differences in behaviour
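
The consolidation described in the last point can be pictured as a template method. The following is only a rough Python sketch of that idea, not the actual Java code of this PR: the class names, the `perform` hook, the marker string and the dict-based config are all hypothetical, and what happens to the lock after a successful run is not modelled.

```python
class ScaleCommand:
    """Hypothetical shared base: the only place that knows about lock/marker handling."""

    marker = None  # hypothetical marker name, set by subclasses

    def __init__(self, config):
        # config is a plain dict with "locked" (bool) and "markers" (set)
        self.config = config

    def run(self):
        # Validation: refuse to run if a previous attempt left a deny marker.
        if self.marker in self.config["markers"]:
            raise RuntimeError(f"denied by marker: {self.marker}")
        self.config["locked"] = True
        try:
            self.perform()
        except Exception:
            # Shared error handling: place the marker, then unlock.
            self.config["markers"].add(self.marker)
            self.config["locked"] = False
            raise

    def perform(self):
        raise NotImplementedError


class OssAttach(ScaleCommand):
    marker = "deny-scale-out"

    def perform(self):
        self.config.setdefault("log", []).append("oss attach")


class EeAttach(OssAttach):
    """Overrides only the step that differs; lock/marker logic is inherited."""

    def perform(self):
        super().perform()
        self.config["log"].append("ee extra step")  # placeholder for the EE difference
```

With this shape, a subclass cannot forget to place the marker or to unlock, because it never touches that logic.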

detach command (CLI)

validation: fails if a deny scale in marker is found
lock
trigger rebalancing
- on failure:
  - try to place a deny scale in marker to prevent the detach from being replayed
  - try to unlock
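
As an illustration of the CLI-side detach flow above, here is a minimal Python sketch against an in-memory stand-in. `FakeCluster`, `best_effort`, `detach` and the marker name are hypothetical and do not come from the code base; the server-side handling after rebalancing is not modelled.

```python
DENY_SCALE_IN = "deny-scale-in"  # hypothetical marker name


class FakeCluster:
    """Hypothetical in-memory stand-in for the cluster configuration."""

    def __init__(self, rebalancing_fails=False):
        self.locked = False
        self.markers = set()
        self.rebalancing_fails = rebalancing_fails

    def lock(self):
        self.locked = True

    def unlock(self):
        self.locked = False

    def place_marker(self, marker):
        self.markers.add(marker)

    def trigger_rebalancing(self):
        if self.rebalancing_fails:
            raise RuntimeError("could not trigger rebalancing")


def best_effort(action, *args):
    """Run a cleanup action; report failure instead of raising."""
    try:
        action(*args)
        return True
    except Exception:
        return False


def detach(cluster):
    # validation: fail if a previous detach left a deny scale in marker
    if DENY_SCALE_IN in cluster.markers:
        raise RuntimeError("detach refused: deny scale in marker found")
    cluster.lock()
    try:
        cluster.trigger_rebalancing()
    except Exception:
        # on failure: try to place the marker, then try to unlock
        best_effort(cluster.place_marker, DENY_SCALE_IN)
        best_effort(cluster.unlock)
        raise
```

A second `detach` call on a cluster whose first attempt failed is rejected at the validation step, before any lock is taken.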

on rebalancing success (server-side)

on rebalancing failure (server-side)

attach command (CLI)

validations
- fails if a deny scale out marker is found
lock
attach
- on failure:
  - try to place a deny scale out marker to prevent the attach from being replayed
  - try to unlock
  - in any case, either a marker is placed or the config is kept locked
trigger rebalancing on Nomad success
- on failure:
  - config is kept locked
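
The attach flow has two distinct failure points, which the following Python sketch tries to make explicit. It is one possible reading of the steps above, under these assumptions: the config is unlocked only if the marker could be placed (so that "either a marker is placed or the config is kept locked" always holds), and all names (`attach`, `AttachError`, `RebalancingError`, the callables passed in, the marker string) are hypothetical. Server-side handling is not modelled.

```python
DENY_SCALE_OUT = "deny-scale-out"  # hypothetical marker name


class AttachError(Exception):
    pass


class RebalancingError(Exception):
    pass


def attach(config, do_attach, trigger_rebalancing, place_marker):
    """config is a dict with "locked" (bool) and "markers" (set).

    do_attach, trigger_rebalancing and place_marker are caller-supplied
    callables standing in for the real operations.
    """
    # validation: fail if a previous attach left a deny scale out marker
    if DENY_SCALE_OUT in config["markers"]:
        raise AttachError("attach refused: deny scale out marker found")
    config["locked"] = True
    try:
        do_attach()
    except Exception as e:
        try:
            place_marker(config, DENY_SCALE_OUT)
        except Exception:
            # Marker could not be placed: keep the config locked instead.
            raise AttachError("attach failed; config kept locked") from e
        config["locked"] = False
        raise AttachError("attach failed; deny scale out marker placed") from e
    try:
        trigger_rebalancing()
    except Exception as e:
        # No marker and no unlock here: the config is kept locked.
        raise RebalancingError("rebalancing not triggered; config kept locked") from e
```

In every failure path the config ends up either locked or carrying the marker, so a blind replay of the attach is rejected.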

on rebalancing success (server-side)

on rebalancing failure (server-side)

The markers exist because attach and detach trigger 2 Nomad transactions to lock and unlock (discovery/prepare/commit), and replaying them would cause 2 problems:

unnecessary append-log entries would fill the append-log
unnecessary Nomad transactions would increase the chance of colliding with a concurrent user transaction that aims to make a valid config change or repair

To allow a failed scale operation to be retried, run:

config-tool repair -force allow_scaling

mathieucarbou commented 1 year ago

@mobasherul @chrisdennis @jhouserizer : FYI, I've updated the description above to show the exact flow and error handling and how the markers work.

mathieucarbou commented 1 year ago

@mobasherul @chrisdennis : ready for review.

Terracotta-OSS / terracotta-platform