HALON-911: fix processes restart after node failure

andriytk commented 5 years ago

Sometimes, on cluster startup, a node's processes could get stuck in a failed state with 'node failure' status. This happened because the node::process::start rule instance did not finish until all the cluster processes (including the client ones) had started. As a result, the attempt to restart the processes on the failed and then restored node failed because the node::process::start rule instance was 'already running'.

Now we finish the rule instance as soon as there are no more processes left to start on the node, or as soon as some previously started processes on the node have already failed (due to the node failure, for example).
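
For illustration only, the new completion condition can be sketched as a small pure function. This is a simplified model, not the actual Halon rule code: the names `ProcState`, `RuleStep` and `nextStep` are hypothetical, and the real rule works on resource-graph state rather than plain lists.

```haskell
-- Hypothetical, simplified model of the per-node start rule.
-- None of these names come from the Halon code base.
data ProcState = Starting | Online | Failed String
  deriving (Eq, Show)

data RuleStep = KeepWaiting | Finish
  deriving (Eq, Show)

-- | Decide whether the rule instance for a node may finish.
--   'pending' lists the processes still waiting to be started on the node;
--   'started' holds the current states of the processes that this rule
--   instance has already started on the node.
nextStep :: [String] -> [ProcState] -> RuleStep
nextStep pending started
  | null pending         = Finish       -- nothing left to start on the node
  | any isFailed started = Finish       -- a previously started process failed
  | otherwise            = KeepWaiting  -- still work to do, nothing failed
  where
    isFailed (Failed _) = True
    isFailed _          = False
```

With this shape, a node failure in the middle of startup ends the instance instead of leaving it 'already running', so a later restart attempt on the restored node is free to spawn a fresh instance.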

andriytk commented 5 years ago

changed the description

andriytk commented 5 years ago

Ok, fixed.

andriytk commented 5 years ago

There is no such function anymore.

andriytk commented 5 years ago

It's too long already.

andriytk commented 5 years ago

added 3 commits

390b015f - mero-halon: don't try to stop already stopped process
1c713eb8 - HALON-914: fix cluster startup failure on RC failure
2a968aec - HALON-911: fix processes restart after node failure


andriytk commented 5 years ago

changed this line in version 5 of the diff

vvv commented 5 years ago

[optional]

            let exclude p =
                  return . Right $ (rlens fldWaitingProcs %~ fieldMap (filter (/= p))) l

I find the name (right) confusing.

vvv commented 5 years ago

s/Unstarted/NotOnline/ please. (Yes, I'm aware that the function was written by someone else.)

0) The name of this function is misleading. One has to see its documentation or implementation in order to understand what it actually does.

1) We exclude PSOnline, not PSStarting.

-- | Process state. This is a generalisation of what might be reported to Mero.
data ProcessState =
    PSUnknown       -- ^ Process state is not known.
  | PSOffline       -- ^ Process is stopped.
  | PSStarting      -- ^ Process is starting but we have not confirmed started.
  | PSOnline        -- ^ Process is online.
  | PSQuiescing     -- ^ Process is online, but should reject any further requests.
  | PSStopping      -- ^ Process is currently stopping.
  | PSFailed String -- ^ Process has failed, with reason given
  | PSInhibited ProcessState -- ^ Process state is masked by a higher level
                             --   failure.

2) There is no “unstarted” word in English (and ‘un-’ makes an antonym, which in this case would be “stopped”).
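
To make point 1 concrete, here is a minimal sketch of what the function appears to do, assuming it simply drops the processes whose state is `PSOnline`. The `Proc` type and the `notOnline` name are hypothetical stand-ins; the real function queries the resource graph.

```haskell
-- Process states as quoted above; the Eq/Show instances are added here
-- only so that the sketch is self-contained.
data ProcessState
  = PSUnknown
  | PSOffline
  | PSStarting
  | PSOnline
  | PSQuiescing
  | PSStopping
  | PSFailed String
  | PSInhibited ProcessState
  deriving (Eq, Show)

-- | Hypothetical stand-in for a process record: a name plus its state.
type Proc = (String, ProcessState)

-- | Keep every process whose state is not exactly 'PSOnline'.
--   Note that 'PSStarting' processes are kept, which is why "NotOnline"
--   describes the result better than "Unstarted". In this sketch a
--   'PSInhibited PSOnline' process is also kept, since it is not equal
--   to 'PSOnline'.
notOnline :: [Proc] -> [Proc]
notOnline = filter ((/= PSOnline) . snd)
```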

vvv commented 5 years ago

s/Srv/Server/ please.

vvv commented 5 years ago

@andriy.tkachuk marked as a Work In Progress

WIP: prefix in MR's title signifies that this MR should not be landed.

andriytk commented 5 years ago

marked as a Work In Progress

andriytk commented 5 years ago

added 2 commits

bbcf66b8 - HALON-911: fix processes restart after node failure
dc57d8a3 - mero-halon: don't try to stop already stopped process


andriytk commented 5 years ago

added 1 commit

5037acbc - mero-halon: don't try to stop already stopped process


andriytk commented 5 years ago

added 1 commit

4b8b4f07 - HALON-911: fix processes restar after node failure


andriytk commented 5 years ago

assigned to @vvv

andriytk commented 5 years ago

changed the description

andriytk commented 5 years ago

changed title from HALON-911: fix processes restart{-ing-} after node failure to HALON-911: fix processes restart after node failure

vvv commented 5 years ago

resolved all discussions

vvv commented 5 years ago

Not worth it.

vvv commented 5 years ago

merged

vvv commented 5 years ago

s/m0tifs/m0t1fs/

vvv commented 5 years ago

Would you mind either

renaming this function to getNotOnlineSrvProcesses (see http://gitlab.mero.colo.seagate.com/mero/halon/merge_requests/1585#note_8234 for the justification)

or

inlining this code in processStartProcessesOnNode.nodeFailedWith?

andriytk commented 5 years ago

added 7 commits

461903c2...16e7c697 - 4 commits from branch master
c36a2d88 - mero-halon: don't try to stop already stopped process
9b551f14 - HALON-914: fix cluster startup failure on RC failure
c040dc6c - HALON-911: fix processes restart after node failure


andriytk commented 5 years ago

resolved all discussions

andriytk commented 5 years ago

added 1 commit

461903c2 - HALON-911: fix processes restart after node failure


andriytk commented 5 years ago

changed this line in version 6 of the diff

andriytk commented 5 years ago

unmarked as a Work In Progress

Seagate / halon

HALON-911: fix processes restart after node failure #1587