Seagate / halon

High availability solution
Apache License 2.0

HALON-911: fix processes restart after node failure #1587

Closed andriytk closed 5 years ago

andriytk commented 5 years ago

On cluster startup, a node's processes could sometimes get stuck in a failed state with 'node failure' status. This happened because the node::process::start rule instance did not finish until all the cluster processes (including the client ones) had started. As a result, an attempt to restart the processes on a node that had failed and been restored would fail, because the node::process::start rule instance was 'already running'.

Now the rule instance finishes as soon as there are no more processes left to start on the node, or as soon as some previously started process on the node has failed (due to the node failure, for example).
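For readers who have not seen the rule, the new termination condition can be sketched roughly as below. This is only an illustration, not the code from the diff: `PState`, `Decision` and `decide` are hypothetical names, and `PState` is a cut-down stand-in for Halon's real `ProcessState`.

```haskell
-- Hypothetical, simplified process state (the real ProcessState has more constructors).
data PState = Offline | Starting | Online | Failed String
  deriving (Eq, Show)

-- Hypothetical result type: what the rule instance should do next.
data Decision = KeepWaiting | Finish
  deriving (Eq, Show)

-- | Decide whether the node-level start rule instance may finish.
--
-- 'waiting' holds the names of processes still to be started on this node;
-- 'started' holds the current states of processes already started by this instance.
decide :: [String] -> [PState] -> Decision
decide waiting started
  | null waiting         = Finish       -- nothing left to start on the node
  | any isFailed started = Finish       -- an already started process has failed
  | otherwise            = KeepWaiting  -- still work to do, nothing failed yet
  where
    isFailed (Failed _) = True
    isFailed _          = False
```

Finishing on the second condition is what would let a later restart attempt create a fresh rule instance instead of hitting 'already running'.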

andriytk commented 5 years ago

changed the description

andriytk commented 5 years ago

Ok, fixed.

andriytk commented 5 years ago

There is no such function anymore.

andriytk commented 5 years ago

It's too long already.

andriytk commented 5 years ago

added 3 commits

andriytk commented 5 years ago

changed this line in version 5 of the diff

vvv commented 5 years ago

[optional]

            let exclude p =
                  return . Right $ (rlens fldWaitingProcs %~ fieldMap (filter (/= p))) l

I find the name (right) confusing.

vvv commented 5 years ago

s/Unstarted/NotOnline/ please. (Yes, I'm aware that the function was written by someone else.)

0) The name of this function is misleading. One has to see its documentation or implementation in order to understand what it actually does.

1) We exclude PSOnline, not PSStarting.

-- | Process state. This is a generalisation of what might be reported to Mero.
data ProcessState =
    PSUnknown       -- ^ Process state is not known.
  | PSOffline       -- ^ Process is stopped.
  | PSStarting      -- ^ Process is starting but we have not confirmed started.
  | PSOnline        -- ^ Process is online.
  | PSQuiescing     -- ^ Process is online, but should reject any further requests.
  | PSStopping      -- ^ Process is currently stopping.
  | PSFailed String -- ^ Process has failed, with reason given
  | PSInhibited ProcessState -- ^ Process state is masked by a higher level
                             --   failure.

2) There is no “unstarted” word in English (and ‘un-’ makes an antonym, which in this case would be “stopped”).
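To make point 1 concrete, here is a minimal sketch of a filter that excludes only `PSOnline`. The name `notOnlineProcesses` and the `(String, ProcessState)` pair representation are assumptions for illustration; the real function works on Halon's resource graph. Treating `PSInhibited PSOnline` as not online is also an assumption of this sketch.

```haskell
-- Copy of the ProcessState type quoted above, with Eq/Show added for the sketch.
data ProcessState =
    PSUnknown
  | PSOffline
  | PSStarting
  | PSOnline
  | PSQuiescing
  | PSStopping
  | PSFailed String
  | PSInhibited ProcessState
  deriving (Eq, Show)

-- | Names of all processes whose state is anything other than 'PSOnline'.
-- Hypothetical helper: processes are modelled as (name, state) pairs.
notOnlineProcesses :: [(String, ProcessState)] -> [String]
notOnlineProcesses ps = [name | (name, st) <- ps, st /= PSOnline]
```

With this definition a process in `PSStarting` is still returned, which is why "NotOnline" describes the result more accurately than "Unstarted".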

vvv commented 5 years ago

s/Srv/Server/ please.

vvv commented 5 years ago

@andriy.tkachuk marked as a Work In Progress

The "WIP:" prefix in an MR's title signifies that the MR should not be landed.

andriytk commented 5 years ago

marked as a Work In Progress

andriytk commented 5 years ago

added 2 commits

andriytk commented 5 years ago

added 1 commit

andriytk commented 5 years ago

added 1 commit

andriytk commented 5 years ago

assigned to @vvv

andriytk commented 5 years ago

changed the description

andriytk commented 5 years ago

changed title from "HALON-911: fix processes restarting after node failure" to "HALON-911: fix processes restart after node failure"

vvv commented 5 years ago

resolved all discussions

vvv commented 5 years ago

Not worth it.

vvv commented 5 years ago

merged

vvv commented 5 years ago

s/m0tifs/m0t1fs/

vvv commented 5 years ago

Would you mind to

or

andriytk commented 5 years ago

added 7 commits

andriytk commented 5 years ago

resolved all discussions

andriytk commented 5 years ago

added 1 commit

andriytk commented 5 years ago

changed this line in version 6 of the diff

andriytk commented 5 years ago

unmarked as a Work In Progress