Seagate / halon

High availability solution
Apache License 2.0

HALON-877: fix principal RM update on m0d crash #1510

Closed 1468ca0b-2a64-4fb4-8e52-ea5806644b4c closed 5 years ago

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

👍

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

Move this definition to the bottom of the file, where most of the functions are, or to line 851, under the BootLevel definitions.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

  1. [optional] The assortment of (>>=), let ... in and do looks untidy. Suggestion:

    -- | Pick a Principal RM out of the available RM services.
    pickPrincipalRM :: PhaseM RC l (Maybe M0.Service)
    pickPrincipalRM = do
        rg <- getGraph
        let rms = [ svc
                  | proc <- M0.getM0Processes rg
                  , G.isConnected proc Is M0.PSOnline rg
                  , let svcTypes = M0.s_type <$> G.connectedTo proc M0.IsParentOf rg
                  , CST_CONFD `elem` svcTypes
                  , svc :: M0.Service <- G.connectedTo proc M0.IsParentOf rg
                  , M0.s_type svc == CST_RMS
                  ]
        Log.rcLog' Log.DEBUG $ "available RM services: " ++ show rms
        traverse setPrincipalRMIfUnset (listToMaybe rms)
  2. AFAIU, the elements of rms may have different ServiceState. Shouldn't we improve the implementation so that it tries to find M0.SSOnline RM service?
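
Point 2 could look roughly like the sketch below. It uses a toy model rather than Halon's resource graph: `Service`, `ServiceState`, `sState` and `pickRM` are hypothetical stand-ins for illustration, not the real `M0` API. The idea is only to prefer an online RM service and fall back to any candidate otherwise.

```haskell
import Control.Applicative ((<|>))
import Data.List (find)
import Data.Maybe (listToMaybe)

-- Toy stand-ins for M0.ServiceState and M0.Service (hypothetical, simplified).
data ServiceState = SSOffline | SSStarting | SSOnline
  deriving (Eq, Show)

data Service = Service
  { sName  :: String
  , sState :: ServiceState
  } deriving (Eq, Show)

-- | Prefer the first online RM service; if none is online, fall back to
-- the first candidate (or Nothing when there are no candidates at all).
pickRM :: [Service] -> Maybe Service
pickRM rms = find ((== SSOnline) . sState) rms <|> listToMaybe rms
```

Whether the fallback to a non-online service is desirable depends on the caller; it is kept here only because the sketch mirrors the existing `listToMaybe rms` behaviour.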

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

most of the other functions in this file are monadic, so it will break the style pattern.

That g suffix still feels like an eyesore.

👇 How about this?

principalRM :: G.Graph -> Maybe M0.Service
principalRM rg = case G.connectedFrom Is M0.PrincipalRM rg of
  Just svc | M0.getState svc rg == M0.SSOnline -> Just svc
  _ -> Nothing
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

Ok.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

Yes, because the Entrypoint Reply should always be sent, even when the cluster is not fully booted yet, and it should contain some RM in any case. That is my understanding of how it used to work.

(Thanks for the note about the parentheses; I removed them.)

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

The "good" is meant in this context only: good enough to try to get the Online RM service.

Anyway, will change it as you suggest.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

I tried the 1st variant before, but without the rg argument (point-free style). It did not work for some reason, so I gave up on it. :)

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

Ok.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

Ok.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

Ok.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

The thing is - most of the other functions in this file are monadic, so it will break the style pattern.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

Getting nothing is possible. :) Anyway, will do as you suggest.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

The name of the function is misleading, because you are not returning a BootLevel. AFAICS, getM0* functions return resource objects and let the user unwrap them.

Consider replacing it with getM0BootLevelValue.

getM0BootLevelValue :: G.Graph -> Maybe Int
getM0BootLevelValue = fmap M0.unBootLevel . G.connectedTo Cluster M0.RunLevel
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

1) You don't need parentheses around Just svc.

2)

not (isGoodBootLevel g) || Just svc == getPrincipalRMg g

means “BootLevel is not connected to Cluster || BootLevel 0 is connected to Cluster || svc is PrincipalRM”.

Is this what you need? You want the resulting list to contain parameters of RM services that haven't started yet?
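
To make the three cases concrete, here is a small sketch on a toy model. The `Graph` record below is a hypothetical simplification (the real `G.Graph` is a resource graph), and `isGoodBootLevel` is assumed to mean "a boot level is connected and it is greater than 0", which is what the reading above implies. `keepRM` is an illustrative name for the predicate under review.

```haskell
-- Toy model (hypothetical): the graph is reduced to the two facts
-- that the predicate actually reads.
data Graph = Graph
  { bootLevel   :: Maybe Int     -- Nothing: no boot level connected to the cluster
  , principalRM :: Maybe String  -- currently elected principal RM, if any
  }

-- Assumed definition, inferred from the review comment.
isGoodBootLevel :: Graph -> Bool
isGoodBootLevel = maybe False (> 0) . bootLevel

getPrincipalRMg :: Graph -> Maybe String
getPrincipalRMg = principalRM

-- | The predicate under review: keep svc if the cluster has not passed
-- boot level 0 yet, or if svc is the principal RM.
keepRM :: Graph -> String -> Bool
keepRM g svc = not (isGoodBootLevel g) || Just svc == getPrincipalRMg g
```

With this model, `keepRM (Graph Nothing Nothing) "rm-a"` and `keepRM (Graph (Just 0) (Just "rm-b")) "rm-a"` are both `True`, while `keepRM (Graph (Just 1) (Just "rm-b")) "rm-a"` is `False`.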

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

There is nothing good or bad about boot level.

-- | Process boot level.
--   This is used both to tag processes (to indicate when they should start/stop)
--   and to tag the cluster (to indicate which processes it's valid to try to
--   start/stop).
--   Given a cluster run level of x, it is valid to start a process with a
--   boot level of <= x. So at level 0 we may start confd processes, at level
--   1 we may start IOS etc as well as confd processes.
-- Currently:
--   * 0 - confd
--   * 1 - other
newtype BootLevel = BootLevel { unBootLevel :: Int }
  deriving (Eq, Ord, Show, Generic, Hashable, Typeable, FromJSON, ToJSON)

I would suggest renaming this function to confdsHaveStarted, or rmsHaveStarted, or principalRMisElected.

rmsHaveStarted :: G.Graph -> Bool
rmsHaveStarted = maybe False ((> 0) . M0.unBootLevel) . getM0BootLevel
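
The rule in the BootLevel comment quoted above (a process may start when its boot level is <= the cluster run level) can be sketched as follows. This is a standalone illustration: `mayStart`, `confdLevel` and `otherLevel` are hypothetical names, and the newtype is redeclared here with fewer derived instances so that the snippet compiles on its own.

```haskell
newtype BootLevel = BootLevel { unBootLevel :: Int }
  deriving (Eq, Ord, Show)

-- | Given the cluster run level, may a process tagged with procLevel start?
-- The derived Ord instance compares the wrapped Int values.
mayStart :: BootLevel -> BootLevel -> Bool
mayStart clusterLevel procLevel = procLevel <= clusterLevel

-- The two levels listed in the comment: 0 for confd, 1 for everything else.
confdLevel, otherLevel :: BootLevel
confdLevel = BootLevel 0
otherLevel = BootLevel 1
```

Here `mayStart confdLevel otherLevel` is `False` (at run level 0 only confd may start), and `mayStart otherLevel confdLevel` is `True`.
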
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

getM0BootLevel rg = unBootLevel <$> G.connectedTo R.Cluster RunLevel rg

or

getM0BootLevel = fmap unBootLevel . G.connectedTo R.Cluster RunLevel
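
The two variants are equivalent, but the first one cannot be made point-free just by dropping `rg`. The sketch below shows this on a toy model; `Graph` and `runLevel` are hypothetical stand-ins for `G.Graph` and `G.connectedTo R.Cluster RunLevel`, not the real API.

```haskell
newtype BootLevel = BootLevel { unBootLevel :: Int }

-- Toy stand-in for the resource graph (hypothetical).
type Graph = [(String, BootLevel)]

-- Toy stand-in for G.connectedTo R.Cluster RunLevel.
runLevel :: Graph -> Maybe BootLevel
runLevel = lookup "cluster"

-- Pointful: <$> maps over the Maybe result.
levelPointful :: Graph -> Maybe Int
levelPointful rg = unBootLevel <$> runLevel rg

-- Point-free: the mapping has to be composed with the lookup.
levelPointFree :: Graph -> Maybe Int
levelPointFree = fmap unBootLevel . runLevel

-- Simply dropping rg from the pointful form does not type-check:
--   unBootLevel <$> runLevel
-- Here <$> maps over the function itself, i.e. it means
-- unBootLevel . runLevel, which applies unBootLevel to a Maybe BootLevel.
```

This may be one reason a point-free attempt fails to compile, though other causes are possible.
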
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

s/RMS/RM services/

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

s/RMS/PrincipalRM/

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

getPrincipalRMM = getPrincipalRM <$> getGraph
1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

This g suffix is weird. Convention is to use M suffix for monadic function and no suffix for pure function. Suggestion: rename getPrincipalRM -> getPrincipalRMM, getPrincipalRMg -> getPrincipalRM.
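
A minimal sketch of that convention, on a toy model: `Graph` and `PhaseM` below are hypothetical, heavily simplified stand-ins (the real `PhaseM` is a full monad with more type parameters), kept just rich enough for `<$>` to work.

```haskell
-- Toy stand-in for the resource graph (hypothetical).
newtype Graph = Graph { principal :: Maybe String }

-- A reader-like wrapper standing in for PhaseM; only Functor is
-- needed for <$>, so that is the only instance defined here.
newtype PhaseM a = PhaseM { runPhaseM :: Graph -> a }

instance Functor PhaseM where
  fmap f (PhaseM g) = PhaseM (f . g)

getGraph :: PhaseM Graph
getGraph = PhaseM id

-- Pure accessor: no suffix.
getPrincipalRM :: Graph -> Maybe String
getPrincipalRM = principal

-- Monadic wrapper around the pure accessor: M suffix.
getPrincipalRMM :: PhaseM (Maybe String)
getPrincipalRMM = getPrincipalRM <$> getGraph
```

Keeping the logic in the pure function and deriving the monadic one from it means the pure version can be used inside list comprehensions and guards, and tested without running the monad.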

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

As I mentioned above, the second update of the RM, when the node comes back Online, has a bad effect on the cluster (Mero processes crash, see MERO-2876). But, as it turned out, the crashes happen even when the RM is not updated a second time (after my latest patch).

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

[optional] s/got // (because you may "get" Nothing)

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

I've noticed that the principal RM is updated again when the crashed m0d comes back alive. This has a strange effect on the cluster: many m0d processes are restarted and clients may fail altogether. (See the console log below.) So we should probably do something about this before landing the patch. I don't see why the second update of the principal RM is necessary, so maybe we should implement some logic to avoid it.

Console log:

16:16 vagrant@cmu:halon$
16:16 vagrant@cmu:halon$ hctl mero bootstrap
...
16:19 vagrant@cmu:halon$ hctl mero status | more
Cluster disposition: ONLINE
  cluster info:
    SNS pool:   0x6f00000000000001:0xd8 "default"
    DIX pool:   0x6f00000000000001:0x12d
    profile:    0x7000000000000001:0x153

Hosts:
  [   online] 0x6e00000000000001:0xe    client1
  [   online] 0x7200000000000001:0xf      172.28.128.13@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x10       CST_HA
  [   online] 0x7300000000000001:0x11       CST_RMS
  [   online] 0x7200000000000001:0x12     172.28.128.13@tcp:12345:41:301 m0t1fs
  [   online] 0x7300000000000001:0x13       CST_RMS
  [   online] 0x6e00000000000001:0x14   cmu
  [   online] 0x7200000000000001:0x15     172.28.128.5@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x16       CST_HA
  [   online] 0x7300000000000001:0x17       CST_RMS
  [      N/A] 0x7200000000000001:0x18     172.28.128.5@tcp:12345:41:302 clovis-app
  [      N/A] 0x7300000000000001:0x19       CST_RMS
  [      N/A] 0x7200000000000001:0x1a     172.28.128.5@tcp:12345:41:303 clovis-app
  [      N/A] 0x7300000000000001:0x1b       CST_RMS
  [      N/A] 0x7200000000000001:0x1c     172.28.128.5@tcp:12345:41:304 clovis-app
  [      N/A] 0x7300000000000001:0x1d       CST_RMS
  [      N/A] 0x7200000000000001:0x1e     172.28.128.5@tcp:12345:41:305 clovis-app
  [      N/A] 0x7300000000000001:0x1f       CST_RMS
  [   online] 0x6e00000000000001:0x20   ssu1
  [   online] 0x7200000000000001:0x2e     172.28.128.3@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x2f       CST_HA
  [   online] 0x7300000000000001:0x30       CST_RMS
  [   online] 0x7200000000000001:0x31     172.28.128.3@tcp:12345:44:101 confd
  [   online] 0x7300000000000001:0x32       CST_CONFD
  [   online] 0x7300000000000001:0x33       CST_RMS
  [   online] 0x7200000000000001:0x34     172.28.128.3@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0x35       CST_RMS
  [   online] 0x7300000000000001:0x36       CST_IOS
  [   online] 0x7300000000000001:0x37       CST_SNS_REP
  [   online] 0x7300000000000001:0x38       CST_SNS_REB
  [   online] 0x7300000000000001:0x39       CST_ADDB2
  [   online] 0x7300000000000001:0x3a       CST_CAS
  [   online] 0x7300000000000001:0x3b       CST_ISCS
  [   online] 0x6e00000000000001:0x3c   ssu2
  [   online] 0x7200000000000001:0x4a     172.28.128.8@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x4b       CST_HA
  [   online] 0x7300000000000001:0x4c       CST_RMS
  [   online] 0x7200000000000001:0x4d     172.28.128.8@tcp:12345:44:101 confd
  [   online] 0x7300000000000001:0x4e       CST_CONFD
  [   online] 0x7300000000000001:0x4f       CST_RMS
  [   online] 0x7200000000000001:0x50     172.28.128.8@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0x51       CST_RMS
  [   online] 0x7300000000000001:0x52       CST_IOS
  [   online] 0x7300000000000001:0x53       CST_SNS_REP
  [   online] 0x7300000000000001:0x54       CST_SNS_REB
  [   online] 0x7300000000000001:0x55       CST_ADDB2
  [   online] 0x7300000000000001:0x56       CST_CAS
  [   online] 0x7300000000000001:0x57       CST_ISCS
  [   online] 0x6e00000000000001:0x58   ssu3
  [   online] 0x7200000000000001:0x66     172.28.128.7@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x67       CST_HA
  [   online] 0x7300000000000001:0x68       CST_RMS
  [   online] 0x7200000000000001:0x69     172.28.128.7@tcp:12345:44:101 confd
  [   online] 0x7300000000000001:0x6a       CST_CONFD
  [   online] 0x7300000000000001:0x6b       CST_RMS
  [   online] 0x7200000000000001:0x6c     172.28.128.7@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0x6d       CST_RMS
  [   online] 0x7300000000000001:0x6e       CST_IOS
  [   online] 0x7300000000000001:0x6f       CST_SNS_REP
  [   online] 0x7300000000000001:0x70       CST_SNS_REB
  [   online] 0x7300000000000001:0x71       CST_ADDB2
  [   online] 0x7300000000000001:0x72       CST_CAS
  [   online] 0x7300000000000001:0x73       CST_ISCS
  [   online] 0x6e00000000000001:0x74   ssu4
  [   online] 0x7200000000000001:0x82     172.28.128.10@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x83       CST_HA
  [   online] 0x7300000000000001:0x84       CST_RMS
  [   online] 0x7200000000000001:0x85     172.28.128.10@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0x86       CST_RMS
  [   online] 0x7300000000000001:0x87       CST_IOS
  [   online] 0x7300000000000001:0x88       CST_SNS_REP
  [   online] 0x7300000000000001:0x89       CST_SNS_REB
  [   online] 0x7300000000000001:0x8a       CST_ADDB2
  [   online] 0x7300000000000001:0x8b       CST_CAS
  [   online] 0x7300000000000001:0x8c       CST_ISCS
  [   online] 0x6e00000000000001:0x8d   ssu5
  [   online] 0x7200000000000001:0x9b     172.28.128.11@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x9c       CST_HA
  [   online] 0x7300000000000001:0x9d       CST_RMS
  [   online] 0x7200000000000001:0x9e     172.28.128.11@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0x9f       CST_RMS
  [   online] 0x7300000000000001:0xa0       CST_IOS
  [   online] 0x7300000000000001:0xa1       CST_SNS_REP
  [   online] 0x7300000000000001:0xa2       CST_SNS_REB
  [   online] 0x7300000000000001:0xa3       CST_ADDB2
  [   online] 0x7300000000000001:0xa4       CST_CAS
  [   online] 0x7300000000000001:0xa5       CST_ISCS
  [   online] 0x6e00000000000001:0xa6   ssu6
  [   online] 0x7200000000000001:0xb4     172.28.128.12@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0xb5       CST_HA
  [   online] 0x7300000000000001:0xb6       CST_RMS
  [   online] 0x7200000000000001:0xb7     172.28.128.12@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0xb8       CST_RMS
  [   online] 0x7300000000000001:0xb9       CST_IOS
  [   online] 0x7300000000000001:0xba       CST_SNS_REP
  [   online] 0x7300000000000001:0xbb       CST_SNS_REB
  [   online] 0x7300000000000001:0xbc       CST_ADDB2
  [   online] 0x7300000000000001:0xbd       CST_CAS
  [   online] 0x7300000000000001:0xbe       CST_ISCS
  [   online] 0x6e00000000000001:0xbf   ssu7
  [   online] 0x7200000000000001:0xcd     172.28.128.9@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0xce       CST_HA
  [   online] 0x7300000000000001:0xcf       CST_RMS
  [   online] 0x7200000000000001:0xd0     172.28.128.9@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0xd1       CST_RMS
  [   online] 0x7300000000000001:0xd2       CST_IOS
  [   online] 0x7300000000000001:0xd3       CST_SNS_REP
  [   online] 0x7300000000000001:0xd4       CST_SNS_REB
  [   online] 0x7300000000000001:0xd5       CST_ADDB2
  [   online] 0x7300000000000001:0xd6       CST_CAS
  [   online] 0x7300000000000001:0xd7       CST_ISCS
16:20 vagrant@cmu:halon$ ssh ssu2.local 'sudo pkill -9 halond; sudo systemctl stop halond; sudo pkill -9 m0d'
16:20 vagrant@cmu:halon$ pdsh -w cmu.local,ssu[1-7].local,client1.local sudo journalctl -u halond --since today | grep entry | sort -k 4 | tail -1
client1: Jan 31 16:18:53 client1 halond[7984]: Thu Jan 31 16:18:53 UTC 2019 pid://172.28.128.13:9070:0:131: ha_entrypoint: succeeded: SpielAddress {sa_confds_fid = [0x7300000000000001:0x4e,0x7300000000000001:0x6a,0x7300000000000001:0x32], sa_confds_ep = ["172.28.128.8@tcp:12345:44:101","172.28.128.7@tcp:12345:44:101","172.28.128.3@tcp:12345:44:101"], sa_rm_fid = 0x7300000000000001:0x4f, sa_rm_ep = "172.28.128.8@tcp:12345:44:101", sa_quorum = 2}
16:20 vagrant@cmu:halon$
16:20 vagrant@cmu:halon$
16:20 vagrant@cmu:halon$ pdsh -w cmu.local,ssu[1-7].local,client1.local sudo journalctl -u halond --since today | grep entry | sort -k 4 | tail -1
client1: Jan 31 16:18:53 client1 halond[7984]: Thu Jan 31 16:18:53 UTC 2019 pid://172.28.128.13:9070:0:131: ha_entrypoint: succeeded: SpielAddress {sa_confds_fid = [0x7300000000000001:0x4e,0x7300000000000001:0x6a,0x7300000000000001:0x32], sa_confds_ep = ["172.28.128.8@tcp:12345:44:101","172.28.128.7@tcp:12345:44:101","172.28.128.3@tcp:12345:44:101"], sa_rm_fid = 0x7300000000000001:0x4f, sa_rm_ep = "172.28.128.8@tcp:12345:44:101", sa_quorum = 2}
16:20 vagrant@cmu:halon$
16:20 vagrant@cmu:halon$
16:20 vagrant@cmu:halon$ hctl mero status | more
Cluster disposition: ONLINE
  cluster info:
    SNS pool:   0x6f00000000000001:0xd8 "default"
    DIX pool:   0x6f00000000000001:0x12d
    profile:    0x7000000000000001:0x153

Hosts:
  [   online] 0x6e00000000000001:0xe    client1
  [   online] 0x7200000000000001:0xf      172.28.128.13@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x10       CST_HA
  [   online] 0x7300000000000001:0x11       CST_RMS
  [   online] 0x7200000000000001:0x12     172.28.128.13@tcp:12345:41:301 m0t1fs
  [   online] 0x7300000000000001:0x13       CST_RMS
  [   online] 0x6e00000000000001:0x14   cmu
  [   online] 0x7200000000000001:0x15     172.28.128.5@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x16       CST_HA
  [   online] 0x7300000000000001:0x17       CST_RMS
  [      N/A] 0x7200000000000001:0x18     172.28.128.5@tcp:12345:41:302 clovis-app
  [      N/A] 0x7300000000000001:0x19       CST_RMS
  [      N/A] 0x7200000000000001:0x1a     172.28.128.5@tcp:12345:41:303 clovis-app
  [      N/A] 0x7300000000000001:0x1b       CST_RMS
  [      N/A] 0x7200000000000001:0x1c     172.28.128.5@tcp:12345:41:304 clovis-app
  [      N/A] 0x7300000000000001:0x1d       CST_RMS
  [      N/A] 0x7200000000000001:0x1e     172.28.128.5@tcp:12345:41:305 clovis-app
  [      N/A] 0x7300000000000001:0x1f       CST_RMS
  [   online] 0x6e00000000000001:0x20   ssu1
  [   online] 0x7200000000000001:0x2e     172.28.128.3@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x2f       CST_HA
  [   online] 0x7300000000000001:0x30       CST_RMS
  [   online] 0x7200000000000001:0x31     172.28.128.3@tcp:12345:44:101 confd
  [   online] 0x7300000000000001:0x32       CST_CONFD
  [   online] 0x7300000000000001:0x33       CST_RMS
  [   online] 0x7200000000000001:0x34     172.28.128.3@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0x35       CST_RMS
  [   online] 0x7300000000000001:0x36       CST_IOS
  [   online] 0x7300000000000001:0x37       CST_SNS_REP
  [   online] 0x7300000000000001:0x38       CST_SNS_REB
  [   online] 0x7300000000000001:0x39       CST_ADDB2
  [   online] 0x7300000000000001:0x3a       CST_CAS
  [   online] 0x7300000000000001:0x3b       CST_ISCS
  [   failed] 0x6e00000000000001:0x3c   ssu2
                Extended state: failed(recoverable)
  [inhibited] 0x7200000000000001:0x4a     172.28.128.8@tcp:12345:34:101 halon
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x4b       CST_HA
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x4c       CST_RMS
                Extended state: inhibited (online)
  [inhibited] 0x7200000000000001:0x4d     172.28.128.8@tcp:12345:44:101 confd
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x4e       CST_CONFD
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x4f       CST_RMS
                Extended state: inhibited (online)
  [inhibited] 0x7200000000000001:0x50     172.28.128.8@tcp:12345:41:401 ioservice
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x51       CST_RMS
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x52       CST_IOS
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x53       CST_SNS_REP
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x54       CST_SNS_REB
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x55       CST_ADDB2
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x56       CST_CAS
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x57       CST_ISCS
                Extended state: inhibited (online)
  [   online] 0x6e00000000000001:0x58   ssu3
  [   online] 0x7200000000000001:0x66     172.28.128.7@tcp:12345:34:101 halon
16:21 vagrant@cmu:halon$ pdsh -w cmu.local,ssu[1-7].local,client1.local sudo journalctl -u halond --since today | grep entry | sort -k 4 | tail -1
ssu4: Jan 31 16:21:16 ssu4 halond[7991]: Thu Jan 31 16:21:16 UTC 2019 pid://172.28.128.10:9070:0:381: ha_entrypoint: succeeded: SpielAddress {sa_confds_fid = [0x7300000000000001:0x4e,0x7300000000000001:0x6a,0x7300000000000001:0x32], sa_confds_ep = ["172.28.128.8@tcp:12345:44:101","172.28.128.7@tcp:12345:44:101","172.28.128.3@tcp:12345:44:101"], sa_rm_fid = 0x7300000000000001:0x6b, sa_rm_ep = "172.28.128.7@tcp:12345:44:101", sa_quorum = 2}
16:21 vagrant@cmu:halon$ ssh ssu2.local sudo systemctl restart halond
16:22 vagrant@cmu:halon$
16:22 vagrant@cmu:halon$
16:22 vagrant@cmu:halon$ hctl mero status | more
Cluster disposition: ONLINE
  cluster info:
    SNS pool:   0x6f00000000000001:0xd8 "default"
    DIX pool:   0x6f00000000000001:0x12d
    profile:    0x7000000000000001:0x153

Hosts:
  [   failed] 0x6e00000000000001:0xe    client1
                Extended state: failed(recoverable)
  [inhibited] 0x7200000000000001:0xf      172.28.128.13@tcp:12345:34:101 halon
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x10       CST_HA
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x11       CST_RMS
                Extended state: inhibited (online)
  [inhibited] 0x7200000000000001:0x12     172.28.128.13@tcp:12345:41:301 m0t1fs
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x13       CST_RMS
                Extended state: inhibited (online)
  [   online] 0x6e00000000000001:0x14   cmu
  [   online] 0x7200000000000001:0x15     172.28.128.5@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x16       CST_HA
  [   online] 0x7300000000000001:0x17       CST_RMS
  [      N/A] 0x7200000000000001:0x18     172.28.128.5@tcp:12345:41:302 clovis-app
  [      N/A] 0x7300000000000001:0x19       CST_RMS
  [      N/A] 0x7200000000000001:0x1a     172.28.128.5@tcp:12345:41:303 clovis-app
  [      N/A] 0x7300000000000001:0x1b       CST_RMS
  [      N/A] 0x7200000000000001:0x1c     172.28.128.5@tcp:12345:41:304 clovis-app
  [      N/A] 0x7300000000000001:0x1d       CST_RMS
  [      N/A] 0x7200000000000001:0x1e     172.28.128.5@tcp:12345:41:305 clovis-app
  [      N/A] 0x7300000000000001:0x1f       CST_RMS
  [   online] 0x6e00000000000001:0x20   ssu1
  [   online] 0x7200000000000001:0x2e     172.28.128.3@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x2f       CST_HA
  [   online] 0x7300000000000001:0x30       CST_RMS
  [   online] 0x7200000000000001:0x31     172.28.128.3@tcp:12345:44:101 confd
  [   online] 0x7300000000000001:0x32       CST_CONFD
  [   online] 0x7300000000000001:0x33       CST_RMS
  [quiescing] 0x7200000000000001:0x34     172.28.128.3@tcp:12345:41:401 ioservice
  [inhibited] 0x7300000000000001:0x35       CST_RMS
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x36       CST_IOS
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x37       CST_SNS_REP
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x38       CST_SNS_REB
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x39       CST_ADDB2
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x3a       CST_CAS
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x3b       CST_ISCS
                Extended state: inhibited (starting)
  [   online] 0x6e00000000000001:0x3c   ssu2
  [   online] 0x7200000000000001:0x4a     172.28.128.8@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x4b       CST_HA
  [   online] 0x7300000000000001:0x4c       CST_RMS
  [   online] 0x7200000000000001:0x4d     172.28.128.8@tcp:12345:44:101 confd
  [   online] 0x7300000000000001:0x4e       CST_CONFD
  [   online] 0x7300000000000001:0x4f       CST_RMS
  [ starting] 0x7200000000000001:0x50     172.28.128.8@tcp:12345:41:401 ioservice
  [ starting] 0x7300000000000001:0x51       CST_RMS
  [ starting] 0x7300000000000001:0x52       CST_IOS
  [ starting] 0x7300000000000001:0x53       CST_SNS_REP
  [ starting] 0x7300000000000001:0x54       CST_SNS_REB
  [ starting] 0x7300000000000001:0x55       CST_ADDB2
  [ starting] 0x7300000000000001:0x56       CST_CAS
  [ starting] 0x7300000000000001:0x57       CST_ISCS
  [   online] 0x6e00000000000001:0x58   ssu3
  [   online] 0x7200000000000001:0x66     172.28.128.7@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x67       CST_HA
  [   online] 0x7300000000000001:0x68       CST_RMS
  [   online] 0x7200000000000001:0x69     172.28.128.7@tcp:12345:44:101 confd
  [   online] 0x7300000000000001:0x6a       CST_CONFD
  [   online] 0x7300000000000001:0x6b       CST_RMS
  [ starting] 0x7200000000000001:0x6c     172.28.128.7@tcp:12345:41:401 ioservice
  [ starting] 0x7300000000000001:0x6d       CST_RMS
  [ starting] 0x7300000000000001:0x6e       CST_IOS
  [ starting] 0x7300000000000001:0x6f       CST_SNS_REP
  [ starting] 0x7300000000000001:0x70       CST_SNS_REB
  [ starting] 0x7300000000000001:0x71       CST_ADDB2
  [ starting] 0x7300000000000001:0x72       CST_CAS
  [ starting] 0x7300000000000001:0x73       CST_ISCS
  [   online] 0x6e00000000000001:0x74   ssu4
  [   online] 0x7200000000000001:0x82     172.28.128.10@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x83       CST_HA
  [   online] 0x7300000000000001:0x84       CST_RMS
  [quiescing] 0x7200000000000001:0x85     172.28.128.10@tcp:12345:41:401 ioservice
  [inhibited] 0x7300000000000001:0x86       CST_RMS
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x87       CST_IOS
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x88       CST_SNS_REP
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x89       CST_SNS_REB
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x8a       CST_ADDB2
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x8b       CST_CAS
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0x8c       CST_ISCS
                Extended state: inhibited (starting)
  [   online] 0x6e00000000000001:0x8d   ssu5
  [   online] 0x7200000000000001:0x9b     172.28.128.11@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x9c       CST_HA
  [   online] 0x7300000000000001:0x9d       CST_RMS
  [ starting] 0x7200000000000001:0x9e     172.28.128.11@tcp:12345:41:401 ioservice
  [ starting] 0x7300000000000001:0x9f       CST_RMS
  [ starting] 0x7300000000000001:0xa0       CST_IOS
  [ starting] 0x7300000000000001:0xa1       CST_SNS_REP
  [ starting] 0x7300000000000001:0xa2       CST_SNS_REB
  [ starting] 0x7300000000000001:0xa3       CST_ADDB2
  [ starting] 0x7300000000000001:0xa4       CST_CAS
  [ starting] 0x7300000000000001:0xa5       CST_ISCS
  [   online] 0x6e00000000000001:0xa6   ssu6
  [   online] 0x7200000000000001:0xb4     172.28.128.12@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0xb5       CST_HA
  [   online] 0x7300000000000001:0xb6       CST_RMS
  [quiescing] 0x7200000000000001:0xb7     172.28.128.12@tcp:12345:41:401 ioservice
  [inhibited] 0x7300000000000001:0xb8       CST_RMS
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0xb9       CST_IOS
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0xba       CST_SNS_REP
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0xbb       CST_SNS_REB
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0xbc       CST_ADDB2
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0xbd       CST_CAS
                Extended state: inhibited (starting)
  [inhibited] 0x7300000000000001:0xbe       CST_ISCS
                Extended state: inhibited (starting)
  [   online] 0x6e00000000000001:0xbf   ssu7
  [   online] 0x7200000000000001:0xcd     172.28.128.9@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0xce       CST_HA
  [   online] 0x7300000000000001:0xcf       CST_RMS
  [ starting] 0x7200000000000001:0xd0     172.28.128.9@tcp:12345:41:401 ioservice
  [ starting] 0x7300000000000001:0xd1       CST_RMS
  [ starting] 0x7300000000000001:0xd2       CST_IOS
  [ starting] 0x7300000000000001:0xd3       CST_SNS_REP
  [ starting] 0x7300000000000001:0xd4       CST_SNS_REB
  [ starting] 0x7300000000000001:0xd5       CST_ADDB2
  [ starting] 0x7300000000000001:0xd6       CST_CAS
  [ starting] 0x7300000000000001:0xd7       CST_ISCS
16:28 vagrant@cmu:halon$ pdsh -w cmu.local,ssu[1-7].local,client1.local sudo journalctl -u halond --since today | grep entry | sort -k 4 | tail -1
ssu5: Jan 31 16:28:03 ssu5 halond[7996]: Thu Jan 31 16:28:03 UTC 2019 pid://172.28.128.11:9070:0:528: ha_entrypoint: succeeded: SpielAddress {sa_confds_fid = [0x7300000000000001:0x4e,0x7300000000000001:0x6a,0x7300000000000001:0x32], sa_confds_ep = ["172.28.128.8@tcp:12345:44:101","172.28.128.7@tcp:12345:44:101","172.28.128.3@tcp:12345:44:101"], sa_rm_fid = 0x7300000000000001:0x4f, sa_rm_ep = "172.28.128.8@tcp:12345:44:101", sa_quorum = 2}
16:28 vagrant@cmu:halon$
16:39 vagrant@cmu:halon$ pdsh -w cmu.local,ssu[1-7].local,client1.local sudo journalctl -u halond --since today | grep entry | sort -k 4 | tail -1
ssu6: Jan 31 16:34:01 ssu6 halond[8024]: Thu Jan 31 16:34:01 UTC 2019 pid://172.28.128.12:9070:0:540: ha_entrypoint: succeeded: SpielAddress {sa_confds_fid = [0x7300000000000001:0x4e,0x7300000000000001:0x6a,0x7300000000000001:0x32], sa_confds_ep = ["172.28.128.8@tcp:12345:44:101","172.28.128.7@tcp:12345:44:101","172.28.128.3@tcp:12345:44:101"], sa_rm_fid = 0x7300000000000001:0x4f, sa_rm_ep = "172.28.128.8@tcp:12345:44:101", sa_quorum = 2}
16:39 vagrant@cmu:halon$
16:39 vagrant@cmu:halon$
16:39 vagrant@cmu:halon$ hctl mero status | more
Cluster disposition: ONLINE
  cluster info:
    SNS pool:   0x6f00000000000001:0xd8 "default"
    DIX pool:   0x6f00000000000001:0x12d
    profile:    0x7000000000000001:0x153
    Filesystem stats:
      Total space:    90,194,313,216
      Free space:     90,194,313,216
      Total segments:  3,765,216,448
      Free segments:   3,763,334,328

Hosts:
  [   failed] 0x6e00000000000001:0xe    client1
                Extended state: failed(recoverable)
  [inhibited] 0x7200000000000001:0xf      172.28.128.13@tcp:12345:34:101 halon
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x10       CST_HA
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x11       CST_RMS
                Extended state: inhibited (online)
  [inhibited] 0x7200000000000001:0x12     172.28.128.13@tcp:12345:41:301 m0t1fs
                Extended state: inhibited (online)
  [inhibited] 0x7300000000000001:0x13       CST_RMS
                Extended state: inhibited (online)
  [   online] 0x6e00000000000001:0x14   cmu
  [   online] 0x7200000000000001:0x15     172.28.128.5@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x16       CST_HA
  [   online] 0x7300000000000001:0x17       CST_RMS
  [      N/A] 0x7200000000000001:0x18     172.28.128.5@tcp:12345:41:302 clovis-app
  [      N/A] 0x7300000000000001:0x19       CST_RMS
  [      N/A] 0x7200000000000001:0x1a     172.28.128.5@tcp:12345:41:303 clovis-app
  [      N/A] 0x7300000000000001:0x1b       CST_RMS
  [      N/A] 0x7200000000000001:0x1c     172.28.128.5@tcp:12345:41:304 clovis-app
  [      N/A] 0x7300000000000001:0x1d       CST_RMS
  [      N/A] 0x7200000000000001:0x1e     172.28.128.5@tcp:12345:41:305 clovis-app
  [      N/A] 0x7300000000000001:0x1f       CST_RMS
  [   online] 0x6e00000000000001:0x20   ssu1
  [   online] 0x7200000000000001:0x2e     172.28.128.3@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x2f       CST_HA
  [   online] 0x7300000000000001:0x30       CST_RMS
  [   online] 0x7200000000000001:0x31     172.28.128.3@tcp:12345:44:101 confd
  [   online] 0x7300000000000001:0x32       CST_CONFD
  [   online] 0x7300000000000001:0x33       CST_RMS
  [   online] 0x7200000000000001:0x34     172.28.128.3@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0x35       CST_RMS
  [   online] 0x7300000000000001:0x36       CST_IOS
  [   online] 0x7300000000000001:0x37       CST_SNS_REP
  [   online] 0x7300000000000001:0x38       CST_SNS_REB
  [   online] 0x7300000000000001:0x39       CST_ADDB2
  [   online] 0x7300000000000001:0x3a       CST_CAS
  [   online] 0x7300000000000001:0x3b       CST_ISCS
  [   online] 0x6e00000000000001:0x3c   ssu2
  [   online] 0x7200000000000001:0x4a     172.28.128.8@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x4b       CST_HA
  [   online] 0x7300000000000001:0x4c       CST_RMS
  [   online] 0x7200000000000001:0x4d     172.28.128.8@tcp:12345:44:101 confd
  [   online] 0x7300000000000001:0x4e       CST_CONFD
  [   online] 0x7300000000000001:0x4f       CST_RMS
  [   online] 0x7200000000000001:0x50     172.28.128.8@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0x51       CST_RMS
  [   online] 0x7300000000000001:0x52       CST_IOS
  [   online] 0x7300000000000001:0x53       CST_SNS_REP
  [   online] 0x7300000000000001:0x54       CST_SNS_REB
  [   online] 0x7300000000000001:0x55       CST_ADDB2
  [   online] 0x7300000000000001:0x56       CST_CAS
  [   online] 0x7300000000000001:0x57       CST_ISCS
  [   online] 0x6e00000000000001:0x58   ssu3
  [   online] 0x7200000000000001:0x66     172.28.128.7@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x67       CST_HA
  [   online] 0x7300000000000001:0x68       CST_RMS
  [   online] 0x7200000000000001:0x69     172.28.128.7@tcp:12345:44:101 confd
  [   online] 0x7300000000000001:0x6a       CST_CONFD
  [   online] 0x7300000000000001:0x6b       CST_RMS
  [   online] 0x7200000000000001:0x6c     172.28.128.7@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0x6d       CST_RMS
  [   online] 0x7300000000000001:0x6e       CST_IOS
  [   online] 0x7300000000000001:0x6f       CST_SNS_REP
  [   online] 0x7300000000000001:0x70       CST_SNS_REB
  [   online] 0x7300000000000001:0x71       CST_ADDB2
  [   online] 0x7300000000000001:0x72       CST_CAS
  [   online] 0x7300000000000001:0x73       CST_ISCS
  [   online] 0x6e00000000000001:0x74   ssu4
  [   online] 0x7200000000000001:0x82     172.28.128.10@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x83       CST_HA
  [   online] 0x7300000000000001:0x84       CST_RMS
  [   online] 0x7200000000000001:0x85     172.28.128.10@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0x86       CST_RMS
  [   online] 0x7300000000000001:0x87       CST_IOS
  [   online] 0x7300000000000001:0x88       CST_SNS_REP
  [   online] 0x7300000000000001:0x89       CST_SNS_REB
  [   online] 0x7300000000000001:0x8a       CST_ADDB2
  [   online] 0x7300000000000001:0x8b       CST_CAS
  [   online] 0x7300000000000001:0x8c       CST_ISCS
  [   online] 0x6e00000000000001:0x8d   ssu5
  [   online] 0x7200000000000001:0x9b     172.28.128.11@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0x9c       CST_HA
  [   online] 0x7300000000000001:0x9d       CST_RMS
  [   online] 0x7200000000000001:0x9e     172.28.128.11@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0x9f       CST_RMS
  [   online] 0x7300000000000001:0xa0       CST_IOS
  [   online] 0x7300000000000001:0xa1       CST_SNS_REP
  [   online] 0x7300000000000001:0xa2       CST_SNS_REB
  [   online] 0x7300000000000001:0xa3       CST_ADDB2
  [   online] 0x7300000000000001:0xa4       CST_CAS
  [   online] 0x7300000000000001:0xa5       CST_ISCS
  [   online] 0x6e00000000000001:0xa6   ssu6
  [   online] 0x7200000000000001:0xb4     172.28.128.12@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0xb5       CST_HA
  [   online] 0x7300000000000001:0xb6       CST_RMS
  [   online] 0x7200000000000001:0xb7     172.28.128.12@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0xb8       CST_RMS
  [   online] 0x7300000000000001:0xb9       CST_IOS
  [   online] 0x7300000000000001:0xba       CST_SNS_REP
  [   online] 0x7300000000000001:0xbb       CST_SNS_REB
  [   online] 0x7300000000000001:0xbc       CST_ADDB2
  [   online] 0x7300000000000001:0xbd       CST_CAS
  [   online] 0x7300000000000001:0xbe       CST_ISCS
  [   online] 0x6e00000000000001:0xbf   ssu7
  [   online] 0x7200000000000001:0xcd     172.28.128.9@tcp:12345:34:101 halon
  [   online] 0x7300000000000001:0xce       CST_HA
  [   online] 0x7300000000000001:0xcf       CST_RMS
  [   online] 0x7200000000000001:0xd0     172.28.128.9@tcp:12345:41:401 ioservice
  [   online] 0x7300000000000001:0xd1       CST_RMS
  [   online] 0x7300000000000001:0xd2       CST_IOS
  [   online] 0x7300000000000001:0xd3       CST_SNS_REP
  [   online] 0x7300000000000001:0xd4       CST_SNS_REB
  [   online] 0x7300000000000001:0xd5       CST_ADDB2
  [   online] 0x7300000000000001:0xd6       CST_CAS
  [   online] 0x7300000000000001:0xd7       CST_ISCS
16:39 vagrant@cmu:halon$

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

Yes.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

The idea was to use the same code for getting the principal RM.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

You mean that it's not possible for an online process to host an offline service? Yeah, this makes sense...

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

Yes, this name will do. 👌

Notes:

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

Or just getPrincipalRM' (Haskell way)?

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

Done.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

I like it with the case, will do. How about a getPrincipalRMfrom rg variant?
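
To illustrate the naming question, here is a minimal sketch of the case-with-guard form under the proposed name, written against a toy model rather than Halon's real resource graph. The `Graph`, `Service` and `ServiceState` types below are hypothetical stand-ins, not the actual `G.Graph` and `M0.Service`; only the shape of the function follows the suggestion discussed in this thread.

```haskell
-- Toy stand-ins for the resource graph types (hypothetical, not Halon's API).
data ServiceState = SSOnline | SSOffline
  deriving (Eq, Show)

data Service = Service
  { svcFid   :: String        -- service identifier, kept as plain text here
  , svcState :: ServiceState
  } deriving (Eq, Show)

-- | The toy "graph" only records which service, if any, is marked principal.
newtype Graph = Graph { principal :: Maybe Service }

-- | Pure lookup that takes the graph explicitly: return the service marked
-- as principal RM, but only if it is online.
getPrincipalRMfrom :: Graph -> Maybe Service
getPrincipalRMfrom rg = case principal rg of
  Just svc | svcState svc == SSOnline -> Just svc
  _ -> Nothing
```

A monadic `getPrincipalRM` could then be a thin wrapper that fetches the graph and applies this function, which keeps a single implementation for both callers.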

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: andriytk

  1. I like it, will do. Thanks.
  2. There is a check for the parent process to be Online, so I'd prefer to not touch it for now.
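
The reasoning in point 2 can be sketched on a simplified model: an RM service only becomes a candidate if its parent process is online and also hosts a confd service, so the process-level check already filters out RMs of crashed processes. The types and the names `candidateRMs` and `pickRM` below are assumptions made for illustration; they mirror the list comprehension suggested in the review but do not use Halon's graph API.

```haskell
import Data.Maybe (listToMaybe)

-- Simplified stand-ins for the process and service types (hypothetical).
data ProcessState = PSOnline | PSOffline
  deriving (Eq, Show)

data ServiceType = CST_CONFD | CST_RMS | CST_IOS | CST_HA
  deriving (Eq, Show)

data Service = Service
  { sFid  :: String
  , sType :: ServiceType
  } deriving (Eq, Show)

data Process = Process
  { pState    :: ProcessState
  , pServices :: [Service]
  } deriving (Eq, Show)

-- | RM services hosted by online processes that also run confd.
candidateRMs :: [Process] -> [Service]
candidateRMs procs =
  [ svc
  | p <- procs
  , pState p == PSOnline                     -- the parent process check
  , CST_CONFD `elem` map sType (pServices p) -- only confd processes qualify
  , svc <- pServices p
  , sType svc == CST_RMS
  ]

-- | Pick the first available candidate, if any.
pickRM :: [Process] -> Maybe Service
pickRM = listToMaybe . candidateRMs
```

With a crashed confd process first in the list and a healthy one after it, `pickRM` skips the RM of the crashed process and returns the RM of the healthy one. An RM hosted by an ioservice process is never considered, because that process does not run confd.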