Seagate / halon

High availability solution
Apache License 2.0
1 stars 0 forks source link

WIP: HALON-855: Inconsistent Mero drive status #1521

Closed 1468ca0b-2a64-4fb4-8e52-ea5806644b4c closed 5 years ago

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Add tentative fix.

TODO

vvv commented 5 years ago

closed

rajanikantchirmade commented 5 years ago

But it only change "PoolRepairStatus" not drive status.

rajanikantchirmade commented 5 years ago

marked the task abort of SNS operation should reset states of involved storage devices [context] as incomplete

rajanikantchirmade commented 5 years ago

marked the task abort of SNS operation should reset states of involved storage devices [context] as completed

vvv commented 5 years ago

changed the description

vvv commented 5 years ago

marked as a Work In Progress

vvv commented 5 years ago
  1. Where PoolRepairStatus used to be set before this change?

I've found out that in the original code (master) PoolRepairStatus is being set in mkRepairStartOperation.mkRepairOperationStarted.

Sorry for the noise.

rajanikantchirmade commented 5 years ago
  1. Where PoolRepairStatus used to be set before this change?

We did not changed code related to this. But we can check it on master by manually triggering abort using hctl command.

  1. Shouldn't similar change be made in ruleRebalanceStart?

Yes, you are right.

rajanikantchirmade commented 5 years ago

Yes, this fix need to cover rebalance too.

vvv commented 5 years ago

@rajanikant.chirmade Please answer the remaining comments and propose a solution for the “sdevs remain in SDSInhibited SDSRepairing state” problem.

vvv commented 5 years ago

added 1 commit

Compare with previous version

vvv commented 5 years ago

changed this line in version 7 of the diff

vvv commented 5 years ago

changed this line in version 7 of the diff

vvv commented 5 years ago

@rajanikant.chirmade ?

vvv commented 5 years ago

added 1 commit

Compare with previous version

vvv commented 5 years ago

changed this line in version 6 of the diff

vvv commented 5 years ago

changed this line in version 6 of the diff

vvv commented 5 years ago

changed this line in version 6 of the diff

vvv commented 5 years ago

[nit] Use blank lines to separate sections of code.

vvv commented 5 years ago
  1. Where PoolRepairStatus used to be set before this change?
  2. Shouldn't similar change be made in ruleRebalanceStart?
vvv commented 5 years ago

added 1 commit

Compare with previous version

vvv commented 5 years ago

changed this line in version 5 of the diff

vvv commented 5 years ago

changed this line in version 5 of the diff

vvv commented 5 years ago

changed this line in version 5 of the diff

vvv commented 5 years ago

Document.

vvv commented 5 years ago
                        , getConfObjState sdev rg `elem` [ M0_NC_REPAIR
                                                         , M0_NC_REBALANCE
                                                         ]

?

vvv commented 5 years ago

What about M0_NC_REBALANCE? I think we want to abort it as well. In fact, we want to abort any ongoing SNS operation.

vvv commented 5 years ago

Since abortRepairFromProc is called for any timed-out process (see ruleProcessKeepaliveReply), it is quite possible that proc does not host CST_IOS service. In this case pools will be null and Log.DEBUG "Repair not running" message will be printed.

Are we okay with that?

vvv commented 5 years ago

added 1 commit

Compare with previous version

vvv commented 5 years ago

changed this line in version 4 of the diff

vvv commented 5 years ago

changed this line in version 4 of the diff

vvv commented 5 years ago

changed this line in version 4 of the diff

vvv commented 5 years ago

Prelude. is not needed.

vvv commented 5 years ago

Suggestion: add a type signature.

    getProcs :: [(Fid, M0.TimeSpec)] -> Graph -> [(M0.Process, M0.TimeSpec)]
    getProcs fids rg = [ (p, t)
                       | (fid, t) <- fids
                       , Just p <- [M0.lookupConfObjByFid fid rg]
                       ]
vvv commented 5 years ago

[nit] Remove braces.

vvv commented 5 years ago

added 3 commits

Compare with previous version

vvv commented 5 years ago

assigned to @rajanikant.chirmade

vvv commented 5 years ago

@rajanikant.chirmade You've mentioned some problem with this patch. Can you describe what exactly is not working?

AFAIU, the $M0_CLUSTER file you test against looks like this:

clovis-apps: [ client1.local ]
confds:      [ cmu.local ]
ssus:
  - host:  ssu1.local
    disks: /dev/sd[b-h]
  - host:  ssu2.local
    disks: /dev/sd[b-h]

How do you test the patch? What happens? What is the expected behaviour?

vvv commented 5 years ago

uuid is unnecessary, we can do fine without auxiliary identifier.

vvv commented 5 years ago

fromJust is dangerous and should generally be avoided. It will throw runtime error if prs is Nothing.

vvv commented 5 years ago

Use explicit import list, i.e.

import           HA.RecoveryCoordinator.Castor.Drive.Rules.Repair (abortRepairFromProc)
vvv commented 5 years ago

[nit] Parentheses are not needed here.

vvv commented 5 years ago

added 6 commits

Compare with previous version

chumakd commented 5 years ago

assigned to @valery.vorotyntsev

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: rajanikantchirmade

Yes, this need to be fix. Cherry picked your patch. Thanks.

1468ca0b-2a64-4fb4-8e52-ea5806644b4c commented 5 years ago

Created by: vvv

What if several sdevs of the proc are being repaired (M0_NC_REPAIR)? What if two of those sdevs belong different pools?