Seagate / halon

High availability solution
Apache License 2.0

HALON-892: fix cluster stop timeout during SNS repair/rebalance #1560

Closed andriytk closed 5 years ago

andriytk commented 5 years ago

The problem was that we tried to execute the delayed abort of SNS repair/rebalance after the cluster had already stopped, when no m0d processes (including m0d-confd) were available any more.

Now we don't schedule the abort operation at all if the cluster is not Online. Mero aborts the SNS operation by itself during cluster stop.
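
The guard described above can be illustrated with a minimal sketch. This is not Halon's actual code (Halon is written in Haskell). The names `ClusterState`, `maybe_schedule_abort`, the `schedule` callback, and the default delay are hypothetical and exist only for illustration.

```python
from enum import Enum, auto
from typing import Callable


class ClusterState(Enum):
    """Hypothetical cluster states; only ONLINE matters for the guard."""
    ONLINE = auto()
    STOPPING = auto()
    OFFLINE = auto()


def maybe_schedule_abort(
    cluster_state: ClusterState,
    schedule: Callable[[float], None],
    delay_s: float = 30.0,  # assumed delay, not taken from the source
) -> bool:
    """Schedule the delayed SNS abort only while the cluster is Online.

    Returns True if the abort was scheduled, False if it was skipped.
    """
    if cluster_state is not ClusterState.ONLINE:
        # The cluster is stopping or stopped: there may be no m0d
        # processes left to talk to, so do not schedule anything.
        return False
    schedule(delay_s)
    return True
```

With this shape, a caller that is stopping the cluster never enqueues the delayed abort, so nothing is left waiting on processes that no longer exist.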

andriytk commented 5 years ago

merged

andriytk commented 5 years ago

enabled an automatic merge when the pipeline for 10a98b5d6b89ed7804d34a7db706573aeb388fec succeeds

chumakd commented 5 years ago

added 7 commits

andriytk commented 5 years ago

assigned to @andriy.tkachuk

mssawant commented 5 years ago

okay, looks fine to me.

mssawant commented 5 years ago

resolved all discussions

mssawant commented 5 years ago

Okay.

mssawant commented 5 years ago

Okay.

andriytk commented 5 years ago

The same: we are repairing, but we get a rebalance message instead. That is how it looks to me. I think such situations should not happen at all.

BTW, it is the old code which I don't touch, only fixing the debug msg.

andriytk commented 5 years ago

I believe it means that we got a repair message while a rebalance was in progress.
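
A rough sketch of the check behind that debug message, for illustration only. The function name `mismatch_message`, the string operation labels, and the use of `M0_NC_REBALANCE` as the rebalance counterpart are assumptions; only the quoted log text comes from this thread.

```python
from typing import Optional

# Assumed mapping from the operation in progress to the notification
# state expected for it.
EXPECTED_STATE = {"repair": "M0_NC_REPAIR", "rebalance": "M0_NC_REBALANCE"}
# Wording used in the debug message for each operation.
IN_PROGRESS = {"repair": "repairing", "rebalance": "rebalancing"}


def mismatch_message(current_op: str, received_state: str) -> Optional[str]:
    """Return a debug message if the notification does not match the
    operation currently running on the pool, otherwise None."""
    if received_state == EXPECTED_STATE[current_op]:
        return None
    return f"Got {received_state} but pools is {IN_PROGRESS[current_op]} now."
```

Called as `mismatch_message("rebalance", "M0_NC_REPAIR")`, this produces the message asked about in the review; a matching notification produces no message.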

mssawant commented 5 years ago

Similarly, not sure about the log message.

mssawant commented 5 years ago

Not sure what the log message means: "Got M0_NC_REPAIR but pools is rebalancing now." Does it mean there was a failure during rebalance?

andriytk commented 5 years ago

assigned to @mandar.sawant

andriytk commented 5 years ago

added 4 commits

andriytk commented 5 years ago

changed the description

andriytk commented 5 years ago

unmarked as a Work In Progress

andriytk commented 5 years ago

@mandar.sawant please review.