Seagate / halon

High availability solution
Apache License 2.0

HALON-900: fix multi-nodes TS cluster bootstrap regression #1562

Closed andriytk closed 5 years ago

andriytk commented 5 years ago

The lease timeout that we increased in commit 8d71b602 is also used by the ambassador to ping the replicas and determine the leader. As a result, the leader determination time increased along with the RC startup time, and the satellites' startup timed out during the bootstrap.

Now we simply hard-code the initial ambassador timeout to 1 sec. This seems to be enough to determine the leader quickly and reliably during the bootstrap, and it also breaks the dependency of this determination process on the lease timeout configuration.
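To illustrate why the ambassador timeout matters, here is a rough Python model. It is not Halon code: the names `Replica`, `time_to_find_leader` and `INITIAL_AMBASSADOR_TIMEOUT_S` are hypothetical, the sequential probing order is an assumption, and the 30 sec lease timeout is only an example value. It shows how each unresponsive replica costs one full ping timeout before the leader is found.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical constant mirroring the fixed 1 sec initial ambassador timeout.
INITIAL_AMBASSADOR_TIMEOUT_S = 1.0


@dataclass
class Replica:
    name: str
    reply_delay_s: Optional[float]  # None means the replica never answers
    is_leader: bool = False


def time_to_find_leader(replicas, ping_timeout_s):
    """Return (leader_name, elapsed_seconds) for a sequential probe.

    Assumption: replicas are pinged one after another, and a replica that
    does not answer within ping_timeout_s costs the full timeout.
    """
    elapsed = 0.0
    for replica in replicas:
        no_reply = (replica.reply_delay_s is None
                    or replica.reply_delay_s > ping_timeout_s)
        if no_reply:
            elapsed += ping_timeout_s
            continue
        elapsed += replica.reply_delay_s
        if replica.is_leader:
            return replica.name, elapsed
    return None, elapsed


# Two replicas are not up yet during bootstrap; the third one is the leader.
replicas = [
    Replica("r1", None),
    Replica("r2", None),
    Replica("r3", 0.5, is_leader=True),
]

# Ping timeout tied to an (example) 30 sec lease timeout.
slow = time_to_find_leader(replicas, 30.0)
# Ping timeout fixed at 1 sec, independent of the lease timeout.
fast = time_to_find_leader(replicas, INITIAL_AMBASSADOR_TIMEOUT_S)
print(slow, fast)
```

In this model, raising the lease timeout directly inflates the time spent waiting on replicas that are not up yet, which is the kind of delay that can push the satellites' startup past its own deadline. With the fixed timeout, the wait no longer depends on the lease configuration.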

andriytk commented 5 years ago

merged

mssawant commented 5 years ago

Okay. Looks good to me.

andriytk commented 5 years ago

Yes, the lease timeout was increased because I/O sometimes cannot complete within the timeout. As a result, Paxos cannot complete and the RC restarts. You can see a lot of such restarts in the Halon decision log and system log before commit 8d71b602, here in the GitLab test artefacts.

mssawant commented 5 years ago

Okay, so the lease timeout was increased earlier because the Halon keepalives were not processed in time under heavy I/O load, is that right?

andriytk commented 5 years ago

assigned to @mandar.sawant

andriytk commented 5 years ago

@mandar.sawant could you review it, please?

andriytk commented 5 years ago

added 1 commit

andriytk commented 5 years ago

added 1 commit

andriytk commented 5 years ago

changed title from "HALON-900: fix multi-nodes TS cluster bootstrap" to "HALON-900: fix multi-nodes TS cluster bootstrap regression"

andriytk commented 5 years ago

added 1 commit

andriytk commented 5 years ago

added 1 commit

andriytk commented 5 years ago

changed the description