apple / swift-cluster-membership

Distributed Membership Protocol implementations in Swift
https://apple.github.io/swift-cluster-membership/
Apache License 2.0

timeout for a ping initiated by `pingReq` should be shorter than initiator's ping timeout #6

Closed ktoso closed 4 years ago

ktoso commented 4 years ago

Reported by @avolokhov

In the current protocol, we issue indirect ping requests when the initial ping fails.

For these indirect ping requests we use a timeout calculated as static timeout * LHA multiplier on the ping issuer node. The indirect pings themselves use a timeout calculated with the same formula, but on the indirect pinger nodes. We expect an indirect pinger node to send a nack if its ping attempt fails. In practice, if the LHA multiplier on the indirect pinger node is the same or higher, the nack is issued only after the pingReq timeout on the initial pinger node has expired, which makes the LHA multiplier effectively useless. We can't fix this by just applying a shortening multiplier on the indirect prober node, because the LHA multiplier on that node can be higher than on the initial pinger node.
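The timing problem described above can be modelled in a few lines of Swift. This is only a sketch: `TimeoutModel` and all of its members are hypothetical names, not part of the library; the LHA multiplier is assumed to be an integer factor of at least 1; network latency is ignored.

```swift
/// Hypothetical model of the timeouts discussed in this issue.
/// None of these names come from swift-cluster-membership.
struct TimeoutModel {
    /// Static probe timeout in milliseconds, assumed identical on all nodes.
    let staticTimeoutMs: Int

    /// Timeout a node applies: static timeout * that node's own LHA multiplier (assumed >= 1).
    func timeoutMs(lhaMultiplier: Int) -> Int {
        staticTimeoutMs * lhaMultiplier
    }

    /// True when the relay's ping times out (so its nack can be sent) strictly before
    /// the initiator's pingReq timeout expires. `relayPercent` models a shortening
    /// factor applied on the relay, expressed in percent (100 means no shortening).
    func nackCanArriveInTime(initiatorMultiplier: Int,
                             relayMultiplier: Int,
                             relayPercent: Int = 100) -> Bool {
        let relayTimeout = timeoutMs(lhaMultiplier: relayMultiplier) * relayPercent / 100
        return relayTimeout < timeoutMs(lhaMultiplier: initiatorMultiplier)
    }
}

let model = TimeoutModel(staticTimeoutMs: 300)

// Same multiplier on both nodes: 600 ms vs 600 ms, the nack is too late.
let equalCase = model.nackCanArriveInTime(initiatorMultiplier: 2, relayMultiplier: 2)

// Higher multiplier on the relay: 900 ms vs 600 ms, the nack is too late.
let higherCase = model.nackCanArriveInTime(initiatorMultiplier: 2, relayMultiplier: 3)

// Shortening on the relay does not help when its multiplier is higher: 720 ms vs 600 ms.
let shortenedCase = model.nackCanArriveInTime(initiatorMultiplier: 2,
                                              relayMultiplier: 3,
                                              relayPercent: 80)

// Only a lower multiplier on the relay leaves room for the nack: 300 ms vs 900 ms.
let lowerCase = model.nackCanArriveInTime(initiatorMultiplier: 3, relayMultiplier: 1)
```

Under these assumptions the nack can only arrive in time when the relay's own multiplier happens to be lower than the initiator's, which the initiator cannot control.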

Original: see footnote [5] on page 5 of the Lifeguard paper: https://arxiv.org/pdf/1707.00788.pdf


The memberlist implementation uses scaled probe timeouts for direct ping ack/nack: https://github.com/hashicorp/memberlist/blob/c192837f8fd6d494ac641880d1356804b21503a3/state.go#L305

and unscaled timeouts for indirect ping requests: https://github.com/hashicorp/memberlist/blob/c192837f8fd6d494ac641880d1356804b21503a3/net.go#L584

We can either implement something similar, or go with the solution suggested in the paper and scale the pingReq timeout down further, to 80% of the initial ping timeout.
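A rough Swift sketch of the two options, to make the comparison concrete. Everything here is hypothetical: `IndirectProbePolicy` and its methods are illustrative names, the LHA multiplier is assumed to be an integer factor of at least 1, and the second option assumes the relay can learn the initiator's timeout (for example by carrying it in the pingReq), which the issue does not specify.

```swift
/// Hypothetical helper comparing the two proposals; not part of swift-cluster-membership.
struct IndirectProbePolicy {
    /// Static probe timeout in milliseconds, assumed identical on all nodes.
    let staticTimeoutMs: Int

    /// Initiator's ping timeout: static timeout * initiator's LHA multiplier (assumed >= 1).
    func initiatorTimeoutMs(initiatorMultiplier: Int) -> Int {
        staticTimeoutMs * initiatorMultiplier
    }

    /// Option 1 (memberlist-like): the relay ignores its own LHA multiplier
    /// for pings sent on behalf of a pingReq and uses the unscaled timeout.
    func unscaledRelayTimeoutMs() -> Int {
        staticTimeoutMs
    }

    /// Option 2 (paper-like): the relay uses 80% of the initiator's ping timeout.
    /// Assumes the initiator's timeout is made known to the relay.
    func fractionalRelayTimeoutMs(initiatorMultiplier: Int) -> Int {
        initiatorTimeoutMs(initiatorMultiplier: initiatorMultiplier) * 80 / 100
    }
}

let policy = IndirectProbePolicy(staticTimeoutMs: 300)

let initiatorTimeout = policy.initiatorTimeoutMs(initiatorMultiplier: 2)      // 600 ms
let option1 = policy.unscaledRelayTimeoutMs()                                 // 300 ms
let option2 = policy.fractionalRelayTimeoutMs(initiatorMultiplier: 2)         // 480 ms

// Edge case: with an initiator multiplier of 1, option 1 gives the relay the
// same timeout as the initiator (300 ms), while option 2 still leaves a margin (240 ms).
let option1AtOne = policy.unscaledRelayTimeoutMs()
let option2AtOne = policy.fractionalRelayTimeoutMs(initiatorMultiplier: 1)
```

In this simplified model, neither option depends on the relay's own LHA multiplier, which is the property the issue asks for. The two differ in the margin they leave: the 80% variant always stays below the initiator's timeout, while the unscaled variant only does so once the initiator's multiplier exceeds 1.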