In the current protocol, we issue indirect ping requests when the initial ping fails.
For these indirect ping requests, we use a timeout calculated as static timeout * LHA multiplier on the ping issuer node.
The indirect pings themselves use a timeout calculated with the same formula, but on the indirect pinger nodes. We expect indirect pinger nodes to send a nack if their ping attempt fails. In reality, if the LHA multiplier is the same or higher on the indirect pinger node, the nack is issued only after the pingReq timeout on the initial pinger node has expired, which makes the LHA multiplier effectively useless.
To fix this, we can't just apply a shortening multiplier on the indirect pinger node, because the LHA multiplier on that node can be higher than on the initial pinger node.
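To make the timing problem concrete, here is a minimal sketch in Python. It assumes a hypothetical static timeout of 300 ms and ignores network latency; the function names (`scaled_timeout`, `nack_arrives_in_time`) and the `shortening` parameter are made up for illustration and do not come from the actual implementation.

```python
def scaled_timeout(base_ms: float, lha_multiplier: int) -> float:
    """Timeout as described above: static timeout * LHA multiplier."""
    return base_ms * lha_multiplier


def nack_arrives_in_time(base_ms: float, issuer_lha: int, indirect_lha: int,
                         shortening: float = 1.0) -> bool:
    """True if the indirect pinger's nack fires before the issuer's
    pingReq timeout expires (network latency is ignored)."""
    issuer_deadline = scaled_timeout(base_ms, issuer_lha)
    # The indirect pinger sends a nack only once its own ping timeout expires.
    nack_at = scaled_timeout(base_ms, indirect_lha) * shortening
    return nack_at < issuer_deadline


BASE_MS = 300.0  # hypothetical static timeout

# Same multiplier on both nodes: the nack fires exactly at the issuer's deadline.
same = nack_arrives_in_time(BASE_MS, issuer_lha=2, indirect_lha=2)
# Higher multiplier on the indirect pinger: the nack is even later.
higher = nack_arrives_in_time(BASE_MS, issuer_lha=1, indirect_lha=3)
# A shortening factor alone does not help when the indirect multiplier is higher.
shortened = nack_arrives_in_time(BASE_MS, issuer_lha=1, indirect_lha=3,
                                 shortening=0.8)
print(same, higher, shortened)  # prints: False False False
```

The nack only arrives in time when the indirect pinger's multiplier happens to be lower than the issuer's, which neither node can rely on.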
We may either implement something similar to the memberlist implementation, or go with the solution suggested in the Lifeguard paper and further scale the pingReq timeout down to 80% of the initial ping timeout.
Reported by @avolokhov
Original: see footnote [5] on page 5 of the Lifeguard paper: https://arxiv.org/pdf/1707.00788.pdf
The memberlist implementation uses scaled probe timeouts for the direct ping ack/nack: https://github.com/hashicorp/memberlist/blob/c192837f8fd6d494ac641880d1356804b21503a3/state.go#L305
and unscaled timeouts for indirect ping requests: https://github.com/hashicorp/memberlist/blob/c192837f8fd6d494ac641880d1356804b21503a3/net.go#L584
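As a rough illustration of the asymmetry described above (a scaled timeout on the probing node, an unscaled one on the node handling the indirect ping request), here is a small Python sketch. It is not memberlist's code: the function names, the 500 ms static timeout, and the optional `nack_factor` (set to 0.8 here to mirror the 80% idea) are assumptions for illustration, and network latency is ignored.

```python
def issuer_deadline(base_ms: float, issuer_lha: int) -> float:
    """Issuer side: static timeout scaled by the local LHA multiplier."""
    return base_ms * issuer_lha


def indirect_nack_time(base_ms: float, nack_factor: float = 1.0) -> float:
    """Indirect pinger side: unscaled static timeout, optionally shortened.
    It does not depend on the indirect pinger's own LHA multiplier."""
    return base_ms * nack_factor


BASE_MS = 500.0  # hypothetical static timeout

for issuer_lha in (1, 2, 4):
    deadline = issuer_deadline(BASE_MS, issuer_lha)
    nack_at = indirect_nack_time(BASE_MS, nack_factor=0.8)
    # The nack fires before the issuer gives up for every multiplier >= 1.
    print(issuer_lha, deadline, nack_at, nack_at < deadline)
```

Note that with `nack_factor=1.0` and an issuer multiplier of 1, the nack time equals the issuer's deadline, so some shortening is still needed in that case.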