eclipse / paho.mqtt.golang

Other
2.73k stars 533 forks source link

fix: fix keep-alive timeouts on small intervals #667

Closed lefinal closed 7 months ago

lefinal commented 7 months ago

The check interval calculation in the ping.go file has been adjusted for when the keep-alive time is less than or equal to 10. Instead of dividing by 2, the check interval now is calculated with a division by 4, ensuring a more frequent ping check. This is required as MQTT brokers typically detect a timeout if the client did not send a ping within the specified keep-alive interval times 1.5. If the client only checks for due pings every half-interval, this commonly leads to timeouts. An example: a 5 second interval means that every 2.5 seconds checks are made for due pings. However, if checks have occurred at 2.49 seconds since the last ping and 4.99 seconds, the next one might happen slightly after 7.5 seconds, making the broker detect a timeout. Therefore, we increase the frequency of ping checks by reducing the check interval.

lefinal commented 7 months ago

Oh and in case this isn't already clear: This might be considered a critical bug as the library is nearly unusable with keep-alive timeouts below 10-11 seconds. So I'd be very happy if this is released as soon as possible (: Thanks guys!

MattBrittan commented 7 months ago

Thanks for the fix.

MQTT brokers typically detect a timeout if the client did not send a ping within the specified keep-alive interval times 1.5.

This is mandated in the spec "...within one and a half times the Keep Alive time period, it MUST disconnect...".

An example: a 5 second interval means that every 2.5 seconds checks are made for due pings. However, if checks have occurred at 2.49 seconds since the last ping and 4.99 seconds, the next one might happen slightly after 7.5 seconds, making the broker detect a timeout. Therefore, we increase the frequency of ping checks by reducing the check interval.

This is likely to be fairly common given that pings are sent/checked on the same ticker i.e.

0.00 - timer fired
0.01 - ping sent (triggered by same timer so generally happens quickly)
2.50 - timer fired (no ping due; assume response received)
5.00 - timer fired (no ping due - 5.00 - 0.01 = 4.99)
7.50 - Timer fired (ping due; but this is now 7.49s after the last ping, so quite likely broker will drop connection)

This might be considered a critical bug

The algorithm (with minor changes) has been in place since 2017, so this has obviously not been too much of an issue (I suspect that very few users require such frequent pings). As such, I'm accepting the PR but don't intend to issue a release because of it (you can just use go get github.com/eclipse/paho.mqtt.golang@master for the time being).

Note that the v5 client uses a different approach ,that I believe will result in more accurate checks. It might be worth considering moving to that model here (but the code there is pretty new so no where near as well tested as this).