bytebeamio / rumqtt

The MQTT ecosystem in rust
Apache License 2.0
1.6k stars 247 forks source link

Connection lost to reconnect on Network error is always about 2 minutes? #295

Open korken89 opened 3 years ago

korken89 commented 3 years ago

Hi,

When testing how a listener based on this library reacts to connection losses, I have noticed that I cannot configure the timeout associated to network timeouts. It reconnects fine, but it always takes about 130 seconds as the following log shows:

2021-06-30 15:34:41,744 DEBUG [rumqttc::state] Pingreq,
            last incoming packet before 3945 millisecs,
2021-06-30 15:34:46,745 TRACE [mio::poll] deregistering event source from poller
2021-06-30 15:34:46,745 DEBUG [mqtt_logger] MQTT error, will try to reconnect when possible: AwaitPingResp
2021-06-30 15:34:46,745 DEBUG [rustls::anchors] add_pem_file processed 1 valid and 0 invalid certs
2021-06-30 15:36:56,931 TRACE [mio::poll] deregistering event source from poller
2021-06-30 15:36:56,931 DEBUG [mqtt_logger] Network error, will try to reconnect when possible: Io(Os { code: 110, kind: TimedOut, message: "Connection timed out" })

Is there a way to reduce this timeout? For example using the tool mosquitto_sub reconnects after about 10 seconds.

korken89 commented 3 years ago

After more testing it seems like the delay is (mqttoptions.set_keep_alive + 1min) * 2

tekjar commented 3 years ago

Can I know how you are simulating disconnections and what's the keep alive you are using? If you are disconnecting your wifi or pulling your ethernet cable, tcp will just assume that this is congestion. Rumqttc detecting idleness in keep alive time will trigger ping and waits for ping response. Before sending next ping, if previous ping hasn't received a response, client will assume that this is a half open connection and then trigger reconnection. So this typically takes upto 2 keep alive times.

korken89 commented 3 years ago

Hi @tekjar thanks for you response!

I have set mqttoptions.set_keep_alive(5), and I have tested this by connecting and disconnecting from my network using a VPN. The 2 keep alive times you refer to are within the time, but this extra minute comes from something else. I cloned rumqttc and played around a bit, and it seems like the TcpStream calls via tokio need 1 minute to timeout.

If you have ideas, I'm happy to give them a try! As we're not seeing the issue with mosquitto_sub I think there is some case being missed right now.

tekjar commented 3 years ago

Let me try this as well. May I know which broker you are using?

korken89 commented 3 years ago

I'm testing against mosquitto from the Arch repositories (systemctl start mosquitto). You need to disable the VPN about 10s-20s, if one does it faster this effect does not happen.

korken89 commented 3 years ago

If you want me to log something, or screen capture, or ... just give me a shout!

tekjar commented 3 years ago

Sure! I'll let you know if I'm not able to reproduce this

datdenkikniet commented 3 years ago

Has any progress been made on this issue? We're still seeing the same problem.

TotalKrill commented 3 years ago

Its similar, the 1minute timout is gone, so I assume that the ping responses are not handled in time? Or that they do not get time to get sent? We only see the issue when we have quite a high load on the MQTT client.

image

tekjar commented 3 years ago

Unfortunately I couldn't find time to RCA this

TotalKrill commented 3 years ago

I think I managed to mitigate this issue by reducing the time spent in handling the incoming Publishes.

I immediately send the incoming publishes over a channel to another thread, suggesting something is not 100% with the registering of the ping responses maybe?