eclipse-paho / paho.mqtt.java

Eclipse Paho Java MQTT client library. Paho is an Eclipse IoT project.
https://eclipse.org/paho
Other
2.15k stars 893 forks source link

Too many connections / short SO_TIMEOUT / client disconnects often #637

Open tomekjarosik opened 5 years ago

tomekjarosik commented 5 years ago

Please fill out the form below before submitting, thank you!

Version 1.2.1 introduced this change: https://github.com/eclipse/paho.mqtt.java/commit/d5379c3477119f9bdd59069271ddd21efb8c9628#diff-4facee5fd67e0fc8ce39f657db10c617 @rehanshaukat

This means WebSocketReceiver.java will get java.net.SocketTimeoutException when trying to read https://github.com/eclipse/paho.mqtt.java/blob/57c2c1e9830d8a8036bbb2513e22738c121a254f/org.eclipse.paho.client.mqttv3/src/main/java/org/eclipse/paho/client/mqttv3/internal/websocket/WebSocketReceiver.java#L103.

The exception will be caught later and the thread will be forced to stop: https://github.com/eclipse/paho.mqtt.java/blob/57c2c1e9830d8a8036bbb2513e22738c121a254f/org.eclipse.paho.client.mqttv3/src/main/java/org/eclipse/paho/client/mqttv3/internal/websocket/WebSocketReceiver.java#L118

Because of that, the connection will often ends early, resulting with repeated connections attempts. And one client can create many more connection attempts than previous 1.2.0 version. The 1.2.1 puts much more pressure on the server-side. This can be reproduced with simple unit-test (pseudo-code):

 MQTTClient client = new MQTTClient(WEBSOCKET_URI).connect().subscribe("topic");
 assertTrue(client.isConnected());
 Thread.sleep(10000);
 assertTrue(client.isConnected()); // Fails in 1.2.1, works in 1.2.0

From the client perspective I'd like to have a way to provide custom SO_TIMEOUT. Another way to fix this could be to make SO_TIMEOUT longer, let's say same as connection timeout by default. Or maybe even socket read timeout exception should be treated differently.

Could you please review the code in the above links and discuss what's the proper solution should be?

kandrej commented 5 years ago

I think I have the same problem. When using version 1.2.1 to connect to a broker with ws:// or wss://, the client always disconnects after 1 second. Version 1.2.0 works fine.

I tried to increase the SO_TIMEOUT to 5 seconds, but the client just disconnects after 5 seconds instead of 1. I think it has something to do with the way WebSocket(Secure)NetworkModule use Input/Output streams

MMaiero commented 5 years ago

We have the same issue with Kura: we configure WebSocket connection, it connects and than it disconnects. It worked fine in 1.2.0.

Any update on this? Did you find a workaround?

kandrej commented 5 years ago

I think PR #655 should fix the problem

cdealti commented 5 years ago

@kandrej I think we should clarify with @rehanshaukat the intention of adding the socket timeout. We have successfully used 1.2.0 in production without any of the issues mentioned in the commit log.

There is no socket read timeout set currently and default is to set to zero i.e. wait forever. Adding socket read timeout will help recovering from edge cases where packets are lost on the wire and MqttInputStream.readMqttWireMessage() fails to notify ClientState.

The MQTT PING message is used exactly for this purpose and works as expected after reverting the commit.

In fact, readMqttWireMessage does handle the case for exception thrown due to socket read timeout and ignore it.

Do you mean this?

I'm the author of that change which was added as a result of #28.

kandrej commented 5 years ago

@cdealti I agree we should clarify. Please consider adding a comment in the code in case the socket timeout is not needed as the history of the file shows it was added and removed at least 2 times.

However, note that in the current state readMqttWireMessage will not catch the socket read timeout because it reads from a PipedInputStream, which has been closed as the result of an exception described by @tomekjarosik

rehanshaukat commented 5 years ago

Sorry for replying very late on this issue.

I haven't used 1.2.1 in production yet. Bit of context: we use https://github.com/aws/aws-iot-device-sdk-java to connect to aws IoT services. AWS IoT SDK use old version of paho 1.1.0. Random IoT stopped working issues were noticed when using WebSocket connections. While reading paho codebase, I spotted that socket timeout was set to unlimited which did not look sensible and also the old commit where socket read timeout was removed did not explain why it was removed so I added the socket timeout back and raised a PR. We very recently started using paho 1.2.0 at a small level in production which is giving better results (not 100% confirmed that the issue we were seeing in 1.1.0 is fixed completely).

I think SocketTimeoutException should not make the thread stop. If it is stoping the receiver then #655 should fix this.

cdealti commented 5 years ago

@kandrej the socket read timeout was added with e71753673c0f3dadbb298392b338e7452c91a35a to support SSL session resumption #28. This contribution took one year and a half to be merged in the codebase and released with v1.1.1. The contribution was meant for MQTT over TCP or SSL (not Websockets) and came with a test.

The socket timeout was removed with bbd6dc730cc88b09226086998735cfc5867582fc without any comment and, I think, defeating the whole purpose of the the above contribution. This was released with v1.2.0.

@rehanshaukat re-added the one second socket timeout causing this Websockets issue with v1.2.1 (so I think we need a regression test for this).

Finally #655 ignores the timeout exception. However, I think Websockets are now broken again as WebSocketFrame expects to read the full frame from the socket without any timeout which might not be always the case with large payloads and unreliable links like cellular links. I think we need a test for this