eclipse / paho.mqtt.c

An Eclipse Paho C client library for MQTT for Windows, Linux and MacOS. API documentation: https://eclipse.github.io/paho.mqtt.c/
https://eclipse.org/paho

Automatic reconnect fails after long period disconnected #1007

Closed alex-gimenez closed 3 years ago

alex-gimenez commented 3 years ago

Describe the bug Using the Async API to create a TCP SSL connection to a broker, enabling the auto-reconnect feature, and leaving the system disconnected from the network for several hours (10+) prevents the reconnect function from working properly. This does not happen (or I couldn't reproduce it) with plain TCP connections.

To Reproduce To reproduce it:

I've traced the problem down to MQTTAsync_connecting (see screenshots). There, the MQTTVersion parameter is 0, which causes MQTTPacket_send to fail without sending anything over the wire. MQTTVersion should be 4, corresponding to version 3.1.1. It shouldn't be 0, so perhaps there is a memory overwrite somewhere? I'm not sure when or where it gets reset.
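To illustrate why a version of 0 cannot produce any bytes on the wire, here is a minimal standalone sketch. It is not the library's code: connect_protocol_header is a hypothetical helper, and only the numeric MQTTVERSION_* values are taken from the Paho headers. It maps a concrete MQTT version to the protocol name and protocol level that a CONNECT packet carries, and rejects anything else.

```c
#include <assert.h>
#include <stddef.h>

/* Numeric values as defined in the Paho C headers. */
#define MQTTVERSION_DEFAULT 0
#define MQTTVERSION_3_1     3
#define MQTTVERSION_3_1_1   4
#define MQTTVERSION_5       5

/*
 * Hypothetical helper (not a Paho function): pick the protocol name and
 * protocol level that a CONNECT packet carries for a concrete MQTT version.
 * Returns 0 on success, -1 if the version has no CONNECT encoding.
 */
int connect_protocol_header(int version, const char **name, unsigned char *level)
{
    assert(name != NULL && level != NULL);

    switch (version) {
    case MQTTVERSION_3_1:
        *name = "MQIsdp";
        *level = 3;
        return 0;
    case MQTTVERSION_3_1_1:
    case MQTTVERSION_5:
        *name = "MQTT";
        *level = (unsigned char)version;
        return 0;
    default:
        /* Includes MQTTVERSION_DEFAULT (0): nothing can be serialized. */
        return -1;
    }
}
```

In this model, a caller that reaches the send step with the version still at 0 fails before a single byte is written, which matches the behaviour described above.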

Expected behavior It should always reconnect, no matter how long the connection has been down.

Screenshots: MQTTVersion_0, MQTTPacket_send_connect, MQTTVersion_0_(2)

Environment (please complete the following information):

Additional context Destroying the client and creating a new one solves the problem, but it means the automatic reconnect can't be trusted after long periods without a connection. This is a fragment of the log where you can see MQTTPacket_send_connect failing (because MQTTVersion is 0): (image)

This one is from the same system, at the first connection (which succeeds): (image)

icraggs commented 3 years ago
alex-gimenez commented 3 years ago

The IDE runs on Windows, but it connects to a VxWorks system, which runs the application. The library was built statically, and I'm using my own application on top of it (the interface between the machine's user and the Paho static library). This application just receives connection parameters to populate the Paho variables and creates a thread that runs an infinite loop. This thread uses the MQTTAsync functions.

For this case, I just call MQTTAsync_connect once and wait. The callback functions are then called when they should be (logs, connection failure, and so on). I can try with 1.3.7 and post the results, but it would be nice to know whether this problem originates in this platform or can also be reproduced on a more general OS (Linux). Unfortunately, I can't run those tests on a Linux machine.

alex-gimenez commented 3 years ago

I've run tests with version 1.3.7, but the problem is still there. The behaviour I observed is the same (MQTTVersion becomes 0 somewhere and forces MQTTPacket_send_connect to fail).

MQTTVersion_0_v137

The only change I observed is that the "onConnectFailure" callback was called in version 1.3.6, whereas now "connectionlost" is called, which I think makes more sense. That change seems related to https://github.com/eclipse/paho.mqtt.c/issues/974

alex-gimenez commented 3 years ago

I opened a PR that fixes the issue for me. I'm not sure whether there are other situations where something similar still happens, but I can't reproduce it anymore. https://github.com/eclipse/paho.mqtt.c/pull/1012

icraggs commented 3 years ago

Are you using the serverURIs field in your connect? If so, with how many entries?

alex-gimenez commented 3 years ago

I'm just using MQTTAsync_createWithOptions, passing a string with the hostname and port: "ssl://test.mosquitto.org:8883"

icraggs commented 3 years ago

Thanks. It's the "after a long period disconnected" that I don't understand. I'd like to know what the cause is so that there aren't any unexpected side effects of a fix.

icraggs commented 3 years ago

I just had a thought: if you set the MQTTVersion field in the connect options to MQTTVERSION_3_1_1 rather than MQTTVERSION_DEFAULT, this could distinguish between a memory overwrite (it still gets set to 0) and the field being wrongly initialized, or not initialized at all.
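The reasoning behind that check can be written down as a small decision function. This is only a sketch of the logic, not library code: diagnose_version and the enum names are hypothetical, and only the numeric MQTTVERSION_* values come from the Paho headers. Note that when the version is left at the default, an observed 0 does not rule out an overwrite, which is why setting the version explicitly makes the test more informative.

```c
#include <assert.h>

/* Numeric values as defined in the Paho C headers. */
#define MQTTVERSION_DEFAULT 0
#define MQTTVERSION_3_1     3
#define MQTTVERSION_3_1_1   4
#define MQTTVERSION_5       5

/* Hypothetical classification of what is seen at connect time. */
typedef enum {
    VERSION_OK,          /* a concrete version reached the send step */
    SUSPECT_OVERWRITE,   /* explicit version was set, yet 0 was observed */
    UNRESOLVED_DEFAULT   /* default was configured and never became concrete */
} version_diagnosis;

static int is_concrete(int v)
{
    return v == MQTTVERSION_3_1 || v == MQTTVERSION_3_1_1 || v == MQTTVERSION_5;
}

/*
 * configured: the value the application put in the connect options.
 * observed:   the value seen where the CONNECT packet is about to be sent.
 */
version_diagnosis diagnose_version(int configured, int observed)
{
    assert(configured == MQTTVERSION_DEFAULT || is_concrete(configured));

    if (is_concrete(observed))
        return VERSION_OK;
    if (configured != MQTTVERSION_DEFAULT)
        return SUSPECT_OVERWRITE;   /* something changed an explicit value */
    return UNRESOLVED_DEFAULT;      /* points at initialization or fallback logic */
}
```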

Also, as you have statically linked the library, this means that the memory is writeable by your application as well as the library, doesn't it?

Other checks would be, does the same issue occur if:

a) the application is pointed at a non-existent address, so fails to connect from the start
b) the broker being connected to is taken down, rather than the network cable being unplugged

I'm going to see if I can reproduce but if it's particular to your environment for some reason, then that wouldn't work. When I say try to reproduce, I mean by leaving "for a long time" - I tried several shorter periods already and didn't see the problem. But "a long time" is obviously not specific.

icraggs commented 3 years ago

I tried an overnight test on my Ubuntu system using the sample paho_c_sub, stopping a local broker and then restarting it in the morning, and the reconnect worked.

jumoog commented 3 years ago

Maybe it's a DNS problem?

icraggs commented 3 years ago

My thoughts are:

  1. Due to environment - VxWorks?
  2. Static linking allows application to overwrite memory?
  3. A specific combination of parameters?

I used the paho_c_sub sample to ssl://test.mosquitto.org:8883 overnight, disconnecting the network, and it reconnected ok in the morning.

I'm tempted to put the log message to catch this potential situation into 1.3.9 (see the PR) and then see if anyone else experiences it in practice.

fpagliughi commented 3 years ago

I've had some reports of this with the Rust client wrapper, running on Embedded Linux, but have not been able to reproduce it myself yet.

icraggs commented 3 years ago

I've put in a change that includes a trace message, which should be written if this occurs. It might help to diagnose the problem.

alex-gimenez commented 3 years ago

I'm sorry I couldn't reply sooner. I don't have access to those devices anymore so I won't be able to do more tests but I'll try to answer your questions:

  1. Due to environment - VxWorks? I didn't try reproducing it on any other system, so maybe, but we don't know.

  2. Static linking allows application to overwrite memory? This shouldn't happen, although it's technically possible. However, it would probably also produce a system crash eventually, since it would be due to programming errors.

  3. A specific combination of parameters? The client was using automatic reconnect, a keep-alive of 30 s, a TCP SSL connection, and clean session. I can't remember all the other parameters, sorry.

Maybe with that trace message other people will also report it. Thanks for your work.

linxingyang commented 3 years ago

Hi. It seems I've hit a similar problem.

Test: both cable and 4G are available at first, and MQTT is connected over the cable. I then unplug the cable, wait for MQTT to connect over 4G, plug the cable back in, wait for MQTT to connect over the cable again, and so on. After doing that a few times (sometimes more), MQTT blocks. I found

So I set it to use only MQTT version 4, and that problem disappeared.

mqttConnOpts.MQTTVersion = MQTTVERSION_3_1_1;

The MQTT server is mosquitto 2.0.11. Testing the server with mosquitto_pub/sub using MQTT3 works fine.
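For context on why pinning the version changes the behaviour: the Paho headers document MQTTVERSION_DEFAULT as "start with 3.1.1, and if that fails, fall back to 3.1", so an explicit version removes the fallback step. A rough standalone model of that ordering is below. versions_to_try is a hypothetical helper, not a Paho function, and the real library's internal bookkeeping may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Numeric values as defined in the Paho C headers. */
#define MQTTVERSION_DEFAULT 0
#define MQTTVERSION_3_1     3
#define MQTTVERSION_3_1_1   4

/*
 * Hypothetical helper: fill `out` with the protocol versions a connect
 * attempt would go through, in order, and return how many there are.
 * With the default setting there are two candidates; with an explicit
 * version there is exactly one and no fallback.
 */
int versions_to_try(int configured, int out[2])
{
    assert(out != NULL);

    if (configured == MQTTVERSION_DEFAULT) {
        out[0] = MQTTVERSION_3_1_1;  /* first attempt */
        out[1] = MQTTVERSION_3_1;    /* fallback if the first attempt fails */
        return 2;
    }
    out[0] = configured;
    return 1;
}
```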

And now the same problem has happened again while using only MQTT version 4. I suspect that frequent connects and disconnects while the network interface switches lead to the connect callback not being triggered.

I'm now trying to pin down the scenario and what happened. I found this issue under the paho.mqtt.c 1.3.9 milestone, so I will repeat the same test on the newest version.