Closed alex-gimenez closed 3 years ago
The IDE runs on Windows, but it connects to a VxWorks system which runs the application. The library was built statically and I'm using my own application on top (which is the interface between the machine's user and the paho static library). This application just receives connection parameters to populate the paho variables and creates a thread which runs as an infinite loop. This thread uses the MQTTAsync functions.
For this case, I just call MQTTAsync_connect once and wait. Then the callback functions are called when they should be (logs, connection failure, and so on). I can try with 1.3.7 and post the results, but it would be nice to know whether this problem originates on this platform or can also be reproduced on a more general OS (Linux). Unfortunately, it's not possible for me to run those tests on a Linux machine.
I've run tests with version 1.3.7 but the problem is still there. The behaviour I observed is the same (MQTTVersion becomes 0 somewhere and forces MQTTPacket_send_connect to fail).
The only change I observed is that the "onConnectFailure" callback was being called in version 1.3.6 and now it is "connectionlost", which I think makes more sense. That change seems related to https://github.com/eclipse/paho.mqtt.c/issues/974
I did a PR that has fixed the issue for me. I'm not sure if there could be other situations where something similar is still happening, but I can't reproduce it anymore. https://github.com/eclipse/paho.mqtt.c/pull/1012
Are you using the serverURIs field in your connect? If so, with how many entries?
I'm just using MQTTAsync_createWithOptions giving a string with the hostname + port. "ssl://test.mosquitto.org:8883"
Thanks. It's the "after a long period disconnected" that I don't understand. I'd like to know what the cause is so that there aren't any unexpected side effects of a fix.
I just had a thought which is if you set the MQTTVersion field in the connect options to MQTTVERSION_3_1_1 rather than MQTTVERSION_DEFAULT, this could distinguish between a memory overwrite (it still gets set to 0) and the field being wrongly initialized or not at all.
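The reasoning behind that suggestion can be written down as a small decision function. This is only a sketch of the diagnostic logic, not paho code: the two constants are stand-ins copied so the example compiles on its own, and `diagnose_version` and the `version_diag` enum are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-ins for the constants in paho's MQTTAsync.h, so this compiles alone. */
#define MQTTVERSION_DEFAULT 0
#define MQTTVERSION_3_1_1   4

typedef enum {
    DIAG_OK,                 /* observed value is non-zero, so not this failure */
    DIAG_CHANGED_AFTER_SET,  /* explicit non-zero setting later seen as 0: overwrite suspected */
    DIAG_AMBIGUOUS           /* 0 configured and 0 observed: cases cannot be told apart */
} version_diag;

/* configured: the value the application put in the connect options.
 * observed:   the value seen in the debugger or trace when the reconnect fails. */
version_diag diagnose_version(int configured, int observed)
{
    /* Assumed precondition: only the version numbers paho defines are configured. */
    assert(configured == 0 || configured == 3 || configured == 4 || configured == 5);

    if (observed != 0)
        return DIAG_OK;
    if (configured != MQTTVERSION_DEFAULT)
        return DIAG_CHANGED_AFTER_SET;
    return DIAG_AMBIGUOUS;
}

/* Human-readable label for a diagnosis, for logging. */
const char *diag_name(version_diag d)
{
    switch (d) {
    case DIAG_OK:                return "ok";
    case DIAG_CHANGED_AFTER_SET: return "changed after being set (overwrite suspected)";
    default:                     return "ambiguous (default was configured)";
    }
}
```

With MQTTVERSION_DEFAULT configured, a 0 seen at failure time says nothing, because 0 is also the default. With MQTTVERSION_3_1_1 configured, a 0 at failure time means something changed the field after the application set it.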
Also, as you have statically linked the library, this means that the memory is writeable by your application as well as the library, doesn't it?
Other checks would be, does the same issue occur if:
a) the application is pointed at a non-existent address, so fails to connect from the start
b) the broker being connected to is taken down, rather than the network cable being unplugged
I'm going to see if I can reproduce but if it's particular to your environment for some reason, then that wouldn't work. When I say try to reproduce, I mean by leaving "for a long time" - I tried several shorter periods already and didn't see the problem. But "a long time" is obviously not specific.
I tried an overnight test on my Ubuntu system using the sample paho-c-sub, stopping a local broker then restarting it in the morning and the reconnect worked.
Maybe it's a DNS problem?
My thoughts are:
I used the paho_c_sub sample to ssl://test.mosquitto.org:8883 overnight, disconnecting the network, and it reconnected ok in the morning.
I'm tempted to put the log message to catch this potential situation into 1.3.9 (see the PR) and then see if anyone else experiences it in practice.
I've had some reports of this with the Rust client wrapper, running on Embedded Linux, but have not been able to reproduce it myself yet.
I've put a change in which includes a trace message that should be written if this occurs. It might help to diagnose the problem.
I'm sorry I couldn't reply sooner. I don't have access to those devices anymore so I won't be able to do more tests but I'll try to answer your questions:
Due to environment - VxWorks? I didn't try reproducing it in any other system so maybe... but we don't know
Static linking allows application to overwrite memory? This shouldn't happen, although it's technically possible. However, it would probably also end in a system crash eventually, since it would be caused by programming errors.
Exactly specific combination of parameters? The client was using reconnect, keep alive every 30s, tcp ssl connection, clean session. I can't remember all the other parameters, sorry.
Maybe with that trace message other people will also report it, thanks for your work
Hi. It seems I have hit a similar problem.
Test: both cable and 4G are available at first, and MQTT is connected over cable. I unplug the cable and wait for MQTT to connect over 4G, then plug the cable back in and wait for MQTT to connect over cable again. After doing that a few times (sometimes more), MQTT blocks. I found
So I set the client to use only mqtt4, and the problem disappeared.
mqttConnOpts.MQTTVersion = MQTTVERSION_3_1_1;
The MQTT server is mosquitto 2.0.11, and testing the server with mosquitto_pub/sub using MQTT3 works fine.
Now the same problem has happened again while using only mqtt4. I suspect that frequently connecting and disconnecting as the network interface switches leads to the connect callback not being triggered.
I'm trying to work out the scenario and what happened. I see this issue is under the paho.mqtt.c 1.3.9 milestone, so I will repeat the same test on the newest version.
Describe the bug Using the Async API to create a TCP SSL connection to a broker, enabling the automatic reconnect feature and leaving the system disconnected from the network for some hours (10h or more) prevents the reconnect function from working properly. This does not happen (or I couldn't reproduce it) with plain TCP connections.
To Reproduce To reproduce it:
I've traced the problem down to MQTTAsync_connecting (see screenshots). There the parameter MQTTVersion is 0, which makes MQTTPacket_send fail without sending anything over the wire. MQTTVersion should be 4, corresponding to version 3.1.1. It shouldn't be 0, so somewhere there must be a memory overwrite? I'm not sure when or where it is being reset.
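To illustrate why a version of 0 cannot produce a CONNECT packet: the MQTT specifications define a protocol name and level only for 3.1 ("MQIsdp", 3), 3.1.1 ("MQTT", 4) and 5 ("MQTT", 5). The sketch below maps a version number to those two fields and rejects anything else. It is not paho's implementation; `connect_protocol_fields` and the message it prints are hypothetical, written only to show the kind of check that fails here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Fills in the protocol name and protocol level that an MQTT CONNECT packet
 * carries for the given version number.
 * Returns 0 on success, or -1 if the version has no valid encoding (such as 0),
 * in which case nothing can be sent. */
int connect_protocol_fields(int mqtt_version, const char **name, unsigned char *level)
{
    assert(name != NULL && level != NULL);

    switch (mqtt_version) {
    case 3:  /* MQTT 3.1 */
        *name = "MQIsdp";
        *level = 3;
        return 0;
    case 4:  /* MQTT 3.1.1 */
        *name = "MQTT";
        *level = 4;
        return 0;
    case 5:  /* MQTT 5 */
        *name = "MQTT";
        *level = 5;
        return 0;
    default:
        /* Hypothetical diagnostic, not the trace message added to the library. */
        fprintf(stderr, "cannot build CONNECT: unexpected MQTT version %d\n",
                mqtt_version);
        return -1;
    }
}
```

A check of this kind, with a log line on the failing branch, makes the condition visible in traces instead of showing up only as a failed send.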
Expected behavior It should always reconnect, no matter how long the connection has been down.
Screenshots
Environment (please complete the following information):
Additional context Destroying the client and creating a new one will solve the problem but then the automatic reconnect can't be trusted for long periods without connection. This is a fragment of the log where you can see MQTTPacket_send_connect failing (because of the MQTTVersion being 0)
This one is the same system but at the first connection (succeeding)