eclipse / paho.mqtt.c

An Eclipse Paho C client library for MQTT for Windows, Linux and MacOS. API documentation: https://eclipse.github.io/paho.mqtt.c/
https://eclipse.org/paho
Other
1.96k stars 1.1k forks source link

Segmentation faults during disconnect and failed reconnect attempt #385

Closed JuergenKosel closed 6 years ago

JuergenKosel commented 6 years ago

Hello,

while continuing testing for issue #373 a segmentation fault was caused in the following line: https://github.com/eclipse/paho.mqtt.c/blob/c85e4a2fd6e96ebc35caef99a024d94c9b664ba0/src/MQTTAsync.c#L1099

Output from gdb:

Program terminated with signal 11, Segmentation fault.
#0  0x0fed5b98 in MQTTAsync_freeCommand1 (command=0x0) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1024
1024        if (command->command.type == SUBSCRIBE)
(gdb) up
#1  0x0fed5d50 in MQTTAsync_freeCommand (command=0x0) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1054
1054        MQTTAsync_freeCommand1(command);
(gdb) up
#2  0x0fed5f68 in MQTTAsync_writeComplete (socket=26) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1105
1105                MQTTAsync_freeCommand(com);
(gdb) bt
#0  0x0fed5b98 in MQTTAsync_freeCommand1 (command=0x0) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1024
#1  0x0fed5d50 in MQTTAsync_freeCommand (command=0x0) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1054
#2  0x0fed5f68 in MQTTAsync_writeComplete (socket=26) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1105
#3  0x0feed364 in Socket_continueWrites (pwset=0x54084bcc) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:822
#4  0x0feeba00 in Socket_getReadySocket (more_work=0, tp=0x54084c8c) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:264
#5  0x0fedb7dc in MQTTAsync_cycle (sock=0x54084d28, timeout=1000, rc=0x54084d24) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:2916
#6  0x0fed7cb0 in MQTTAsync_receiveThread (n=0x102bda84) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1740
#7  0x0e842a6c in ?? ()
#8  0x0e4ec8a0 in ?? ()
JuergenKosel commented 6 years ago

Hello,

in another run of my test the segmentation fault occurred some lines earlier: https://github.com/eclipse/paho.mqtt.c/blob/c85e4a2fd6e96ebc35caef99a024d94c9b664ba0/src/MQTTAsync.c#L1094

Program terminated with signal 11, Segmentation fault.
#0  0x6a2f6560 in ?? ()
(gdb) bt
#0  0x6a2f6560 in ?? ()
#1  0x0fed5f40 in MQTTAsync_writeComplete (socket=26) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1100
#2  0x0feed370 in Socket_continueWrites (pwset=0x53884bcc) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:822
#3  0x0feeba0c in Socket_getReadySocket (more_work=0, tp=0x53884c8c) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:264
#4  0x0fedb7e8 in MQTTAsync_cycle (sock=0x53884d28, timeout=1000, rc=0x53884d24) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:2917
#5  0x0fed7cbc in MQTTAsync_receiveThread (n=0x102f29b4) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1741
#6  0x0e842a6c in ?? ()
#7  0x0e4ec8a0 in ?? ()
(gdb) up
#1  0x0fed5f40 in MQTTAsync_writeComplete (socket=26) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1100
1100                    (*(command->onSuccess))(command->context, &data);
(gdb) print command
$1 = (MQTTAsync_command *) 0x5c6a474c
(gdb) print *command
Cannot access memory at address 0x5c6a474c
(gdb) print data
$2 = {token = 1869967971, alt = {qos = 0, qosList = 0x0, pub = {message = {struct_id = "\000\000\000", struct_version = 267256228, payloadlen = 1885431919, payload = 0x2e6d7174, qos = 1949197103, retained = 1936876335, dup = 267437916, msgid = 1401441120}, destinationName = 0x756c6c2f <Address 0x756c6c2f out of bounds>}, connect = {serverURI = 0x0, MQTTVersion = 267256228, sessionPresent = 1885431919}}}
(gdb) print m
$3 = (MQTTAsyncs *) 0x102f29b4
(gdb) print *m
$4 = {serverURI = 0x102f2b1c "SLGB-VM-MQTTSRV.inat.internal:1883", ssl = 0, c = 0x102f2d2c, cl = 0x100bc6ac <MqttBroker::MQTTAsync_connectionLost(void*, char*)>, ma = 0x100bc634 <MqttBroker::MQTTAsync_messageArrived(void*, char*, int, MQTTAsync_message*)>, dc = 0x100bc668 <MqttBroker::MQTTAsync_deliveryComplete(void*, int)>, context = 0x102c7e38, connected = 0x0, connected_context = 0x0, connect = {type = 1, onSuccess = 0x100bc7f4 <MqttBroker::MQTTAsync_onConnectSuccess(void*, MQTTAsync_successData*)>, onFailure = 0x100bca08 <MqttBroker::MQTTAsync_onConnectFailure(void*, MQTTAsync_failureData*)>, token = 0, context = 0x102c7e38, start_time = {tv_sec = 1512079421, tv_usec = 821600}, details = {sub = {count = 0, topics = 0x3, qoss = 0x0}, unsub = {count = 0, topics = 0x3}, pub = {destinationName = 0x0, payloadlen = 3, payload = 0x0, qos = 0, retained = 0}, dis = {internal = 0, timeout = 3}, conn = {currentURI = 0, MQTTVersion = 3}}}, disconnect = {type = 14, onSuccess = 0x0, onFailure = 0x0, token = 0, context = 0x0, start_time = {tv_sec = 1512079385, tv_usec = 37981}, details = {sub = {count = 1, topics = 0x0, qoss = 0x0}, unsub = {count = 1, topics = 0x0}, pub = {destinationName = 0x1 <Address 0x1 out of bounds>, payloadlen = 0, payload = 0x0, qos = 0, retained = 0}, dis = {internal = 1, timeout = 0}, conn = {currentURI = 1, MQTTVersion = 0}}}, pending_write = 0x5c6a474c, responses = 0x102f2bdc, command_seqno = 0, pack = 0x0, createOptions = 0x0, shouldBeConnected = 1, automaticReconnect = 0, minRetryInterval = 1, maxRetryInterval = 60, serverURIcount = 0, serverURIs = 0x0, connectTimeout = 2, currentInterval = 0, lastConnectionFailedTime = {tv_sec = 0, tv_usec = 0}, retrying = 0, reconnectNow = 0}
(gdb) print m->pending_write
$5 = (MQTTAsync_command *) 0x5c6a474c

This time the crash happened, because an invalid function pointer was called. And this could be the same thing what appeared to me as a stack corruption some days ago.

JuergenKosel commented 6 years ago

The previous crashes happened with the old unmodified code. The following one occurred after merging in e18db33e79c960dbf492653e5a0d1c54716a7f9b from branch 373:

Program terminated with signal 11, Segmentation fault.
#0  0x0fed5c40 in MQTTAsync_freeCommand1 (command=0x0) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1024
1024        if (command->command.type == SUBSCRIBE)
(gdb) bt
#0  0x0fed5c40 in MQTTAsync_freeCommand1 (command=0x0) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1024
#1  0x0fed5df8 in MQTTAsync_freeCommand (command=0x0) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1054
#2  0x0fed6010 in MQTTAsync_writeComplete (socket=26) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1105
#3  0x0feed610 in Socket_continueWrites (pwset=0x53884bcc) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:859
#4  0x0feebc20 in Socket_getReadySocket (more_work=0, tp=0x53884c8c) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:267
#5  0x0fedb940 in MQTTAsync_cycle (sock=0x53884d28, timeout=1000, rc=0x53884d24) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:2923
#6  0x0fed7d88 in MQTTAsync_receiveThread (n=0x102f29b4) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1742
#7  0x0e842a6c in ?? ()
#8  0x0e4ec8a0 in ?? ()
(gdb) up
#1  0x0fed5df8 in MQTTAsync_freeCommand (command=0x0) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1054
1054        MQTTAsync_freeCommand1(command);
(gdb) up
#2  0x0fed6010 in MQTTAsync_writeComplete (socket=26) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1105
1105                MQTTAsync_freeCommand(com);
(gdb) print m
$1 = (MQTTAsyncs *) 0x102f29b4
(gdb) print *m
$2 = {serverURI = 0x102f2b1c "SLGB-VM-MQTTSRV.inat.internal:1883", ssl = 0, c = 0x102f2d2c, cl = 0x100bc688 <MqttBroker::MQTTAsync_connectionLost(void*, char*)>, ma = 0x100bc610 <MqttBroker::MQTTAsync_messageArrived(void*, char*, int, MQTTAsync_message*)>, dc = 0x100bc644 <MqttBroker::MQTTAsync_deliveryComplete(void*, int)>, context = 0x102c7e38, connected = 0x0, connected_context = 0x0, connect = {type = 1, onSuccess = 0x100bc7d0 <MqttBroker::MQTTAsync_onConnectSuccess(void*, MQTTAsync_successData*)>, onFailure = 0x100bc9e4 <MqttBroker::MQTTAsync_onConnectFailure(void*, MQTTAsync_failureData*)>, token = 0, context = 0x102c7e38, start_time = {tv_sec = 1512115558, tv_usec = 717215}, details = {sub = {count = 0, topics = 0x3, qoss = 0x0}, unsub = {count = 0, topics = 0x3}, pub = {destinationName = 0x0, payloadlen = 3, payload = 0x0, qos = 0, retained = 0}, dis = {internal = 0, timeout = 3}, conn = {currentURI = 0, MQTTVersion = 3}}}, disconnect = {type = 14, onSuccess = 0x0, onFailure = 0x0, token = 0, context = 0x0, start_time = {tv_sec = 1512115512, tv_usec = 732711}, details = {sub = {count = 1, topics = 0x0, qoss = 0x0}, unsub = {count = 1, topics = 0x0}, pub = {destinationName = 0x1 <Address 0x1 out of bounds>, payloadlen = 0, payload = 0x0, qos = 0, retained = 0}, dis = {internal = 1, timeout = 0}, conn = {currentURI = 1, MQTTVersion = 0}}}, pending_write = 0x0, responses = 0x102f2bdc, command_seqno = 0, pack = 0x0, createOptions = 0x0, shouldBeConnected = 1, automaticReconnect = 0, minRetryInterval = 1, maxRetryInterval = 60, serverURIcount = 0, serverURIs = 0x0, connectTimeout = 2, currentInterval = 0, lastConnectionFailedTime = {tv_sec = 0, tv_usec = 0}, retrying = 0, reconnectNow = 0}
(gdb) print m->pending_write
$3 = (MQTTAsync_command *) 0x0
(gdb) print command
$4 = (MQTTAsync_command *) 0x1032075c
(gdb) print com
$5 = (MQTTAsync_queuedCommand *) 0x0
(gdb) print *command
$6 = {type = 3, onSuccess = 0x100b8acc <MqttMapper::MQTTAsync_onWriteSuccess(void*, MQTTAsync_successData*)>, onFailure = 0x100b8b28 <MqttMapper::MQTTAsync_onWriteFailure(void*, MQTTAsync_failureData*)>, token = 0, context = 0x102c3fa8, start_time = {tv_sec = 1512115515, tv_usec = 209927}, details = {sub = {count = 271666196, topics = 0x66, qoss = 0x10329594}, unsub = {count = 271666196, topics = 0x66}, pub = {destinationName = 0x10314c14 "klj/PLC_CIP_NEU/MILL_AMT_MALT", payloadlen = 102, payload = 0x10329594, qos = 0, retained = 0}, dis = {internal = 271666196, timeout = 102}, conn = {currentURI = 271666196, MQTTVersion = 102}}}

The following file contains the trace output before the crash: crash.log

icraggs commented 6 years ago

So, this is on a reconnect attempt?

JuergenKosel commented 6 years ago

No this happens sometimes when the connection gets lost. In the crash.log file you see the output between moment when socket becomes interrupted (send buffer full): 20171201 090558.721 CONNECT sent, *rc is -22 the moment when the kernel / libc knows that the connection is lost: 20171201 090600.948 Socket error No route to host(113) in writev - putdatas for socket 26 and the final crash.

To reproduce this error, we use a python script, which cyclically connects and disconnects the virtual machine running the broker from the LAN.

icraggs commented 6 years ago

I think "CONNECT sent" implies a connection or reconnection attempt, whether automatic or initiated by the application. I haven't been able to reproduce this one so far - the symptoms of CONNECT receiving a -22, yes, but not the invalid memory access. I think I don't understand the scenario completely.

However, I've added some protection in writeComplete which should avoid some scenarios. You could try that and see if it changes the behaviour for you.

JuergenKosel commented 6 years ago

Hello @icraggs

with your latest commit it looks much better on the first view. As far as I could see, there is also no memory lost.

Since it is a problem of hitting the right moment, I will test it automatically over the weekend.

Thanks Juergen

icraggs commented 6 years ago

Thanks. Have you automated the network failing?

JuergenKosel commented 6 years ago

Yes, it is a python script which disconnects the virtual machine of the broker from the LAN.

JuergenKosel commented 6 years ago

Unfortunately there is still a segmentation fault:

Program terminated with signal 11, Segmentation fault.
#0  0x6a2f6560 in ?? ()
(gdb) bt
#0  0x6a2f6560 in ?? ()
#1  0x0fed5fe8 in MQTTAsync_writeComplete (socket=26) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1102
#2  0x0feed610 in Socket_continueWrites (pwset=0x53884bcc) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:859
#3  0x0feebc20 in Socket_getReadySocket (more_work=0, tp=0x53884c8c) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:267
#4  0x0fedb940 in MQTTAsync_cycle (sock=0x53884d28, timeout=1000, rc=0x53884d24) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:2925
#5  0x0fed7d88 in MQTTAsync_receiveThread (n=0x102f29b4) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1744
#6  0x0e842a6c in ?? ()
#7  0x0e4ec8a0 in ?? ()
(gdb) up
#1  0x0fed5fe8 in MQTTAsync_writeComplete (socket=26) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1102
1102                      (*(command->onSuccess))(command->context, &data);
(gdb) print cur_response
$1 = (ListElement *) 0x102f5264
(gdb) print command
$2 = (MQTTAsync_command *) 0x59909e0c
(gdb) print m
$3 = (MQTTAsyncs *) 0x102f29b4
(gdb) print *m
$4 = {serverURI = 0x102f2b1c "SLGB-VM-MQTTSRV.inat.internal:1883", ssl = 0, c = 0x102f2d2c, cl = 0x100bc688 <MqttBroker::MQTTAsync_connectionLost(void*, char*)>, ma = 0x100bc610 <MqttBroker::MQTTAsync_messageArrived(void*, char*, int, MQTTAsync_message*)>, dc = 0x100bc644 <MqttBroker::MQTTAsync_deliveryComplete(void*, int)>, context = 0x102c7e38, connected = 0x0, connected_context = 0x0, connect = {type = 1, onSuccess = 0x100bc7d0 <MqttBroker::MQTTAsync_onConnectSuccess(void*, MQTTAsync_successData*)>, onFailure = 0x100bc9e4 <MqttBroker::MQTTAsync_onConnectFailure(void*, MQTTAsync_failureData*)>, token = 0, context = 0x102c7e38, start_time = {tv_sec = 1512134541, tv_usec = 657709}, details = {sub = {count = 0, topics = 0x3, qoss = 0x0}, unsub = {count = 0, topics = 0x3}, pub = {destinationName = 0x0, payloadlen = 3, payload = 0x0, qos = 0, retained = 0}, dis = {internal = 0, timeout = 3}, conn = {currentURI = 0, MQTTVersion = 3}}}, disconnect = {type = 14, onSuccess = 0x0, onFailure = 0x0, token = 0, context = 0x0, start_time = {tv_sec = 1512134501, tv_usec = 721577}, details = {sub = {count = 1, topics = 0x0, qoss = 0x0}, unsub = {count = 1, topics = 0x0}, pub = {destinationName = 0x1 <Address 0x1 out of bounds>, payloadlen = 0, payload = 0x0, qos = 0, retained = 0}, dis = {internal = 1, timeout = 0}, conn = {currentURI = 1, MQTTVersion = 0}}}, pending_write = 0x59909e0c, responses = 0x102f2bdc, command_seqno = 0, pack = 0x0, createOptions = 0x0, shouldBeConnected = 1, automaticReconnect = 0, minRetryInterval = 1, maxRetryInterval = 60, serverURIcount = 0, serverURIs = 0x0, connectTimeout = 2, currentInterval = 0, lastConnectionFailedTime = {tv_sec = 0, tv_usec = 0}, retrying = 0, reconnectNow = 0}
(gdb) print m->pending_write
$5 = (MQTTAsync_command *) 0x59909e0c

Which means, that catching cur_response == NULL is not enough. Because m->pending_write could be an invalid pointer!

JuergenKosel commented 6 years ago

Looking at the log file crash2.log indicates that this crash happend in a different situation: This time the segmentation fault occurred, during the 5th attempt to reconnect!

icraggs commented 6 years ago

In your previous post, you said this was not at reconnect time. I think that it is actually occurring at reconnect time - it's important to be clear on the scenario.

Is this an automatic reconnect, or an explicit one in the application? If explicit, is it made in the connectionLost callback or elsewhere?

icraggs commented 6 years ago

It would be good if you could attach the full trace, if you have it. Or is this the section after the first -22 was hit, so this is the whole thing?

JuergenKosel commented 6 years ago

The previously reported segmentation faults occurred at the time when the kernel reports to the library that the connection is broken. Shortly before the library should call the connectionLost callback function. I think, this is fixed by your commit 6fa19a82e0452a89d5dfb5f7b3ec2a50398b5448 .

In my application the connectionLost callback function queues a task into a worker thread, which waits some times and then calls MQTTAsync_connect(). So the MQTTAsync_connect() is not called in the context of the connectionLost callback!

This time another segmentation fault occurred during the attempt to connect. Because a connect command has no pending_write. I think this might be a slightly other issue.

JuergenKosel commented 6 years ago

Unfortunately crash2.log contains all the relevant output from mqtt library.

Usually there are to many traces. So my application sets ERROR trace level. But I have modified the end of put_datas() as follows:

        if (rc == TCPSOCKET_INTERRUPTED) Log_setTraceLevel(TRACE_MINIMUM);
    FUNC_EXIT_RC(rc);
    return rc;

So I get more trace output when the socket becomes interrupted.

The only thing I can do, is to recompile my application and hope that the crash occurs again :-(

icraggs commented 6 years ago

Enough of the scenario to be able to understand and ideally reproduce in a test is the important thing. I've added some further changes to the 373 branch, with corrections that look relevant. I've not had chance to try them out thoroughly yet, but if you want to try them, go ahead.

icraggs commented 6 years ago

I've tested those changes more now, and so far, they're looking good - no regressions.

JuergenKosel commented 6 years ago

Hello, I have continued to test with commit c640561c560ea1dee01d3a1a4612becbf34cce90 over night. Unfortunally another segmentation fault occurred (after the 564th disconnect): crash3.log gdb output:

Program terminated with signal 11, Segmentation fault.
#0  0x0fed5aa8 in MQTTProtocol_checkPendingWrites () at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:996
996             if (Socket_noPendingWrites(((pending_write*)(le->content))->socket))
(gdb) bt
#0  0x0fed5aa8 in MQTTProtocol_checkPendingWrites () at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:996
#1  0x0fed5e80 in MQTTAsync_writeComplete (socket=26, rc=-1) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1066
#2  0x0feed6f0 in Socket_continueWrites (pwset=0x53884bcc) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:865
#3  0x0feebce0 in Socket_getReadySocket (more_work=0, tp=0x53884c8c) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:267
#4  0x0fedba00 in MQTTAsync_cycle (sock=0x53884d28, timeout=1000, rc=0x53884d24) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:2941
#5  0x0fed7e48 in MQTTAsync_receiveThread (n=0x102f29b4) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1760
#6  0x0e842a6c in ?? ()
#7  0x0e4ec8a0 in ?? ()
(gdb) print le
$1 = (ListElement *) 0x59a017f4
(gdb) print *le
$2 = {prev = 0x65732d67, next = 0x69742d73, content = 0x766e2d66}
(gdb) print *le->content
Attempt to dereference a generic pointer.
(gdb) print (pending_write*)(le->content)
$3 = (pending_write *) 0x766e2d66
(gdb) print *((pending_write*)(le->content))
Cannot access memory at address 0x766e2d66
icraggs commented 6 years ago

What about memory? During those 563 iterations, was there any increase, or can't you tell?

JuergenKosel commented 6 years ago

Hello,

the memory leak issue #373 seems to be fixed. I do not see any growing of memory consumption anymore.

The last segmentation fault happened after about 12 hours of testing. With a network interruption for the broker each 1-2 minutes. So it will be hard to reproduce. :-( Maybe it is a completely different issue.

I think, we should use the fixes for the previously found issues. And we should keep in mind, that there is at least one more to fix.

icraggs commented 6 years ago

There's always one more with this sort of testing :-) I'm looking at this latest one

JuergenKosel commented 6 years ago

Hello,

this time it took just 90 minutes to reproduce the segfault:

Program terminated with signal 11, Segmentation fault.
#0  0x0fed5aa8 in MQTTProtocol_checkPendingWrites () at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:996
996             if (Socket_noPendingWrites(((pending_write*)(le->content))->socket))
(gdb) bt
#0  0x0fed5aa8 in MQTTProtocol_checkPendingWrites () at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:996
#1  0x0fed5e80 in MQTTAsync_writeComplete (socket=26, rc=-1) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1066
#2  0x0feed6f0 in Socket_continueWrites (pwset=0x53884bcc) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:865
#3  0x0feebce0 in Socket_getReadySocket (more_work=0, tp=0x53884c8c) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/Socket.c:267
#4  0x0fedba00 in MQTTAsync_cycle (sock=0x53884d28, timeout=1000, rc=0x53884d24) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:2941
#5  0x0fed7e48 in MQTTAsync_receiveThread (n=0x102f29b4) at /home/klj/echo-sources-git-svn-full/paho.mqtt.c/src/MQTTAsync.c:1760
#6  0x0e842a6c in ?? ()
#7  0x0e4ec8a0 in ?? ()
(gdb) print *((pending_write*)(le->content))
Cannot access memory at address 0x6f636f6c

And the logfile is also nearly (but not exact) the same: crash4.log

JuergenKosel commented 6 years ago

Maybe the following crash log gives more information: memory_corruption1.log This time the application was aborted due to memory corruption and the library has reported a double free / invalid free of a linked list element!

icraggs commented 6 years ago

Thanks. That may well help.

JuergenKosel commented 6 years ago

In the meantime I have merged my test from pr #386 into my local copy of your 373 branch. And with an increased number of connect iterations the test reports also a problem: Not freed memory! I repeat this test. And if this could reproduce this issue, I would update one of my pull requests.

JuergenKosel commented 6 years ago

With the latest update to pull request #384 I see the following result:

20171205 163536.573 connectCnt 39

20171205 163803.873 Disconnected from MQTT broker reason (null)
20171205 163803.873 Failed to queue message for send with retvalue -3
20171205 163803.973 Connected 0 connectCnt 39

20171205 163803.973 PublishCnt 2934339, FailedCnt 65, Pending 48725 maxPending 109866
20171205 163803.973 MQTT mem current 10915872, max 24611920
20171205 163803.973 Connected 0 connectCnt 39

20171205 163803.973 PublishCnt 2934339, FailedCnt 65, Pending 48725 maxPending 109866
20171205 163803.973 MQTT mem current 10916016, max 24611920
20171205 163803.981 Established MQTT connection to SLGB-VM-MQTTSRV.inat.internal:1883
20171205 163803.981   MQTT version 3.1.1

20171205 163803.981 connectCnt 40

20171205 163811.974 PublishCnt 2938667, FailedCnt 65, Pending 44397 maxPending 109866
20171205 163811.974 MQTT mem current 9946448, max 24611920
20171205 163811.974 PublishCnt 2938667, FailedCnt 65, Pending 44397 maxPending 109866
20171205 163811.974 MQTT mem current 9946592, max 24611920
20171205 163812.290 Some memory not freed at shutdown, possible memory leak
20171205 163812.290 Heap scan start, total 256 bytes
20171205 163812.290 Heap element size 32, line 766, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTProtocolClient.c, ptr 0x55b3a5e1c370
20171205 163812.290   Content           
20171205 163812.290 Heap element size 32, line 215, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTProtocolClient.c, ptr 0x7fdb0c001bb0
20171205 163812.371   Content           
20171205 163812.371 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fdb0c002cf0
20171205 163812.371   Content           
20171205 163812.371 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fdb0c002d10
20171205 163812.371   Content           
20171205 163812.371 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fdb0c002f20
20171205 163812.371   Content           
20171205 163812.371 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fdb0c003260
20171205 163812.371   Content           
20171205 163812.371 Heap element size 32, line 95, file /home/klj/org.eclipse.paho.mqtt.c/src/LinkedList.c, ptr 0x7fdb0c003650
20171205 163812.371   Content           
20171205 163812.371 Heap element size 32, line 95, file /home/klj/org.eclipse.paho.mqtt.c/src/LinkedList.c, ptr 0x7fdb0c003710
20171205 163812.371   Content           
20171205 163812.371 Heap element size 16, line 106, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTProtocolClient.c, ptr 0x7fdb0c0037b0
20171205 163812.371   Content           
20171205 163812.371 Heap element size 48, line 232, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTProtocolClient.c, ptr 0x7fdb0c0038e0
20171205 163812.371   Content           
20171205 163812.371 Heap scan end
20171205 163812.500 MQTT mem current 256, max 24611920
20171205 163812.500 verdict fail
JuergenKosel commented 6 years ago

This crash log file memory_corruption2.log is similar to the previous one, except that you could see that the library complains twice about double free / invalid free of linked list elements, called in MQTTProtocol_checkPendingWrites()

JuergenKosel commented 6 years ago

With the latest commit from pr #384 I have reproduced a segmentation fault again. Unfortunately cmake build does not build with debug symbols by default :-(

Thread 3 "test_issue373" received signal SIGPIPE, Broken pipe.

Thread 2 "test_issue373" received signal SIGPIPE, Broken pipe.

Thread 2 "test_issue373" received signal SIGPIPE, Broken pipe.
20171207 151221.334 Disconnected from MQTT broker reason (null)
20171207 151221.433 Connected 0 connectCnt 88

20171207 151221.433 PublishCnt 6617007, FailedCnt 137, Pending 48291 maxPending 240925
20171207 151221.433 MQTT mem current 10820688, max 53967792
20171207 151221.433 Connected 0 connectCnt 88

20171207 151221.433 PublishCnt 6617007, FailedCnt 137, Pending 48291 maxPending 240925
20171207 151221.433 MQTT mem current 10820832, max 53967792
20171207 151221.438 Established MQTT connection to SLGB-VM-MQTTSRV.inat.internal:1883
20171207 151221.438   MQTT version 3.1.1

20171207 151221.438 connectCnt 89

Thread 2 "test_issue373" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff735c700 (LWP 17859)]
0x00007ffff7b3a071 in intcompare () from /home/klj/org.eclipse.paho.mqtt.c/build.paho/src/libpaho-mqtt3a.so.1
(gdb) bt
#0  0x00007ffff7b3a071 in intcompare () from /home/klj/org.eclipse.paho.mqtt.c/build.paho/src/libpaho-mqtt3a.so.1
#1  0x00007ffff7b399ec in ListFindItem () from /home/klj/org.eclipse.paho.mqtt.c/build.paho/src/libpaho-mqtt3a.so.1
#2  0x00007ffff7b32617 in Socket_noPendingWrites ()
   from /home/klj/org.eclipse.paho.mqtt.c/build.paho/src/libpaho-mqtt3a.so.1
#3  0x00007ffff7b26c30 in MQTTAsync_processCommand ()
   from /home/klj/org.eclipse.paho.mqtt.c/build.paho/src/libpaho-mqtt3a.so.1
#4  0x00007ffff7b27c02 in MQTTAsync_sendThread ()
   from /home/klj/org.eclipse.paho.mqtt.c/build.paho/src/libpaho-mqtt3a.so.1
#5  0x00007ffff7364494 in start_thread (arg=0x7ffff735c700) at pthread_create.c:333
#6  0x00007ffff7866aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
(gdb) print a
No symbol "a" in current context.
(gdb) up
#1  0x00007ffff7b399ec in ListFindItem () from /home/klj/org.eclipse.paho.mqtt.c/build.paho/src/libpaho-mqtt3a.so.1
icraggs commented 6 years ago

I presume you are using the latest code from branch 373, at least:

https://github.com/eclipse/paho.mqtt.c/commit/c640561c560ea1dee01d3a1a4612becbf34cce90

(The subsequent test changes are not important, obviously). Can you summarize which symptoms you have seen with this latest version of the code, as opposed to earlier versions, thanks. Otherwise I don't know what is still outstanding. You don't have to go into detail again, just reference the applicable comments you have already made.

JuergenKosel commented 6 years ago

Hello,

if there are only a few disconnect / reconnects, everything looks fine.

Beside the added test application I use a python script, which cyclically connects / disconnects a virtual machine running the broker, each 70 to 120 seconds.

JuergenKosel commented 6 years ago

I forgot to mention that I currently test with commit b4bdb89861bb72537b8aa564db2d56b311b96803 especially commit 2761fcea7eaaad5937da3a11d0e87c5a516540b1 is important to get the TCPSOCKET_INTERRUPTED situation on a none embedded device.

JuergenKosel commented 6 years ago

Now I have some more useful gdb results:

20171207 165147.287 connectCnt 28

Thread 2 "test_issue373" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff735c700 (LWP 721)]
0x00007ffff7b3a071 in intcompare (a=0x0, b=0x7ffff735be6c) at /home/klj/org.eclipse.paho.mqtt.c/src/LinkedList.c:436
436     return *((int*)a) == *((int*)b);
(gdb) bt
#0  0x00007ffff7b3a071 in intcompare (a=0x0, b=0x7ffff735be6c) at /home/klj/org.eclipse.paho.mqtt.c/src/LinkedList.c:436
#1  0x00007ffff7b399ec in ListFindItem (aList=0x555555758e94, content=0x7ffff735be6c, 
    callback=0x7ffff7b3a061 <intcompare>) at /home/klj/org.eclipse.paho.mqtt.c/src/LinkedList.c:157
#2  0x00007ffff7b32617 in Socket_noPendingWrites (socket=3) at /home/klj/org.eclipse.paho.mqtt.c/src/Socket.c:403
#3  0x00007ffff7b26c30 in MQTTAsync_processCommand () at /home/klj/org.eclipse.paho.mqtt.c/src/MQTTAsync.c:1157
#4  0x00007ffff7b27c02 in MQTTAsync_sendThread (n=0x0) at /home/klj/org.eclipse.paho.mqtt.c/src/MQTTAsync.c:1518
#5  0x00007ffff7364494 in start_thread (arg=0x7ffff735c700) at pthread_create.c:333
#6  0x00007ffff7866aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
(gdb) up
#1  0x00007ffff7b399ec in ListFindItem (aList=0x555555758e94, content=0x7ffff735be6c, 
    callback=0x7ffff7b3a061 <intcompare>) at /home/klj/org.eclipse.paho.mqtt.c/src/LinkedList.c:157
157            (callback != NULL && callback(aList->current->content, content))))
(gdb) print alist
No symbol "alist" in current context.
(gdb) print aList
$1 = (List *) 0x555555758e94
(gdb) print aList->current 
$2 = (ListElement *) 0x0
(gdb) print *aList
$3 = {first = 0x0, last = 0x0, current = 0x0, count = 0, size = 4844}

This is strange, because ListFindItem() begins with the following statement:

                if (aList->current != NULL && ((callback == NULL && aList->current->content == content) ||
           (callback != NULL && callback(aList->current->content, content))))
        rc = aList->current;
JuergenKosel commented 6 years ago

Hello @icraggs

in https://github.com/eclipse/paho.mqtt.c/blob/c85e4a2fd6e96ebc35caef99a024d94c9b664ba0/src/MQTTAsync.c#L2907 the socket_mutex is locked before the call of Socket_getReadySocket(). But in https://github.com/eclipse/paho.mqtt.c/blob/c85e4a2fd6e96ebc35caef99a024d94c9b664ba0/src/MQTTAsync.c#L1132 calls Socket_noPendingWrites() without protection by socket_mutex. So it looks possible, that Socket_getReadySocket() -> Socket_continueWrites() -> (!ListRemove(s.write_pending, curpending->content) frees the list element which is currently used by ListFindItem() in the crash above!

JuergenKosel commented 6 years ago

Hello,

I have continued testing with the sources and test in pull request #384 and 3 embedded devices. None of them has crashed over the weekend. Therefore I believe, that this issue #385 is fixed now. But regarding the memory leak ( #373 ) there is still something left:

20171208 192159.702 connectCnt 100

20171208 192206.475 PublishCnt 754417, FailedCnt 13, Pending 4 maxPending 41095
20171208 192206.475 MQTT mem current 2800, max 17090864
20171208 192206.475 PublishCnt 754417, FailedCnt 13, Pending 4 maxPending 41095
20171208 192206.475 MQTT mem current 2944, max 17090864
20171208 192206.875 Some memory not freed at shutdown, possible memory leak
20171208 192206.875 Heap scan start, total 352 bytes
20171208 192206.875 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c000970
20171208 192206.875   Content           
20171208 192206.875 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c000a80
20171208 192206.875   Content           
20171208 192206.875 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c000ad0
20171208 192206.875   Content           
20171208 192206.875 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c000c40
20171208 192206.875   Content           
20171208 192206.875 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c000e40
20171208 192206.875   Content           
20171208 192206.875 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c001060
20171208 192206.875   Content           
20171208 192206.875 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c001130
20171208 192206.876   Content           
20171208 192206.876 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c001190
20171208 192206.876   Content           
20171208 192206.876 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c0011b0
20171208 192206.876   Content           
20171208 192206.876 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c0012f0
20171208 192206.876   Content           
20171208 192206.876 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c001870
20171208 192206.876   Content           
20171208 192206.876 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c001a60
20171208 192206.876   Content           
20171208 192206.876 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c001cd0
20171208 192206.876   Content           
20171208 192206.876 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c001d60
20171208 192206.876   Content           
20171208 192206.876 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c002260
20171208 192206.876   Content           
20171208 192206.876 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7fbe6c002280
20171208 192206.876   Content           
20171208 192206.876 Heap scan end
20171208 192206.895 MQTT mem current 352, max 17090864
20171208 192206.895 verdict fail
icraggs commented 6 years ago

Thanks Juergen. I can get that symptom too - I'm working on it. Is that with QoS 1 or 0?

JuergenKosel commented 6 years ago

It looks like, it happens with both (or even all) QoS settings.

icraggs commented 6 years ago

Thanks

icraggs commented 6 years ago

I've added some changes to the 373 branch which fix this for me, on QoS 1.

JuergenKosel commented 6 years ago

Hello,

in my test_issue373 I still see some memory lost with QoS=1:

20171212 095830.050 connectCnt 5

20171212 095838.004 PublishCnt 132038, FailedCnt 2, Pending -1 maxPending 65534
20171212 095838.004 MQTT mem current 1392, max 27233456
20171212 095838.004 PublishCnt 132038, FailedCnt 2, Pending -1 maxPending 65534
20171212 095838.004 MQTT mem current 1536, max 27233456
20171212 095838.104 Some memory not freed at shutdown, possible memory leak
20171212 095838.104 Heap scan start, total 96 bytes
20171212 095838.104 Heap element size 16, line 704, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7f70a8002cc0
20171212 095838.104   Content           
20171212 095838.104 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7f70a80032d0
20171212 095838.104   Content           
20171212 095838.104 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7f70a82a8040
20171212 095838.104   Content           
20171212 095838.104 Heap element size 16, line 696, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7f70a8767710
20171212 095838.104   Content           
20171212 095838.104 Heap element size 16, line 704, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7f70a876a970
20171212 095838.104   Content           
20171212 095838.104 Heap element size 16, line 231, file /home/klj/org.eclipse.paho.mqtt.c/src/MQTTPacket.c, ptr 0x7f70a876a990
20171212 095838.104   Content           
20171212 095838.104 Heap scan end
20171212 095838.170 MQTT mem current 96, max 27233456
icraggs commented 6 years ago

Thanks. Actually, it's the same for me - I must have just been un/lucky yesterday :-)

icraggs commented 6 years ago

Juergen, I've added some further changes to the 373 branch, for when you would like to try them out. Thanks.

JuergenKosel commented 6 years ago

Hello @icraggs

I have merged in your branch up to commit 5edc4db34eabe78eb206e4cb429d3b3426bccfea , increased the number of iterations in my test for QoS 1 and 2 to 15 disconnects. Now the final results looks much better:

20171213 175234.497 connectCnt 100

20171213 175242.409 PublishCnt 940559, FailedCnt 38991, Pending 7 maxPending 51403
20171213 175242.409 MQTT mem current 105040, max 21374416
20171213 175242.409 PublishCnt 940559, FailedCnt 38991, Pending 7 maxPending 51403
20171213 175242.409 MQTT mem current 105184, max 21374416
20171213 175242.516 MQTT mem current 0, max 21374416
20171213 175242.516 verdict pass

I have also tested with 3 embedded devices. Each of them publishing with another QoS setting. The 2 devices publishing with QoS 0 and 1 have published more than 2 million messages over night. The one which publishes with QoS 2 has published lesss than 30.000 messages. None of the devices has crashed.

So I believe that that we have now fixes for both issues #373 and #385 .

JuergenKosel commented 6 years ago

Hello,

my college has just told me, that the device, which was publishing with QoS 2, has been restarted. I believe, that this has been caused by the overload for the huge amount of messages it has to keep for sending. During the last days we have also seen, that the mosquitto broker has been terminated due to our tests.

icraggs commented 6 years ago

I'm not sure what action you are suggesting needs to be taken, if any, on your last comment. With QoS 2 and 1 it's important that clients clean up any inflight messages by connecting with clean session at some point - that's just an observation I would make.

JuergenKosel commented 6 years ago

Hello,

our embedded application reconnects without clean session (as in test_issue373). In addition the mosquitto broker (1.4.10, Debian Stretch) crashed/terminated during the test. I do not know, if this could cause a problem regarding the not yet confirmed message ids with QoS2. (Does the protocol provide the option to the broker, to force clean session for the first (re)connect after broker restart?)

And to make it even more worse, our application lacks on flow control: Messages could be faster queued for sending than send through the "unstable" connection to the broker. Even with QoS=0 it was possible to increase the workload, so that more messages per connect/disconnect cycle are queued for sending than send during this time.

Which means, that I must add flow control into my application - not into this mqtt library.

JuergenKosel commented 6 years ago

Hello,

it looks like, that my commit 779ca42a46afd3c80bd90a2bac68c8412b0dfc4b reduces the performance drastically. But on the other hand, this commit prevents, that a list is freed in Socket_getReadySocket() which is currently in use in Socket_noPendingWrites(). Has anybody an idea, how to prevent memory corruption without big performance penalty here?

Greetings Juergen

icraggs commented 6 years ago

Hi Jurgen,

this is not a commit I'm currently using. I would try putting the Thread_lock_mutex(socket_mutex);/Thread_unlock_mutex pair on the outside of the

while (ListNextElement(commands, &cur_command)) { }

loop, and see if that helps.

JuergenKosel commented 6 years ago

Hello Ian,

I have already moved the Thread_lock_mutex(socket_mutex);/Thread_unlock_mutex pair on the outside of the while. But there is still a performance penalty. :-(

icraggs commented 6 years ago

Can you quantify the performance decrease?

JuergenKosel commented 6 years ago

Hello,

our tester told me, that 1000 topics were published in less than 1 second before this change. After this change, the 1000 topics are published in 6 to 10 seconds.

icraggs commented 6 years ago

Experiment with changing the timeout value where it is set

    MQTTAsync_unlock_mutex(mqttasync_mutex);
    pack = MQTTAsync_cycle(&sock, timeout, &rc);
    MQTTAsync_lock_mutex(mqttasync_mutex);
    if (tostop)
        break;
    timeout = 1000L;   // <- here

in MQTTAsync_receiveThread.