Azure / azure-iot-sdk-c

A C99 SDK for connecting devices to Microsoft Azure IoT services
https://azure.github.io/azure-iot-sdk-c
Other
588 stars 737 forks source link

Crash after network disconnection (AMQP/WS, x509) #310

Closed ghost closed 6 years ago

ghost commented 7 years ago

same setup/environment/repro steps as in #251 and #274

Although it's hard to believe it seems that immediatly after fixing issues #251 and #274 the problem reappeared.

...
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
IoTHubClient_LL_SendEventAsync accepted message [134] for transmission to IoT Hub.
Error: Time:Thu Nov 23 15:41:21 2017 File:azure-iot-sdk-c/uamqp/src/connection.c Func:connection_encode_frame Line:1841 Connection not open
*** Error in `./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509': munmap_chunk(): invalid pointer: 0x000000000252af28 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fee746677e5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7fee74674698]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(link_transfer_async+0x38a)[0x466d0d]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509[0x46c2d0]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(messagesender_send_async+0x31a)[0x46cece]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509[0x429ab8]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509[0x42a14b]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(telemetry_messenger_do_work+0x1ae)[0x42b62a]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(device_do_work+0x9a2)[0x423283]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509[0x41d632]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(IoTHubTransport_AMQP_Common_DoWork+0x294)[0x41e565]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509[0x41b48c]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(IoTHubClient_LL_DoWork+0x154)[0x418494]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(main+0x4bd)[0x413762]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fee74610830]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(_start+0x29)[0x412f99]
ewertons commented 7 years ago

@davidbuechi , it's unfortunate you are hitting this crash. The callstack you posted seems different than the previous one though. Would you have full, verbose logs as well (from the beginning of the execution to the crash)?

ghost commented 7 years ago
Starting the IoTHub client sample x509...
Info: IoT Hub SDK for C, version 1.1.28
Info: Retry policy set (5, timeout = 0)
Info: Retry policy set (1, timeout = 0)
IoTHubClient_LL_SetMessageCallback...successful.
IoTHubClient_LL_SendEventAsync accepted message [0] for transmission to IoT Hub.
-> Header (AMQP 0.1.0.0)
<- Header (AMQP 0.1.0.0)
-> [OPEN]* {46eadd1b-42e8-403b-8fac-431da4abb2a3,xxxx.azure-devices.net,4294967295,65535,240000}
<- [OPEN]* {DeviceGateway_c20f0fb826164fa991b1fd8943f99f77,10.0.8.29,65536,8191,240000,NULL,NULL,NULL,NULL,NULL}
-> [BEGIN]* {NULL,0,4294967295,100,4294967295}
<- [BEGIN]* {0,1,5000,4294967295,262143,NULL,NULL,NULL}
-> [ATTACH]* {link-snd-xxxx-7ae6e5be-6527-49a6-a256-63c9a6a696b9,0,false,0,0,* {link-snd-xxxx-7ae6e5be-6527-49a6-a256-63c9a6a696b9-source},* {amqps://xxxx.azure-devices.net/devices/xxxx/messages/events},NULL,NULL,0,18446744073709551615,NULL,NULL,{[com.microsoft:client-version:iothubclient/1.1.28 (native; Linux; x86_64)]}}
-> [ATTACH]* {link-snd-xxxx-dec47ff2-f9fe-4ee2-8f0c-5ec777af8978,1,false,0,0,* {link-snd-xxxx-dec47ff2-f9fe-4ee2-8f0c-5ec777af8978-source},* {amqps://xxxx.azure-devices.net/devices/xxxx/twin},NULL,NULL,0,18446744073709551615,NULL,NULL,{[com.microsoft:client-version:iothubclient/1.1.28 (native; Linux; x86_64)],[com.microsoft:channel-correlation-id:twin:5950e4c6-e14e-4bfe-945d-071f9289721d],[com.microsoft:api-version:2016-11-14]}}
<- [ATTACH]* {link-snd-xxxx-7ae6e5be-6527-49a6-a256-63c9a6a696b9,0,true,0,NULL,* {link-snd-xxxx-7ae6e5be-6527-49a6-a256-63c9a6a696b9-source,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL},* {amqps://xxxx.azure-devices.net/devices/xxxx/messages/events,NULL,NULL,NULL,NULL,NULL,NULL},NULL,NULL,NULL,1048576,NULL,NULL,{[com.microsoft:client-version:iothubclient/1.1.28 (native; Linux; x86_64)]}}
<- [FLOW]* {0,5000,1,4294967295,0,0,1000,0,NULL,false,NULL}
-> [ATTACH]* {link-rcv-xxxx-64054521-3db0-425d-8b5e-3b76923dc2c2,2,true,0,0,* {amqps://xxxx.azure-devices.net/devices/xxxx/messages/devicebound},* {link-rcv-xxxx-64054521-3db0-425d-8b5e-3b76923dc2c2-target},NULL,NULL,NULL,65536,NULL,NULL,{[com.microsoft:client-version:iothubclient/1.1.28 (native; Linux; x86_64)]}}
<- [ATTACH]* {link-snd-xxxx-dec47ff2-f9fe-4ee2-8f0c-5ec777af8978,1,true,1,0,* {link-snd-xxxx-dec47ff2-f9fe-4ee2-8f0c-5ec777af8978-source,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL},* {amqps://xxxx.azure-devices.net/devices/xxxx/twin,NULL,NULL,NULL,NULL,NULL,NULL},NULL,NULL,NULL,1048576,NULL,NULL,{[com.microsoft:client-version:iothubclient/1.1.28 (native; Linux; x86_64)],[com.microsoft:channel-correlation-id:twin:5950e4c6-e14e-4bfe-945d-071f9289721d],[com.microsoft:api-version:2016-11-14]}}
<- [FLOW]* {0,5000,1,4294967295,1,0,1000,0,NULL,false,NULL}
-> [ATTACH]* {link-rcv-xxxx-eb50b9f5-c80c-4769-92d8-90de5a731d71,3,true,0,0,* {amqps://xxxx.azure-devices.net/devices/xxxx/twin},* {link-rcv-xxxx-eb50b9f5-c80c-4769-92d8-90de5a731d71-target},NULL,NULL,NULL,18446744073709551615,NULL,NULL,{[com.microsoft:client-version:iothubclient/1.1.28 (native; Linux; x86_64)],[com.microsoft:channel-correlation-id:twin:5950e4c6-e14e-4bfe-945d-071f9289721d],[com.microsoft:api-version:2016-11-14]}}
<- [ATTACH]* {link-rcv-xxxx-64054521-3db0-425d-8b5e-3b76923dc2c2,2,false,NULL,1,* {amqps://xxxx.azure-devices.net/devices/xxxx/messages/devicebound,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL},* {link-rcv-xxxx-64054521-3db0-425d-8b5e-3b76923dc2c2-target,NULL,NULL,NULL,NULL,NULL,NULL},NULL,NULL,0,65536,NULL,NULL,{[com.microsoft:client-version:iothubclient/1.1.28 (native; Linux; x86_64)]}}
-> [FLOW]* {1,4294967295,0,100,2,0,10000}
<- [ATTACH]* {link-rcv-xxxx-eb50b9f5-c80c-4769-92d8-90de5a731d71,3,false,1,0,* {amqps://xxxx.azure-devices.net/devices/xxxx/twin,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL},* {link-rcv-xxxx-eb50b9f5-c80c-4769-92d8-90de5a731d71-target,NULL,NULL,NULL,NULL,NULL,NULL},NULL,NULL,0,1048576,NULL,NULL,{[com.microsoft:client-version:iothubclient/1.1.28 (native; Linux; x86_64)],[com.microsoft:channel-correlation-id:twin:5950e4c6-e14e-4bfe-945d-071f9289721d],[com.microsoft:api-version:2016-11-14]}}
-> [FLOW]* {1,4294967295,0,100,3,0,10000}
-> [TRANSFER]* {0,0,<01 00 00 00>,2147563264,false,false}
IoTHubClient_LL_SendEventAsync accepted message [1] for transmission to IoT Hub.
-> [TRANSFER]* {0,1,<02 00 00 00>,2147563264,false,false}
<- [DISPOSITION]* {true,0,NULL,true,* {},NULL}
Confirmation[0] received for message tracking id = 0 with result = IOTHUB_CLIENT_CONFIRMATION_OK
<- [DISPOSITION]* {true,1,NULL,true,* {},NULL}
Confirmation[1] received for message tracking id = 1 with result = IOTHUB_CLIENT_CONFIRMATION_OK
IoTHubClient_LL_SendEventAsync accepted message [2] for transmission to IoT Hub.
-> [TRANSFER]* {0,2,<03 00 00 00>,2147563264,false,false}
<- [DISPOSITION]* {true,2,NULL,true,* {},NULL}
Confirmation[2] received for message tracking id = 2 with result = IOTHUB_CLIENT_CONFIRMATION_OK
IoTHubClient_LL_SendEventAsync accepted message [3] for transmission to IoT Hub.
-> [TRANSFER]* {0,3,<04 00 00 00>,2147563264,false,false}
<- [DISPOSITION]* {true,3,NULL,true,* {},NULL}
Confirmation[3] received for message tracking id = 3 with result = IOTHUB_CLIENT_CONFIRMATION_OK

Network disconnection

IoTHubClient_LL_SendEventAsync accepted message [4] for transmission to IoT Hub.
-> [TRANSFER]* {0,4,<05 00 00 00>,2147563264,false,false}
IoTHubClient_LL_SendEventAsync accepted message [5] for transmission to IoT Hub.
-> [TRANSFER]* {0,5,<06 00 00 00>,2147563264,false,false}
IoTHubClient_LL_SendEventAsync accepted message [6] for transmission to IoT Hub.
-> [TRANSFER]* {0,6,<07 00 00 00>,2147563264,false,false}
IoTHubClient_LL_SendEventAsync accepted message [7] for transmission to IoT Hub.
-> [TRANSFER]* {0,7,<08 00 00 00>,2147563264,false,false}
...
IoTHubClient_LL_SendEventAsync accepted message [131] for transmission to IoT Hub.
-> [TRANSFER]* {0,131,<84 00 00 00>,2147563264,false,false}
IoTHubClient_LL_SendEventAsync accepted message [132] for transmission to IoT Hub.
-> [TRANSFER]* {0,132,<85 00 00 00>,2147563264,false,false}
IoTHubClient_LL_SendEventAsync accepted message [133] for transmission to IoT Hub.
-> [TRANSFER]* {0,133,<86 00 00 00>,2147563264,false,false}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
...
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
-> [CLOSE]* {* {amqp:internal-error,No frame received for the idle timeout}}
IoTHubClient_LL_SendEventAsync accepted message [134] for transmission to IoT Hub.
Error: Time:Thu Nov 23 15:41:21 2017 File:azure-iot-sdk-c/uamqp/src/connection.c Func:connection_encode_frame Line:1841 Connection not open
*** Error in `./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509': munmap_chunk(): invalid pointer: 0x000000000252af28 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fee746677e5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7fee74674698]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(link_transfer_async+0x38a)[0x466d0d]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509[0x46c2d0]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(messagesender_send_async+0x31a)[0x46cece]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509[0x429ab8]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509[0x42a14b]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(telemetry_messenger_do_work+0x1ae)[0x42b62a]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(device_do_work+0x9a2)[0x423283]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509[0x41d632]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(IoTHubTransport_AMQP_Common_DoWork+0x294)[0x41e565]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509[0x41b48c]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(IoTHubClient_LL_DoWork+0x154)[0x418494]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(main+0x4bd)[0x413762]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fee74610830]
./iothub_client/samples/iothub_client_sample_x509/iothub_client_sample_x509(_start+0x29)[0x412f99]
ewertons commented 6 years ago

Hi @davidbuechi , we got a repro. We will work on the fix and get it released soon.

ghost commented 6 years ago

Hi @ewertons I assume soon means that you are targeting 2017-12-01?

ewertons commented 6 years ago

Yes David, that is correct. We are working on this issue as we speak.

ewertons commented 6 years ago

@davidbuechi , the issue has been root-caused. It relates to https://github.com/Azure/azure-uamqp-c/issues/197

I will reply once the fix is checked in and available on our repo.

ewertons commented 6 years ago

As a follow up on a side issue, we are currently enabling additional sets of reconnection tests to prevent issues like this to make their way into our repo. We deeply apologize for the disruption.

ewertons commented 6 years ago

@davidbuechi , a fix has been checked into the master branch of azure-iot-sdk-c. Would you be able to validate it from your side as well?

ghost commented 6 years ago

Our plan is to start with our tests on monday. We'll keep you updated as soon as we have first results.

ghost commented 6 years ago

@ewertons to more precise: we will start our tests as soon as there is an official release.

In the past we did test some preliminary bugfixes and in the following release the functionality was broken again (see #274).

ewertons commented 6 years ago

Hi @davidbuechi , we had to reschedule our regular release that was supposed to start last Friday 12/1. There are changes we are making to our gated builds to improve test coverage. It means the next release will most likely happen only on 12/15. If you have chance to verify the functionality out of master, let us know. Thanks.

ghost commented 6 years ago

Hi @ewertons we will start testing 3ff0969 today.

BTW: we had a very strange issue with 2017-11-17: Connection was established and without transferring any messages the memory consumption started growing out of nowhere with a rate of almost 100MB per hour. After the oom killer terminated the process everything was fine again. We will test if this is reproducable with the current master

ewertons commented 6 years ago

Awesome, thanks @davidbuechi .

Please let us know about the issue you described. We do run valgrind on our solutions for every check in we make, so we guard well against memory leaks. The same is done for thread issues with helgrind.

ghost commented 6 years ago

Unfortunately I can't say more at the moment. We had OPTION_LOG_TRACE to false and the function registered by xlogging_set_log_function() did not report anything. Only our system monitor noticed the rapidly shrinking size of available memory, until the task got killed and restarted.

Oddly enough this happened on 2 devices more or less at the same time (on the same network) which makes me think that it maybe could be related to some kind of network irregularities...

ghost commented 6 years ago

from our point of view this issue can be closed. The original error (and the memory leaks mentioned above) seem to be fixed in the current master. Thx a lot!

ewertons commented 6 years ago

Thanks for letting us know, @davidbuechi . It is a pleasure to be able to offer a solution that attends the needs of our customers. Please let us know if we can assist in any other way.