Azure / azure-iot-sdk-c

A C99 SDK for connecting devices to Microsoft Azure IoT services
https://azure.github.io/azure-iot-sdk-c
Other
588 stars 740 forks source link

Segfault or abort VECTOR_push_back realloc failed during high frequency message generation #1521

Closed joehal9000 closed 4 years ago

joehal9000 commented 4 years ago

Development Machine, OS, Compiler

Oracle Linux Server release 7.7 using gcc 8.2.0, cmake 3.16.2

SDK Version

Release 2020-01-22

Protocol

MQTT

Describe the Bug We have encountered an issue where the program is killed by either an abort signal or a segmentation fault. This only happens when the program is sending a high frequency of messages. In these experiments, we were sending messages at a rate of 100/second. We ran 2 versions: one with 40 devices simultaneously and another with 10 devices. Both times they crashed. Both of these errors, according to the stack trace, happen in the SDK itself. The data structure within the SDK that keeps track of event confirmation callbacks from IoT Hub fails because of a realloc failure. This makes us believe that the program is running out of memory despite having checked for memory leaks.

We thought that it was just the amount of messages that was causing the problem. So we ran another experiment that sent the same amount of messages (300,000) over a much longer time. In this test, we were sending 10 messages a second until we hit 300,000, which took about 30 minutes. In this case, all 10 devices finished without error and with reasonable latencies.

Example Code

// standard libraries
#include <string>
#include <iostream>
#include <fstream>
#include <cstdlib> // for exit() and srand and atof
#include <cstring> // for strcpy, strlen
#include <ctime> // for converting time_t to string
#include <map>
#include <vector>
#include <deque>
#include <iterator>
#include <chrono> // for measuring time
#include <thread> // for sleeping
#include <algorithm> // for std::remove

// Azure libraries
#include "iothub.h"
#include "iothub_client.h"
#include "iothub_device_client.h" //Change from _LL
#include "iothub_client_options.h"
#include "iothub_message.h"
#include "azure_c_shared_utility/threadapi.h"
#include "azure_c_shared_utility/crt_abstractions.h"
#include "azure_c_shared_utility/shared_util_options.h"
#include "parson.h"

// using mqtt
#include "iothubtransportmqtt.h"

// for certs but can prob be removed
#ifdef SET_TRUSTED_CERT_IN_SAMPLES
    #include "certs.h"
#endif // SET_TRUSTED_CERT_IN_SAMPLES

using namespace std;

// Struct representing a message sent to iothub with a tracking id
struct IotHubMessage {
    IOTHUB_MESSAGE_HANDLE messageHandle;
    int id;
    IotHubMessage(IOTHUB_MESSAGE_HANDLE imh, int the_id) {
        messageHandle = imh;
        id = the_id;
    }
};

static void send_confirm_callback(IOTHUB_CLIENT_CONFIRMATION_RESULT result, void* userContextCallback) {
    // do some latency calculations
}

int main(int argc, char** argv) {
    // initialize the IotHub SDK
    (void)IoTHub_Init();

    IOTHUB_CLIENT_TRANSPORT_PROVIDER protocol = MQTT_Protocol;
    IOTHUB_DEVICE_CLIENT_HANDLE device_handle = IoTHubDeviceClient_CreateFromConnectionString(argv[1], protocol); // pass conn string through cmd line

    #ifdef SET_TRUSTED_CERT_IN_SAMPLES
    IoTHubDeviceClient_SetOption(device_handle, OPTION_TRUSTED_CERT, certificates);
    #endif

    bool trace_on = true;
    IoTHubClient_SetOption(device_handle, OPTION_LOG_TRACE, &trace_on);

    bool urlEncodeOn = true;
    IoTHubDeviceClient_SetOption(device_handle, OPTION_AUTO_URL_ENCODE_DECODE, &urlEncodeOn);

    int sendCount = 0;
    int total_msgs = 30000;

    while(sendCount < total_msgs) {
        // send message
        char *telemetry_msg = "i am a telemetry msg. usually i look like sensor data in json format";

        IotHubMessage *iothubMsg = new IotHubMessage(IoTHubMessage_CreateFromString(telemetry_msg), sendCount);

        // send the message to IoTHub
        IoTHubDeviceClient_SendEventAsync(device_handle, iothubMsg->messageHandle, send_confirm_callback, iothubMsg); // we use our 

        int intermessage = 10; // send message once every 10ms --> 100 msgs/sec

        // wait before generating data again
        this_thread::sleep_for(chrono::milliseconds(intermessage));
    }

    // Clean up the iothub sdk handle
    IoTHubDeviceClient_Destroy(device_handle);

    // Free all the sdk subsystem
    IoTHub_Deinit();
}

GDB and Logs We attached screenshots of gdb reading the core dump. As well as the stack trace from running where in gdb. gdbLog2_edited gdbWhere2_edited

Finally, we put the stdout in this gist.

Contacts This issue was co-written with @dndsmith.

@aapon00 @kfprugger

dndsmith commented 4 years ago

If it helps clarify things, this is a very condensed version of our actual program (which measures RTT latency). Thus, some of the libraries included are leftover from the full program and aren't important here (my apologies, needed to get it running quick). Also, when we say device, we mean each instance of the program is a device connecting to IoTHub. We have multiple processes i.e. devices running simultaneously.

ewertons commented 4 years ago

Hi @joehal9000 , @dndsmith , thanks for reporting this issue. We have long-haul tests running and haven't experienced this problem, although our test runs with much less load than yours. We will follow your description to verify. In the meantime, would it be possible for you to run your sample under valgrind? It will give much more details about where the failure might be.

valgrind --leak-check=full --track-origins=yes <executable>

Please compile the binary with debug information.

dndsmith commented 4 years ago

Hi @ewertons , Thanks for responding to our issue. Our case is an uncommon scenario, so again, thanks.

Here is a gist with the output from running valgrind on the sample. In my haste before, I neglected to check memory. So I had to add the following lines at the end of the while loop:

    IoTHubMessage_Destroy(iothubMsg->messageHandle);
    delete iothubMsg;
    sendCount++;
}

Also, it seems that it takes a couple minutes before the code segfaults. In our sample code, we run for 30,000 total messages at 100 msgs/sec, which is ~5 min. In another sample run with only 6,000 total messages also at 100 msgs/sec (~1 min), valgrind reports no segfault.

ewertons commented 4 years ago

That's great, @dndsmith , thank you very much for the level of detail you provided. We will root cause and reply back soon.

dndsmith commented 4 years ago

Hi @ewertons,

Glad the information helped. I have more information for you. It is either good or bad, depending on how you look at it.

We tested the program on a different machine to see if we got the same results. No segmentation fault. We ran the same experiment (30,000 msgs @ 100msgs/sec) and even doubled the number of messages to 60,000. No segmentation fault.

Development Machine, OS, Compiler Ubuntu 16.04 LTS, gcc 5.4.0, cmake 3.5.1. Versions are older because of Ubuntu 16.04, but we can try matching the gcc + cmake versions as we had before if that helps.

SDK Version I cloned the repository, so I believe that retrieves the latest version? (LTS_02_2020_Ref1?) I can try installing the 2020-01-22 (which is what we had before) to see if that would reproduce the error.

Protocol MQTT

ewertons commented 4 years ago

@dndsmith , sorry but we got reassigned to deal with new development, and follow ups got delayed. Apologies for that.

Well, the information you provided about Ubuntu matches what we verify on the gated builds we have.

However it still poses a question about why you saw the segmentation faults on 'Oracle Linux Server release 7.7 using gcc 8.2.0, cmake 3.16.2'. Just to try to eliminate any variability, could you check if you still get the same SEGFAULTS on Oracle Linux Server, and if hey go away using the latest release of the C SDK? https://github.com/Azure/azure-iot-sdk-c/releases/tag/LTS_07_2020_Ref01

dndsmith commented 4 years ago

Hi @ewertons ,

No worries. We have had successful experiments regardless.

In the meantime, I need to explain something. I use the word "experiments" because we have been doing research. I didn't find a need to mention it before, but the background will be useful now.

We programmed a synthetic data generator in C++ to model a device, and we utilized a high performance computing cluster so we could easily scale the number of devices. This is extremely unfortunate, but the cluster is changing their OS from Oracle Linux Server 7.7. to CentOS 8.2 throughout this week (bad timing, I know!). Thus, it would be difficult for us to test on Oracle Linux again unless we provisioned a VM in the cloud.

However, we could potentially test on CentOS 8.2 after the update is complete. If the code segfaults, such a failure could suggest the problem is related to the cluster (e.g. network security mechanisms or shared resources).

ewertons commented 4 years ago

@dndsmith ,

thanks for sharing the details. Looks like you folks are doing some very interesting things there. Very cool! We understand that conditions change, so it is ok. Of course if you could try the same on the new system and let us know that will help a lot. It will be a lot easier than us reserving resources to deploy an Oracle LS. For the sake of the time this issue has been open, do you believe you will be able to follow up with the results on the new system within one or two weeks? Otherwise, We'd ask if it is possible to close it and reopen it when you have more data.

Thanks, Azure SDK Team.

joehal9000 commented 4 years ago

Thank you for working with us on this. The new system will be up next week, so we will be able to follow up on this then. Of course feel free to close this until then.

joehal9000 commented 4 years ago

The server that we're using is still ongoing maintenance post updates. We're closing this issue for the sake of time and we'll reopen it once we are able to test the new system and offer new data.

az-iot-builder-01 commented 4 years ago

@dndsmith, @joehal9000, thank you for your contribution to our open-sourced project! Please help us improve by filling out this 2-minute customer satisfaction survey

joehal9000 commented 4 years ago

Development Machine, OS, Compiler

CentOS Linux release 8.2.2004 (Core) gcc 8.2.0, cmake 3.16.2

SDK Version

LTS_07_2020_Ref01

Protocol

MQTT

We ran the same sample program as before twice but using the new release on a CentOS 8 machine. Once sending 30,000 messages at 100 msgs/seconds, and another sending 60,000 messages at the same rate. Neither segfaulted.

The valgrind output from both runs are on this gist.

Thank you for working with us on this.

shantanu1singh commented 4 years ago

I believe I'm facing the same problem. I'm unable to repro the error when I run the executable using valgrind however. I'm sending 20k messages using the C SDK in quick succession.

Here's the error call stack I get when I run the executable after building it with debug information

Error: Time:Wed Aug 19 16:18:39 2020 File:/home/shantanu/repos/azure-iot-sdk-c/c-utility/src/vector.c Func:VECTOR_push_back Line:109 realloc failed.
Error: Time:Wed Aug 19 16:18:39 2020 File:/home/shantanu/repos/azure-iot-sdk-c/iothub_client/src/iothub_client_core.c Func:iothub_ll_event_confirm_callback Line:452 event confirm callback vector push failed.
double free or corruption (out)
Aborted

Any tips on how we can figure out what's going on?

Development Machine, OS, Compiler

Ubuntu 18.04 on WSL, gcc 7.5.0, cmake version 3.18.1

SDK Version

LTS_07_2020_Ref01

Protocol

MQTT

dndsmith commented 4 years ago

I only just saw this. Um, there's not much advice I can give you besides lowering the frequency at which you send messages. We received errors similar to that and resolved them by not sending as many messages a second (e.g. don't send 100 msgs/sec. 10 msgs/sec is more reasonable). We never fixed the issue causing segfaults. It could only be reproduced in our specific environment (which was on an HPC cluster with Oracle Linux). From our investigations the issue seemed to be that the vector in this client-side code was receiving callback events from IoT Hub faster than it could handle and therefore causing it to run out of memory.