Memory Leak in edgeHub module

BewaAutomatisierung-RD commented 5 years ago

Expected Behavior

The edgeHub module should be able to run for a long time and receive tons of messages without increasing memory consumption.

Current Behavior

Memory consumption of the edgeHub module increases constantly while messages are handled. Before handling messages, the memory consumption of edgeHub is about 115 MB. After 10,000 messages generated by tempSensorModule and passed through a SampleModule that just pipes the messages to IotHub, the memory consumption of edgeHub is about 150 MB. It does not decrease again after message sending stops but stays the same. Memory consumption of tempSensor and SampleModule increases slightly and temporarily but always decreases again.

The increasing memory consumption occurs on both a Raspberry Pi and an ARM64 virtual machine on Azure. On the Raspberry Pi, edgeHub eventually crashes with std::bad_alloc when the address space is exhausted.

Steps to Reproduce

  1. Create new IotEdge Solution in Visual Studio Code using the C# Module template
  2. Set MessageCount of tempSensor to 10000 and MessageDelay to "00:00:00.010"
  3. Deploy to Device

Context (Environment)

Device (Host) Operating System


Container Operating System

Runtime Versions


Edge Agent

Edge Hub



Additional Information

darobs commented 5 years ago

Hello @BewaControl-ReneDivossen

Some of that memory is probably pooled buffer allocation. The EdgeHub will allocate larger and larger pools of buffer memory as queues fill. I am not certain when or if those memory pools will be garbage collected, because it may continue to use these larger pools for quite a while.

One thing you definitely do not want to do is let EdgeHub allocate very large buffer pools on the Raspberry Pi. If you are not setting "OptimizeForPerformance" to false on Pi deployments, you may see EdgeHub try to allocate large chunks of memory the system cannot accommodate. Please double-check that.

So, I was talking with my teammates, and we have some ideas which could alleviate memory pressure. We're going to follow up with the dotnet team to see if there are memory management options we can set, especially for constrained systems like the Pi.

However, it's also possible that we have a memory leak - so I'm going to put a task to check for memory leaks in our backlog.

BewaAutomatisierung-RD commented 5 years ago

Hello @darobs

The "OptimizeForPerformance" setting is set to false (on both the Raspberry Pi and the virtual machine). We also use a custom storage path for messages. Here is the deployment setting for the sample application for reproduction of the issue:

{
  "modulesContent": {
    "$edgeAgent": {
      "properties.desired": {
        "schemaVersion": "1.0",
        "runtime": {
          "type": "docker",
          "settings": {
            "minDockerVersion": "v1.25",
            "loggingOptions": "",
            "registryCredentials": {
              // removed from posting
            }
          }
        },
        "systemModules": {
          "edgeAgent": {
            "type": "docker",
            "settings": {
              "image": "",
              "createOptions": "{}"
            }
          },
          "edgeHub": {
            "type": "docker",
            "status": "running",
            "restartPolicy": "always",
            "settings": {
              "image": "",
              "createOptions": "{\"HostConfig\":{\"Binds\":[\"/etc/iotedge/storage/:/iotedge/storage/\"],\"PortBindings\":{\"5671/tcp\":[{\"HostPort\":\"5671\"}],\"8883/tcp\":[{\"HostPort\":\"8883\"}],\"443/tcp\":[{\"HostPort\":\"443\"}]}}}"
            },
            "env": {
              "OptimizeForPerformance": {
                "value": "false"
              },
              "storageFolder": {
                "value": "/iotedge/storage/"
              }
            }
          }
        },
        "modules": {
          "tempSensor": {
            "version": "1.0",
            "type": "docker",
            "status": "running",
            "restartPolicy": "always",
            "settings": {
              "image": "",
              "createOptions": "{}"
            },
            "env": {
              "MessageCount": {
                "value": 10000
              },
              "MessageDelay": {
                "value": "00:00:00.010"
              }
            }
          },
          "SampleModule": {
            "version": "1.0",
            "type": "docker",
            "status": "running",
            "restartPolicy": "always",
            "settings": {
              "image": "",
              "createOptions": "{}"
            }
          }
        }
      }
    },
    "$edgeHub": {
      "properties.desired": {
        "schemaVersion": "1.0",
        "routes": {
          "SampleModuleToIoTHub": "FROM /messages/modules/SampleModule/outputs/* INTO $upstream",
          "sensorToSampleModule": "FROM /messages/modules/tempSensor/outputs/temperatureOutput INTO BrokeredEndpoint(\"/modules/SampleModule/inputs/input1\")"
        },
        "storeAndForwardConfiguration": {
          "timeToLiveSecs": 7200
        }
      }
    }
  }
}

We are reaching a point where we have to question whether IotEdge can really be used in real-life industrial use cases. Our machines will be sending messages every 2 seconds, and they have to run 24/7 for long periods of time without us worrying about losing data. I'm under time pressure to get the messages to IotHub reliably, so it would be great to get an assessment soon of whether iotedge can be used in real-world industry scenarios and, if so, which hardware and configuration are needed to make it work.

I just found a report from March 18 stating there probably is a memory leak:

Please let me know as soon as possible if there really is an issue and if it can be fixed soon. If it can't, I'll have to look for an alternative. I'm still hoping that there is a configuration that makes it all work!

Thanks, René

BewaAutomatisierung-RD commented 5 years ago

Following a recommendation by @varunpuranik in this thread, I turned off AMQP. The EdgeHub then starts with about 57 MB of memory consumption and after 10,000 messages consumes about 97 MB. This means about 4 KB per message are allocated and never released. This is about the same as with AMQP and MQTT both turned on (increase from about 115 MB to 150 MB).
I also did another test with MQTT turned off and only AMQP enabled. The memory consumption increased as well. It just started higher (at about 106 MB).
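The per-message figure follows from the readings quoted above. A quick check in Python, assuming 1 MB = 1024 KB and using only the numbers reported in this thread:

```python
# Memory readings quoted in this thread, in MB (before, after 10,000 messages).
READINGS_MB = {
    "MQTT only (AMQP off)": (57, 97),
    "AMQP and MQTT on": (115, 150),
}
MESSAGES = 10_000


def growth_kb_per_message(before_mb: float, after_mb: float, messages: int) -> float:
    """Average retained memory per message in KB, assuming 1 MB = 1024 KB."""
    return (after_mb - before_mb) * 1024 / messages


for label, (before, after) in READINGS_MB.items():
    kb = growth_kb_per_message(before, after, MESSAGES)
    print(f"{label}: {kb:.1f} KB per message")
```

This gives roughly 4.1 KB per message with only MQTT enabled and 3.6 KB per message with both protocol heads enabled.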

varunpuranik commented 5 years ago

@BewaControl-ReneDivossen - We have long haul tests that run fine for several days with workloads similar to yours (we run each test for 7 days). We have also done testing where we were able to run IoT Edge for weeks without any issues under similar loads. As for the memory consumption - the memory usage of EdgeHub when both protocols are enabled typically puts a lot of pressure on a device like the Raspberry Pi. But with one protocol head turned off, the memory pressure goes down, as you see. The increase in memory usage is not necessarily a leak, but just the way the .NET GC behaves, allowing the app to consume more memory when available. The expectation is that this should stabilize at some point, after which the EdgeHub should be able to run steadily for long periods of time. It would be great if you could test this. Meanwhile, as @darobs suggested, we will once again check EdgeHub for any memory leaks. We are also looking at tweaking certain GC settings to see if we can get even better and more reliable performance from EdgeHub.

myagley commented 5 years ago

Are you actually seeing issues with message loss, errors, etc? Or, is this just a comment on the rising memory use? If there are issues with message loss or other errors, is it possible to get more detail?

As @darobs mentions above, the Edge Hub is implemented with .NET, which is a managed runtime with a garbage collector. Its memory use will not look like that of a native process with manual memory management. Rising memory in the Edge Hub does not mean a memory leak. As you mention, there was a true memory leak in the Edge Hub at the beginning of 2018, but this has been fixed.

There is an issue with bad_alloc on constrained devices with a 32-bit address space, and we are looking at ways to tune the dotnet garbage collection to optimize for memory footprint instead of collection performance on these platforms. This means collecting more often, which has a negative performance impact.

If there are real concerns about the Edge Hub using all of the memory on the device, you can constrain its memory use in the createOptions when deploying. This configuration can be provided if you are interested. With your message rate, you should be prepared to give the Edge Hub at least 1 GB of memory.
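As a rough sketch of what such a limit could look like: createOptions is a JSON document serialized into a string and passed to the Docker container-create API, whose HostConfig accepts a Memory field in bytes. The Python below adds a limit to the createOptions posted earlier in this thread. The helper name with_memory_limit and the 1 GiB value are illustrative assumptions, not configuration supplied by the team.

```python
import json

# createOptions for edgeHub as posted earlier in this thread (unescaped form).
POSTED_CREATE_OPTIONS = {
    "HostConfig": {
        "Binds": ["/etc/iotedge/storage/:/iotedge/storage/"],
        "PortBindings": {
            "5671/tcp": [{"HostPort": "5671"}],
            "8883/tcp": [{"HostPort": "8883"}],
            "443/tcp": [{"HostPort": "443"}],
        },
    }
}


def with_memory_limit(create_options: dict, limit_bytes: int) -> str:
    """Return createOptions as a JSON string with HostConfig.Memory set.

    Hypothetical helper: Memory is the Docker HostConfig field for a hard
    memory limit in bytes. The input dict is left unmodified.
    """
    options = json.loads(json.dumps(create_options))  # deep copy via JSON
    options.setdefault("HostConfig", {})["Memory"] = limit_bytes
    return json.dumps(options, separators=(",", ":"))


ONE_GIB = 1024 ** 3  # assumed value, in line with the "at least 1 GB" guidance

create_options_string = with_memory_limit(POSTED_CREATE_OPTIONS, ONE_GIB)
# Embedding the string in a deployment manifest escapes the inner quotes.
print(json.dumps({"createOptions": create_options_string}, indent=2))
```

The printed value can be pasted into the edgeHub "settings" section of a deployment manifest. Whether a hard limit is appropriate depends on the device, since a container that exceeds it may be killed rather than slowed down.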

BewaAutomatisierung-RD commented 5 years ago

After turning off protocol heads as described by @varunpuranik in this thread, it seems to work all right. Memory rises for a while but then seems to stabilize. Short-term tests with only MQTT enabled (3 days) and only AMQP enabled (1 day) produced no more bad_alloc crashes. If I see any further problems in long-term tests, I'll get back to you. I'll also stay subscribed to this thread to see if you find any more memory issues and/or a way to get rid of the bad_alloc errors even with all protocol heads enabled.

I also mentioned this in the other thread, but I'll copy it here: to prevent other developers from becoming frustrated, you should mention the need (or recommendation) to turn off protocol heads on arm32 devices in your installation guide or on your troubleshooting page. So far, turning off protocol heads seems to be an undocumented feature.

veyalla commented 5 years ago

Thank you for the feedback. The turning off of protocol heads is documented in the production guide:

If you're concerned about resource usage, it is also possible to set memory limits on the container via module create options.

I agree that this would be good info to have in the troubleshooting section for resource constrained devices. I'll update the page.

levi106 commented 5 years ago

I got a similar problem when I disabled outbound network to IoT Hub. If my understanding is correct, Visual Studio shows only live objects, so these objects will not be collected by GC.

[Image: appinsights dump]

It seems that one of the causes of the memory leak is that the event handler ConnectivityAwareClient.HandleDeviceConnectedEvent will not be removed from DeviceConnectivityManager.DeviceConnected.

Found 1 unique roots (run '!gcroot -all' to see all roots).
0:000> !do 000001294d24a4c0
Name:        Microsoft.Azure.Devices.Edge.Hub.CloudProxy.DeviceConnectivityManager
MethodTable: 00007ff98a6fcb30
EEClass:     00007ff98a73cef8
Size:        80(0x50) bytes
File:        C:\app\Microsoft.Azure.Devices.Edge.Hub.CloudProxy.dll
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff98b167e48  400003e        8 ....Hub.CloudProxy]]  0 instance 000001294d24b500 machine
00007ff98a763998  400003f       10  System.Timers.Timer  0 instance 000001294d24b350 connectedTimer
00007ff98a763998  4000040       18  System.Timers.Timer  0 instance 000001294d24b3e8 disconnectedTimer
00007ff98b149208  4000041       20 ...dentity.IIdentity  0 instance 000001294d24a3d8 testClientIdentity
00007ff98a6fc9d8  4000042       40         System.Int32  1 instance                2 state
00007ff98b1f1750  4000043       28 ...nnectivityChecker  0 instance 000001294d256ea0 connectivityChecker
00007ff9dee45f30  4000044       30  System.EventHandler  0 instance 0000012951807dc0 DeviceConnected
00007ff9dee45f30  4000045       38  System.EventHandler  0 instance 0000012951807e40 DeviceDisconnected
0:000> !do 0000012951807dc0
Name:        System.EventHandler
MethodTable: 00007ff9dee45f30
EEClass:     00007ff9de5351a0
Size:        64(0x40) bytes
File:        C:\Program Files\dotnet\shared\Microsoft.NETCore.App\2.1.6\System.Private.CoreLib.dll
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9dee80850  4000295        8        System.Object  0 instance 0000012951807dc0 _target
00007ff9dee80850  4000296       10        System.Object  0 instance 0000000000000000 _methodBase
00007ff9dee971f8  4000297       18        System.IntPtr  1 instance     7ff98a574090 _methodPtr
00007ff9dee971f8  4000298       20        System.IntPtr  1 instance     7ff9de5873c0 _methodPtrAux
00007ff9dee80850  40002a2       28        System.Object  0 instance 00000129516e2748 _invocationList
00007ff9dee971f8  40002a3       30        System.IntPtr  1 instance              815 _invocationCount
0:000> !do 00000129516e2748
Name:        System.Object[]
MethodTable: 00007ff9dee63878
EEClass:     00007ff9de544600
Size:        32792(0x8018) bytes
Array:       Rank 1, Number of elements 4096, Type CLASS
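A short Python check of the numbers in this dump, assuming SOS prints IntPtr field values in hexadecimal (as the _methodPtr values suggest) while the element count and object sizes are decimal:

```python
# Values copied from the !do output above.
INVOCATION_COUNT_HEX = "815"   # _invocationCount, assumed to be hexadecimal
ARRAY_ELEMENTS = 4096          # elements in the _invocationList Object[]
ARRAY_SIZE_BYTES = 32792       # reported size of that array
POINTER_SIZE = 8               # 64-bit process, judging by the addresses

handlers = int(INVOCATION_COUNT_HEX, 16)
overhead = ARRAY_SIZE_BYTES - ARRAY_ELEMENTS * POINTER_SIZE

print(f"subscribed handlers: {handlers}")
print(f"array slots in use:  {handlers}/{ARRAY_ELEMENTS}")
print(f"array overhead:      {overhead} bytes")
```

Under that assumption, the DeviceConnected delegate holds 2069 handlers, and the 4096-slot array accounts for 32768 bytes of references plus 24 bytes of overhead.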

levi106 commented 5 years ago

If the connection to the IoT Hub fails, the CloudConnectionProvider.Connect method returns an error.

2019-06-07 14:50:44.690 +00:00 [WRN] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection] - Error creating cloud connection for client myEdgeDevice/$edgeHub
2019-06-07 14:51:05.730 +00:00 [WRN] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection] - Error creating cloud connection for client myEdgeDevice/CSharpModule
2019-06-07 14:52:26.774 +00:00 [WRN] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection] - Error creating cloud connection for client myEdgeDevice/CSharpModule
2019-06-07 14:52:47.801 +00:00 [WRN] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection] - Error creating cloud connection for client myEdgeDevice/$edgeHub
2019-06-07 14:53:08.829 +00:00 [WRN] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection] - Error creating cloud connection for client myEdgeDevice/CSharpModule

Therefore, ConnectionManager.GetOrCreateCloudConnection calls the CloudConnectionProvider.Connect method every time. However, because ConnectivityAwareClient.HandleDeviceConnectedEvent is never removed from DeviceConnectivityManager.DeviceConnected, the ConnectivityAwareClient objects created by CloudConnectionProvider, along with their associated objects, continue to leak.
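The pattern described here can be reproduced in miniature. The sketch below is Python rather than the Edge Hub's C#, and the classes ConnectivityManager and Client are simplified stand-ins, not the real implementation. A long-lived publisher keeps a strong reference to every handler that subscribes, so a client that is never unsubscribed stays reachable and cannot be collected.

```python
import gc
import weakref


class ConnectivityManager:
    """Long-lived publisher (stand-in for DeviceConnectivityManager)."""

    def __init__(self):
        # Plays the role of the event's invocation list.
        self.device_connected = []

    def subscribe(self, handler):
        self.device_connected.append(handler)

    def unsubscribe(self, handler):
        self.device_connected.remove(handler)


class Client:
    """Short-lived subscriber (stand-in for ConnectivityAwareClient)."""

    def __init__(self, manager):
        self.manager = manager
        # The bound method keeps a strong reference to this client.
        manager.subscribe(self.on_connected)

    def on_connected(self):
        pass

    def close(self):
        # The step missing in the leak scenario: detach the handler.
        self.manager.unsubscribe(self.on_connected)


def live_clients_after(attempts, close_clients):
    """Create and drop `attempts` clients, then count those still alive."""
    manager = ConnectivityManager()
    refs = []
    for _ in range(attempts):
        client = Client(manager)
        refs.append(weakref.ref(client))
        if close_clients:
            client.close()
        del client
    gc.collect()
    return sum(1 for ref in refs if ref() is not None)


print("without unsubscribe:", live_clients_after(100, close_clients=False))
print("with unsubscribe:   ", live_clients_after(100, close_clients=True))
```

With 100 simulated connection attempts, all 100 clients remain alive when the handler is never removed, and none remain when each client unsubscribes before being dropped.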

varunpuranik commented 5 years ago

@levi106 - Thanks for reporting this. I will look into this asap.

levi106 commented 5 years ago

@varunpuranik Is there any update?

varunpuranik commented 5 years ago

@levi106 - The leaking of event handlers has been fixed in master - It will be part of the 1.0.9 release.

levi106 commented 5 years ago

I see. Thanks.

lt72 commented 4 years ago

Closing this issue as it seems resolved. Please re-open as needed.