Azure / iotedge

The IoT Edge OSS project
MIT License

Memory Leak in EdgeHub module #6997

Open KhantNayan opened 1 year ago

KhantNayan commented 1 year ago

Expected Behavior

Memory utilization by EdgeHub module should be constant

Current Behavior

Memory utilization by EdgeHub module is increasing

Steps to Reproduce

  1. Run EdgeHub module for 2-3 days
  2. Observe memory utilization of the EdgeHub module
  3. Restart EdgeHub module and check memory utilization
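For reference, steps 2 and 3 can be done with commands along these lines (a minimal sketch; the module name edgeHub is assumed from the stats output below):

```bash
# Step 2: one-shot snapshot of the module's memory usage
sudo docker stats --no-stream edgeHub

# Step 3: restart the module, then take another snapshot and compare
sudo iotedge restart edgeHub
sudo docker stats --no-stream edgeHub
```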

Context (Environment)

Output of iotedge check

damen@damen:~$ sudo iotedge check
[sudo] password for damen:

Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
‼ aziot-identity-service package is up-to-date - Warning
    Installed aziot-identity-service package has version 1.4.1 but 1.4.3 is the latest stable version available.
    Please see https://aka.ms/aziot-update-runtime for update instructions.
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
‼ aziot-edge package is up-to-date - Warning
    Installed IoT Edge daemon has version 1.4.3 but 1.4.9 is the latest stable version available.
    Please see https://aka.ms/iotedge-update-runtime for update instructions.
√ container time is close to host time - OK
√ DNS server - OK
‼ production readiness: logs policy - Warning
    Container engine is not configured to rotate module logs which may cause it run out of disk space.
    Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
    You can ignore this warning if you are setting log policy per module in the Edge deployment.
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
√ Agent image is valid and can be pulled from upstream - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
√ container on the default network can connect to upstream AMQP port - OK
√ container on the default network can connect to upstream HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to upstream AMQP port - OK
√ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK

32 check(s) succeeded.
3 check(s) raised warnings. Re-run with --verbose for more details.
2 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.
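Not related to the leak itself, but for the "logs policy" warning above, container-engine log rotation can be configured roughly like this (a sketch; the size/count values are illustrative, and any existing daemon.json settings should be merged rather than overwritten):

```bash
# Illustrative only: enable module log rotation on the container engine.
# NOTE: this overwrites /etc/docker/daemon.json; merge manually if the file already exists.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
sudo systemctl restart docker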

Device Information

Runtime Versions

Additional Information

Memory usage increases by 10-20 MB every 4 hours. Below is the memory utilization of the EdgeHub module:

******************************************
*   Mon Mar 13 16:00:01 UTC 2023   *
******************************************
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
737343ad5d99   devicemonitor            1.11%     1.906MiB / 500MiB     0.38%     27.4kB / 30.1kB   0B / 0B          3
3d3b2f853eb8   edgeHub                  0.04%     163.4MiB / 500MiB     32.68%    26.6MB / 32.6MB   4.1kB / 2.45GB   25
d3cc99067218   devicemanagement         1.11%     2.23MiB / 500MiB      0.45%     0B / 0B           0B / 0B          3
c6025fa13fcc   edgeAgent                0.00%     51.72MiB / 500MiB     10.34%    561kB / 600kB     8.19kB / 291kB   18
******************************************
*   Mon Mar 13 20:00:01 UTC 2023   *
******************************************
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
737343ad5d99   devicemonitor            1.11%     1.91MiB / 500MiB      0.38%     55.4kB / 64.3kB   0B / 0B          3
3d3b2f853eb8   edgeHub                  0.15%     182.7MiB / 500MiB     36.54%    38.7MB / 49.9MB   4.1kB / 3.31GB   25
d3cc99067218   devicemanagement         1.14%     2.223MiB / 500MiB     0.44%     0B / 0B           0B / 0B          3
c6025fa13fcc   edgeAgent                0.00%     50.85MiB / 500MiB     10.17%    1.21MB / 1.32MB   8.19kB / 537kB   17
******************************************
*   Tue Mar 14 00:01:01 UTC 2023   *
******************************************
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
737343ad5d99   devicemonitor            1.10%     1.914MiB / 500MiB     0.38%     78.5kB / 92.2kB   0B / 0B          3
3d3b2f853eb8   edgeHub                  0.04%     197.6MiB / 500MiB     39.52%    48.3MB / 63.8MB   4.1kB / 3.99GB   24
d3cc99067218   devicemanagement         1.16%     2.227MiB / 500MiB     0.45%     0B / 0B           0B / 0B          3
c6025fa13fcc   edgeAgent                1.37%     51.69MiB / 500MiB     10.34%    1.72MB / 1.9MB    8.19kB / 733kB   18
******************************************
*   Tue Mar 14 04:01:01 UTC 2023   *
******************************************
CONTAINER ID   NAME                     CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
737343ad5d99   devicemonitor            1.11%     1.922MiB / 500MiB     0.38%     101kB / 120kB     0B / 0B          3
3d3b2f853eb8   edgeHub                  0.06%     211.9MiB / 500MiB     42.38%    58MB / 77.8MB     4.1kB / 4.68GB   26
d3cc99067218   devicemanagement         1.07%     2.23MiB / 500MiB      0.45%     0B / 0B           0B / 0B          3
c6025fa13fcc   edgeAgent                0.00%     51.9MiB / 500MiB      10.38%    2.23MB / 2.47MB   8.19kB / 930kB   18
******************************************
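For context, snapshots like the above can be produced with a simple periodic capture, e.g. a cron-driven script along these lines (a sketch; not necessarily how the data above was collected):

```bash
#!/bin/bash
# Sketch: append a timestamped docker stats snapshot to a log file.
# Example cron entry (every 4 hours):
#   0 */4 * * * /usr/local/bin/mem-snapshot.sh >> /var/log/edge-mem.log
echo "******************************************"
echo "*   $(date -u)   *"
echo "******************************************"
docker stats --no-stream
```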
vipeller commented 1 year ago

Hi @KhantNayan, sorry for the late answer. Do you see this growing indefinitely (so it never stops)? We have long-haul tests and we did not notice this, but let me set it up and see. I would think that it is related to caching and that the memory gets freed when it becomes scarce.

KhantNayan commented 1 year ago

Thank you @vipeller for the response. The memory utilization grows continuously, and we set the memory limit to 500 MB, so the out-of-memory (OOM) killer is invoked at that limit. Due to the OOM kills, the system sometimes misbehaves, e.g. it does not connect to IoT Hub, or exceptions are thrown in the EdgeHub/EdgeAgent modules.
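The 500 MB limit would normally be set via the module's container create options; on the device it can be checked with standard docker inspect fields (a sketch, assuming the container is named edgeHub):

```bash
# Memory limit configured on the edgeHub container, in bytes (0 = unlimited).
sudo docker inspect edgeHub --format '{{.HostConfig.Memory}}'

# Whether the container was OOM-killed on its last exit.
sudo docker inspect edgeHub --format '{{.State.OOMKilled}}'
```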

KhantNayan commented 1 year ago

Attached support bundle support_bundle_2023_05_08_16_03_10_UTC.zip
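For reference, a bundle like this can be generated on the device with the iotedge CLI (the time window here is illustrative):

```bash
# Collect module logs, iotedge check output, and other diagnostics into a zip file.
sudo iotedge support-bundle --since 6h
```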

vipeller commented 1 year ago

Hi @KhantNayan, I see from the support bundle that there are many modules. Can you give me a hint about the message pattern, e.g. whether these modules use module-to-module messages, and what the message size and message rate are? I don't need the exact schema; I just want to set up a similar test to find the spot where it leaks, since we don't see a leak in our long-running tests.

github-actions[bot] commented 1 year ago

This issue is being marked as stale because it has been open for 30 days with no activity.

spark-iiot commented 1 year ago

Microsoft has reproduced this problem on multiple occasions without succeeding in resolving it.

spark-iiot commented 1 year ago

Any updates on this incident?

vipeller commented 1 year ago

Hi @spark-iiot, this issue is not actively being investigated. I asked for some additional information about your setup on May 8th, so that I could run a test with a similar module count and behavior.

We have long-running tests and those don't show a memory leak. They send several tens of thousands of messages over several days.

We had memory leak problems in the past, caused by different bugs (RocksDB, etc.) - some of them were found and fixed, others were worked around. These were triggered by specific use cases.

Without knowing your use case, I will not be able to repro it and see what may be causing this. I need to run something similar to what you do, so that I can then use memory profiling to check what is holding on to the memory.
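Not a substitute for profiling, but edgeHub's built-in Prometheus metrics can help correlate the memory growth with message and queue activity. A sketch, assuming the default metrics settings (port 9600) and that curl is run from a container on the IoT Edge module network:

```bash
# edgeHub exposes Prometheus-format metrics on port 9600 by default (IoT Edge 1.0.10+).
# Run from a container on the module network; adjust the host name if your module is named differently.
curl -s http://edgeHub:9600/metrics | grep '^edgehub_'
```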

Please give some information about what you actually do:

Let me know your setup better, so we can create better tests to repro it.