Ylianst / MeshAgent

MeshAgent is used along with MeshCentral to remotely manage computers. Many variations of the background management agent are included as binaries in the MeshCentral project.
https://meshcentral.com

Agent memory leak when mesh server unreachable #110

Open sunshineco opened 2 years ago

sunshineco commented 2 years ago

I manage an installation in which the MeshCentral server is run only when a remote machine needs to be accessed, and it is not uncommon for weeks or even months to pass between such occasions. While the server is down and therefore unreachable, the agents repeatedly try to contact it. Unfortunately, some of the agents leak memory (presumably) upon each attempt to contact the server, with the result that the agent eventually consumes all memory. I have seen an agent leak hundreds of megabytes in less than a day, and gigabytes within several days. The problem has been afflicting meshagent installed on Ubuntu 20.04.

I was able to reproduce the problem easily using a virgin installation of Xubuntu 20.04 (with all software updates applied) in VirtualBox 6.1.26 with only VirtualBox Guest Additions installed.

% lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:    20.04
Codename:   focal
% uname -a
Linux xubuntu 5.11.0-37-generic #41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

The agent begins leaking memory as soon as the MeshCentral server is stopped, presumably each time it tries to reconnect to the server. In the short time I monitored it closely, I saw leaks ranging in size from 153 to 628 bytes per connection attempt.
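For anyone trying to reproduce the measurement, here is a minimal sketch of one way to watch the agent's resident memory over time on Linux. It is not the method used in this report. It reads the `VmRSS` field from `/proc/<pid>/status`; the helper names (`parse_vmrss_kib`, `read_rss_kib`, `monitor`) are hypothetical, and the PID has to be looked up separately (for example with `pgrep`, assuming the process is named `meshagent`).

```python
import re
import time


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (in kB, as the kernel reports it) or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current resident set size of a process from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kib(f.read())


def monitor(pid, interval_s=60, samples=10):
    """Print the RSS and its change since the previous sample."""
    previous = None
    for _ in range(samples):
        rss = read_rss_kib(pid)
        if previous is None or rss is None:
            delta = ""
        else:
            delta = f" ({rss - previous:+d} kB)"
        print(f"{time.strftime('%H:%M:%S')} RSS={rss} kB{delta}")
        previous = rss
        time.sleep(interval_s)
```

Calling `monitor(pid)` with the agent's PID prints one line per minute. Because resident memory changes in whole pages, leaks of a few hundred bytes per connection attempt will not show up in every sample; they only become visible as the total grows over many attempts.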

MeshCentral version is 0.9.28. Agent information which Bryan has requested in other similar bug reports:

> fdsnapshot
 Chain Timeout: 120405 milliseconds
 FD[13] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
 FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
 FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
 FD[12] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
 FD[14] (R: 0, W: 0, E: 0) => net.ipcServer
 FD[16] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
> timerinfo
 Timer: 19.9 minutes  (0x1198710) [setInterval(), meshcore (InfoUpdate Timer)]
> info
Current Core: Sep 10 2021, 3345885757
Agent Time: 2021-10-07 03:37:08.960-04:00.
User Rights: 0xffffffff.
Platform: linux.
Capabilities: 15.
Server URL: wss://[redacted]/agent.ashx.
OS: Ubuntu 20.04.3 LTS.
Modules: amt-apfclient, amt-lme, amt-manage, amt-mei, linux-dhcp, monitor-border, smbios, sysinfo, util-agentlog, wifi-scanner-windows, wifi-scanner, routeplus.
Server Connection: true, State: 1.
X11 support: true.

I have not been able to reproduce this problem with agents running on Windows or Artix Linux.

There have been other reports of agent memory leaks which have been fixed:

However, the one being reported here can be reproduced so easily with a virgin Ubuntu/Xubuntu installation that it seemed prudent to report it separately rather than mix it with the existing reports in which the conversations may have gone off on different tangents.

krayon007 commented 2 years ago

A little while ago, I fixed a bug I found where the agent leaked a few bytes every time it retried a connection to the server. It may be the exact issue that was reported here... I'm trying to dig in the logs to see when that fix was added.

sunshineco commented 2 years ago

Hi Bryan,

In the links cited above, I did see mention of some leak fixes. However, the problem reported here is still present. To verify that the problem is ongoing, I followed the reproduction recipe given above: I just now performed a clean Xubuntu 20.04 install as a guest in a VirtualBox VM and installed MeshAgent for Linux in the VM. As soon as I killed off the MeshCentral server, the agent began leaking memory, and continued leaking on each attempt to reestablish the connection to the server. The leak only stopped when I restarted the server and the agent was able to reconnect. (For what it's worth, the leaked memory in the agent remained leaked; it was never released.) This testing was performed with up-to-date MeshCentral and MeshAgent. As noted in the original report, I've only seen this leak on Ubuntu, which may provide a clue (or not).

MeshCentral version information:

> info
{
    "meshVersion": "v1.0.33",
    "nodeVersion": "v16.15.1",
    "runMode": "WAN mode",
    "productionMode": true,
    "database": "NeDB",
    "plugins": [],
    "platform": "darwin",
    "arch": "x64",
    "pid": 26161,
    "uptime": 608.131273148,
    "cpuUsage": {
        "user": 4090704,
        "system": 453528
    },
    "memoryUsage": {
        "rss": 133963776,
        "heapTotal": 33112064,
        "heapUsed": 31359408,
        "external": 30218162,
        "arrayBuffers": 29179698
    },
    "warnings": [],
    "allDevGroupManagers": []
}

MeshAgent version information:

> info
Current Core: Apr 4 2022, 419748901
Agent Time: 2022-06-08 00:55:25.204-04:00.
User Rights: 0xffffffff.
Platform: linux.
Capabilities: 15.
Server URL: wss://[redacted]/agent.ashx.
OS: Ubuntu 20.04.4 LTS.
Modules: amt-apfclient, amt-lme, amt-manage, amt-mei, linux-dhcp, monitor-border, smbios, sysinfo, util-agentlog, wifi-scanner-windows, wifi-scanner.
Server Connection: true, State: 1.
X11 support: true.

sunshineco commented 2 years ago

For the record, I also just performed a clean install of Xubuntu 22.04 as guest in a VirtualBox VM and observe the same MeshAgent leak when the MeshCentral server becomes unreachable.

MeshAgent version information:

> info
Current Core: Apr 4 2022, 419748901
Agent Time: 2022-06-08 01:47:47.123-04:00.
User Rights: 0xffffffff.
Platform: linux.
Capabilities: 15.
Server URL: wss://[redacted]/agent.ashx.
OS: Ubuntu 22.04 LTS.
Modules: amt-apfclient, amt-lme, amt-manage, amt-mei, linux-dhcp, monitor-border, smbios, sysinfo, util-agentlog, wifi-scanner-windows, wifi-scanner.
Server Connection: true, State: 1.
X11 support: true.

krayon007 commented 2 years ago

I'm working on a test script that tests a few things with the control channel, so I will certainly look into this scenario.

sunshineco commented 2 years ago

Thanks for investigating!

rafaelreis-r commented 1 year ago

Can confirm this is happening on Ubuntu 18.04 for me. Gradually, swap space is dominated by meshagent.

krayon007 commented 1 year ago

Are you seeing this when the agent is connected, or disconnected?

rafaelreis-r commented 1 year ago

> Are you seeing this when the agent is connected, or disconnected?

Happens on both connected and disconnected agents. It creeps slowly; it takes about 2 weeks to take over 8 GB of vmem.

krayon007 commented 1 year ago

Do you have a reverse proxy? I was helping someone else and found that their HAproxy was set with a 60 second idle timeout, but the mesh server was configured with the default 120 second idle timeout. I found a bug in the agent that was leaking a small amount of memory on reconnect. I fixed that issue, but it will require Ylian to update the agent when he gets back. In the meantime, we were able to get the agent to not leak by configuring his reverse proxy for a 2 minute idle timeout, so that way the agent wasn't periodically disconnecting and reconnecting.
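To get a feel for the scale of such a mismatch, here is a rough back-of-the-envelope sketch. It assumes a simplified model that is not taken from the agent source: when the proxy's idle timeout is shorter than the server's, the proxy drops the idle connection about once per proxy timeout, and each drop triggers one reconnect. The function names are hypothetical. The per-reconnect leak sizes are the 153 to 628 bytes per connection attempt from the original report, which was measured with the server unreachable and may not apply to this scenario.

```python
SECONDS_PER_DAY = 86_400


def reconnects_per_day(proxy_idle_s, server_idle_s):
    """Estimated forced reconnects per day under the simplified model above."""
    if proxy_idle_s >= server_idle_s:
        # Assumption: the proxy no longer cuts idle connections early.
        return 0
    return SECONDS_PER_DAY // proxy_idle_s


def leak_bytes_per_day(proxy_idle_s, server_idle_s, bytes_per_reconnect):
    """Estimated bytes leaked per day for a given leak size per reconnect."""
    return reconnects_per_day(proxy_idle_s, server_idle_s) * bytes_per_reconnect


# 60 s proxy timeout against the 120 s server default mentioned above.
print(reconnects_per_day(60, 120))       # 1440
print(leak_bytes_per_day(60, 120, 153))  # 220320
print(leak_bytes_per_day(60, 120, 628))  # 904320
# Proxy raised to 2 minutes, as in the workaround: no forced reconnects.
print(reconnects_per_day(120, 120))      # 0
```

Under these assumptions a 60 second proxy timeout would force about 1440 reconnects a day, which works out to somewhere between roughly 0.2 MB and 0.9 MB leaked per day.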

rafaelreis-r commented 1 year ago

> Do you have a reverse proxy? I was helping someone else and found that their HAproxy was set with a 60 second idle timeout, but the mesh server was configured with the default 120 second idle timeout. I found a bug in the agent that was leaking a small amount of memory on reconnect. I fixed that issue, but it will require Ylian to update the agent when he gets back. In the meantime, we were able to get the agent to not leak by configuring his reverse proxy for a 2 minute idle timeout, so that way the agent wasn't periodically disconnecting and reconnecting.

Yes! I do!

I run caddy as reverse proxy and use Cloudflare as another proxy layer (orange cloud enabled). This might be a similar situation.

Is there anything I can investigate on my setup that would help you guys?

rafaelreis-r commented 1 year ago

@krayon007

I spun up an AWS EC2 free tier instance on Ubuntu 22.04 and installed meshagent and telegraf monitoring. This graph shows precisely the memory creep: [graph: memcreep]

Note that it goes on until it takes over the instance memory (1GB) and the instance crashes. After I rebooted it, it started creeping again.
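If the monitoring data can be exported as (time, memory) pairs, a straight-line fit gives a rough growth rate and an estimate of when the instance will run out of memory. The sketch below assumes the growth is roughly linear. The function names are hypothetical, and the sample values in the usage note are made up for illustration, not read from the graph.

```python
def fit_growth(samples):
    """Least-squares fit of (t_seconds, rss_bytes) samples.

    Returns (slope in bytes per second, intercept in bytes).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (y - mean_y) for t, y in samples) / var_t
    return slope, mean_y - slope * mean_t


def seconds_until(limit_bytes, samples):
    """Seconds after the last sample until the fitted line reaches limit_bytes.

    Returns None if memory is not growing.
    """
    slope, intercept = fit_growth(samples)
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    return max(0.0, (limit_bytes - intercept) / slope - last_t)
```

For example, with the made-up samples `[(0, 100), (10, 200), (20, 300)]` the fitted growth is 10 bytes per second, and `seconds_until(1100, ...)` returns 80.0. With real data, `limit_bytes` would be the instance memory (1 GB here).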

I see this behavior on multiple bare metal and virtualized Ubuntu Server hosts.

All those instances connect to my reverse proxied server (caddy) behind cloudflare.

Issue is real and occurs on latest version.