Ylianst / MeshAgent

MeshAgent is used along with MeshCentral to remotely manage computers. Many variations of the background management agent are included as binaries in the MeshCentral project.
https://meshcentral.com

Meshagent high memory usage / memory leak on RHEL9 / RHEL8 & Debian 10 #151

Open deajan opened 2 years ago

deajan commented 2 years ago

Hello,

I have a couple of meshagents running on Linux which obviously have a memory leak: after just one day I get about 500MB of memory usage per agent. After a couple of days, meshagent takes 2GB (maybe more, but by then there is no more system memory and services begin to fail).

I updated my MeshCentral setup to the latest git master yesterday, v1.0.77 (previously v1.0.6x), and still have the same memory leak. Windows agents don't seem affected.

I have no clue what to provide to help debug. My meshagent.log file seems pretty empty:

[2022-09-01 08:33:46 PM] [F6A48178D7BCE798] microstack/ILibParsers.c:10923 (0,0) SelfUpdate -> Current Version: 2fc1af473a96b5ad64011fd0575cfa15ee36d769
[2022-09-01 08:33:50 PM] [F6A48178D7BCE798] microstack/ILibParsers.c:10923 (0,0) SelfUpdate -> Stopping Chain (1)

Here's the output of pmap:

 pmap 651
651:   /usr/local/mesh_services/meshagent/meshagent --installedByUser=0
0000000000400000   3488K r-x-- meshagent
0000000000967000    148K r---- meshagent
000000000098c000     20K rw--- meshagent
0000000000991000    156K rw---   [ anon ]
00000000022ec000 511528K rw---   [ anon ]
00007f198c99f000     12K r---- libgcc_s-11-20220127.so.1
00007f198c9a2000     72K r-x-- libgcc_s-11-20220127.so.1
00007f198c9b4000     12K r---- libgcc_s-11-20220127.so.1
00007f198c9b7000      4K ----- libgcc_s-11-20220127.so.1
00007f198c9b8000      4K r---- libgcc_s-11-20220127.so.1
00007f198c9b9000      4K rw--- libgcc_s-11-20220127.so.1
00007f198c9ba000     16K r---- libnss_myhostname.so.2
00007f198c9be000     60K r-x-- libnss_myhostname.so.2
00007f198c9cd000     48K r---- libnss_myhostname.so.2
00007f198c9d9000      4K ----- libnss_myhostname.so.2
00007f198c9da000     20K r---- libnss_myhostname.so.2
00007f198c9df000      4K rw--- libnss_myhostname.so.2
00007f198c9e0000      4K -----   [ anon ]
00007f198c9e1000   8212K rw---   [ anon ]
00007f198d1e6000    176K r---- libc.so.6
00007f198d212000   1496K r-x-- libc.so.6
00007f198d388000    336K r---- libc.so.6
00007f198d3dc000      4K ----- libc.so.6
00007f198d3dd000     12K r---- libc.so.6
00007f198d3e0000     12K rw--- libc.so.6
00007f198d3e3000     52K rw---   [ anon ]
00007f198d3f0000      4K r---- librt.so.1
00007f198d3f1000      4K r-x-- librt.so.1
00007f198d3f2000      4K r---- librt.so.1
00007f198d3f3000      4K r---- librt.so.1
00007f198d3f4000      4K rw---   [ anon ]
00007f198d3f5000      4K r---- libdl.so.2
00007f198d3f6000      4K r-x-- libdl.so.2
00007f198d3f7000      4K r---- libdl.so.2
00007f198d3f8000      4K r---- libdl.so.2
00007f198d3f9000      4K rw---   [ anon ]
00007f198d3fa000     60K r---- libm.so.6
00007f198d409000    448K r-x-- libm.so.6
00007f198d479000    360K r---- libm.so.6
00007f198d4d3000      4K r---- libm.so.6
00007f198d4d4000      4K rw--- libm.so.6
00007f198d4d5000      4K r---- libutil.so.1
00007f198d4d6000      4K r-x-- libutil.so.1
00007f198d4d7000      4K r---- libutil.so.1
00007f198d4d8000      4K r---- libutil.so.1
00007f198d4d9000      4K rw---   [ anon ]
00007f198d4da000      4K r---- libpthread.so.0
00007f198d4db000      4K r-x-- libpthread.so.0
00007f198d4dc000      4K r---- libpthread.so.0
00007f198d4dd000      4K r---- libpthread.so.0
00007f198d4de000     12K rw---   [ anon ]
00007f198d4e7000      8K r---- ld-linux-x86-64.so.2
00007f198d4e9000    152K r-x-- ld-linux-x86-64.so.2
00007f198d50f000     44K r---- ld-linux-x86-64.so.2
00007f198d51b000      8K r---- ld-linux-x86-64.so.2
00007f198d51d000      8K rw--- ld-linux-x86-64.so.2
00007ffd5cf57000    132K rw---   [ stack ]
00007ffd5cfb1000     16K r----   [ anon ]
00007ffd5cfb5000      8K r-x--   [ anon ]
ffffffffff600000      4K --x--   [ anon ]
 total           527248K

This is really bugging me, since I need to restart the agents every day so they don't hog all the memory.

System: AlmaLinux 9.0, Linux hyper-npf04.omni.local 5.14.0-70.22.1.el9_0.x86_64 #1 SMP PREEMPT Tue Aug 9 11:45:52 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

Happens on both physical and virtual servers.

Any insights, perhaps?

krayon007 commented 2 years ago

Can you go to the /usr/local/mesh_services/meshagent folder (or wherever your agent is installed), and run the following command:

sudo ./meshagent -state

That should return the list of open descriptors the agent thinks are active. Also, from the console tab, run the command:

eval _debugGC()

and let me know if the resource consumption drops at all.

Is this excessive resource consumption only happening on RHEL9 for you, not on other distros?

krayon007 commented 2 years ago

Also, on the agents that are leaking, can you run the following command, just so I can see what version is exactly running:

Either run ./meshagent -info from /usr/local/mesh_services/meshagent, or run the command versions from the console tab.

krayon007 commented 2 years ago

Actually, for the list of descriptors, you can also just run fdsnapshot from the console tab.

deajan commented 2 years ago

Here are the outputs:

meshagent -state:

Querying Mesh Agent state...
Mesh Agent connected to: wss://remote.netpower.fr:443/agent.ashx
 Chain Timeout: 119952 milliseconds
 FD[12] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
 FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
 FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
 FD[19] (R: 0, W: 0, E: 0) => (stderr) childProcess (pid=537993), Remote Terminal
 FD[21] (R: 0, W: 0, E: 0) => (stdout) childProcess (pid=537993), Remote Terminal
 FD[13] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
 FD[20] (R: 0, W: 0, E: 0) => net.ipcServer.ipcSocketConnection
 FD[15] (R: 0, W: 0, E: 0) => net.ipcServer
 FD[16] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
 FD[14] (R: 0, W: 0, E: 0) => http.WebSocketStream, MeshAgent_relayTunnel
 FD[17] (R: 0, W: 0, E: 0) => net.socket
 FD[18] (R: 0, W: 0, E: 0) => http.WebSocketStream, MeshAgent_relayTunnel, Remote Terminal

 Timer: 19.2 minutes  (0x21d3db00) [setInterval(), meshcore (InfoUpdate Timer)]

meshagent -info:

Compiled on: 10:49:01, Aug 29 2022
   Commit Hash: f221183413f39aba155c75dbb85ff72777fc8244
   Commit Date: 2022-Aug-25 00:17:18-0700
Using OpenSSL 1.1.1q  5 Jul 2022
Agent ARCHID: 6
Detected OS: AlmaLinux 9.0 (Emerald Puma) - x64

On the console tab:

> eval _debugGC()
> versions
{
  "openssl": "1.1.1q",
  "duktape": "v2.6.0",
  "commitDate": "2022-08-25T07:17:18.000Z",
  "commitHash": "f221183413f39aba155c75dbb85ff72777fc8244",
  "compileTime": "10:49:00, Aug 29 2022"
}
> fdsnapshot
 Chain Timeout: 117752 milliseconds
 FD[12] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
 FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
 FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
 FD[13] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
 FD[15] (R: 0, W: 0, E: 0) => net.ipcServer
 FD[16] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
 FD[14] (R: 0, W: 0, E: 0) => http.WebSocketStream, MeshAgent_relayTunnel
 FD[17] (R: 0, W: 0, E: 0) => net.socket

deajan commented 2 years ago

At the moment I don't have an agent set up on earlier RHELs, but I'll set one up on Debian 10 and RHEL 8 right now for the sake of sanity.

krayon007 commented 2 years ago

I'm downloading AlmaLinux 9 to do some testing...

deajan commented 2 years ago

> At the moment I don't have an agent set up on earlier RHELs, but I'll set one up on Debian 10 and RHEL 8 right now for the sake of sanity.

I set up two other meshagent instances on AlmaLinux 8.6 and Debian 10.12 a few minutes after the message above. Now, 12 hours later, those instances take 359M of memory on Debian and 355M on AlmaLinux 8.6, so I think this is not related to RHEL9 only.

For reference (this is probably not relevant), here's my MeshCentral config. Note that I use "__commented__" key prefixes to emulate comments, since JSON does not accept any form of comments. My MeshCentral instance is behind HAProxy. All node processes on the server take about 380M together.

{
        "settings": {
                "Cert": "remote.example.tld",
                "Port": 8443,
                "RedirPort": 0,
                "AliasPort": 443,
                "TlsOffLoad": "127.0.0.1"
        },
        "domains": {
                "": {
                        "certUrl": "https://127.0.0.1:443/",
                        "title": "Some title",
                        "title2": "remote.example.tld",
                        "userQuota": 1048576,
                        "meshQuota": 248576,
                        "newAccounts": 0
                },
                "Customer1": {
                        "title": "Customer1",
                        "title2": "Extra string",
                        "newAccounts": 1
                }
        },
        "__commented__smtp": {
                "host": "mail.example.tld",
                "port": 25,
                "from": "infra@example.tldr",
                "__commented__user": "infra@example.tld",
                "__commented__pass": "mypass",
                "__commented__tls": false
        }
}

krayon007 commented 2 years ago

What types of things were you doing on the agents during those 12 hours? That way I can do more testing to try to replicate it. I was watching my two Ubuntu systems, doing terminal and KVM, but haven't seen memory usage go up after making several connections over and over.

On the console tab, if you select core, and clear core, does the memory usage go down?

deajan commented 2 years ago

I literally did nothing except install them before going to bed yesterday. What do you mean by "select core" and "clear core"?

> coreinfo
{
  "action": "coreinfo",
  "caps": 14,
  "osdesc": "AlmaLinux 8.6 (Sky Tiger)",
  "root": true,
  "users": [],
  "value": "Aug 29 2022, 294730201"
}
> coredump
> info
Current Core: Aug 29 2022, 294730201
Agent Time: 2022-09-03 07:47:41.523+02:00.
User Rights: 0xffffffff.
Platform: linux.
Capabilities: 14.
Server URL: wss://remote.example.tld:443/agent.ashx.
OS: AlmaLinux 8.6 (Sky Tiger).
Modules: amt-apfclient, amt-lme, amt-manage, amt-mei, linux-dhcp, monitor-border, smbios, sysinfo, util-agentlog, wifi-scanner-windows, wifi-scanner.
Server Connection: true, State: 1.
X11 support: false.

deajan commented 2 years ago

> What types of things were you doing on the agents during those 12 hours? That way I can do more testing to try to replicate it. I was watching my two Ubuntu systems, doing terminal and KVM, but haven't seen memory usage go up after making several connections over and over.
>
> On the console tab, if you select core, and clear core, does the memory usage go down?

Okay, I got it (my MeshCentral UI was configured in French, so I didn't understand what you meant). I selected "clear core", which worked, since it removed the terminal tab. That didn't change anything for the memory consumption of the meshagent process, which is still really high (690M since yesterday).

krayon007 commented 2 years ago

Ok thanks. Since you said it was idle the whole time, that gives me a few ideas of where to start looking.

deajan commented 2 years ago

Keep me posted if I can help in some way, like running valgrind or similar.

deajan commented 2 years ago

Any news on this? For now, I have to restart meshagent via cron every night so it doesn't eat all the memory on my servers.

In 3 days, meshagent ate 1.7G of memory :(

systemctl status meshagent
● meshagent.service - meshagent background service
     Loaded: loaded (/usr/lib/systemd/system/meshagent.service; enabled; vendor preset: disabled)
     Active: active (running) since Tue 2022-09-06 14:28:52 CEST; 3 days ago
   Main PID: 2642573 (meshagent)
      Tasks: 1 (limit: 101088)
     Memory: 1.7G
        CPU: 3h 47min 20.941s
     CGroup: /system.slice/meshagent.service
             └─2642573 /usr/local/mesh_services/meshagent/meshagent --installedByUser=0

Sep 06 14:28:52 myserver.local systemd[1]: Started meshagent background service.
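
The nightly restart mentioned above can be a single cron entry, e.g. (a sketch; the unit name is taken from the systemctl output above, and the time is arbitrary):

```shell
# /etc/cron.d/meshagent-restart -- stopgap until the leak is fixed:
# restart the agent every night at 03:00 to release the leaked memory.
0 3 * * * root /usr/bin/systemctl restart meshagent.service
```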

krayon007 commented 2 years ago

Not yet, still trying to determine what is causing it. I set aside two Ubuntu machines to track this down; they've been running for 2 days straight, but memory consumption is still sitting at 46 MB.

deajan commented 2 years ago

Thank you for the update. I can provide you with SSH access if needed.

krayon007 commented 2 years ago

> Thank you for the update. I can provide you with SSH access if needed.

This would actually be really helpful. I've been fiddling around with different scenarios that I thought would cause this, but I've been running into dead-ends... You can email me the details. I'd love to be able to get to the bottom of this.

deajan commented 2 years ago

> This would actually be really helpful. I've been fiddling around with different scenarios that I thought would cause this, but I've been running into dead-ends... You can email me the details. I'd love to be able to get to the bottom of this.

Email sent.

krayon007 commented 2 years ago

@deajan thanks! By any chance, are you able to email me your config.json? I set up a separate agent on your machine, and found that if I connected it to my server, it didn't leak... but if I connected it to your server, it did... That means there has to be something in your server config that causes the agent to do something differently when connected to your server versus mine.

deajan commented 2 years ago

Sent you the config.json file contents. If needed, I could perhaps grant you access to my server, after temporarily restricting access to other machines.

krayon007 commented 2 years ago

I figured out what was going on... There are two main issues. The first is that your HAProxy appears to be dropping idle connections after 60 seconds, while your server has the default configuration, so the agent only sends keepalives every 120 seconds; as a result, the control channel gets disconnected/reconnected every 60 seconds. I found that upon reconnection, the agent makes a "sysinfo" call, and this call would cause the memory usage to accumulate.

I fixed the agent so that the sysinfo call doesn't cause this problem, but that will require an agent update... However, in your case I was also able to resolve the memory usage issue by configuring the agent's idle timeout to 50 seconds, so that it sends keepalives every 50 seconds. I configured the agent to do this on your system, and the agent's memory usage according to pmap has been holding steady at 22 MB.

If you add the following line in the "settings" section of your config.json:

"agentpong": 50
"browserpong": 50

this should fix the main issue for you, as all the agents that connect will be configured with the proper keepalives for your HAProxy setup... Alternatively, you could reconfigure your HAProxy to use a 2-minute idle disconnect interval instead of 1 minute.
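
Merged into the "settings" section of the config posted earlier in this thread, that would look like this (a sketch; only the settings section is shown, and the key casing follows the agentPong spelling used later in the thread):

```json
{
        "settings": {
                "Cert": "remote.example.tld",
                "Port": 8443,
                "RedirPort": 0,
                "AliasPort": 443,
                "TlsOffLoad": "127.0.0.1",
                "agentPong": 50,
                "browserPong": 50
        }
}
```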

deajan commented 2 years ago

@krayon007 Fantastic news, thank you a lot. Now that the memory leak itself is fixed, I would like to ask whether a client keepalive every 2 minutes is enough. HAProxy (as well as Red Hat in https://access.redhat.com/solutions/5357081) recommends setting client timeouts to 30 seconds for the proxy.

Should the meshagent keepalive interval be lowered, or should the MeshCentral documentation perhaps note that HAProxy requires "timeout client 130"?

As a side question, I've updated my config.json file with your settings above, so I don't have to extend my HAProxy timeout. Is there an easy way to restart all the agents so I don't have to relaunch them one by one?

krayon007 commented 2 years ago

@deajan simply restarting your server should be sufficient for the agents to pick up the new settings. As for restarting the agents, you can do that server-side on each agent, but let me check whether there is a group action for that.

krayon007 commented 2 years ago

Regarding your question on the timeout value itself, more documentation is always a good thing. Is your HAProxy using its default timeout values? If so, and if 30 seconds is the recommended client value with the proxy, then we should add documentation reflecting that in the section that describes configuring a reverse proxy. I will also add documentation about this on the agent side.

On a side note, @deajan I assume you meant you modified your config to set the client timeout to 30, not 130? 120 is the default, so 130 would be too long with your current HAProxy setup, as I had to lower it to 50 to prevent disconnects.

deajan commented 2 years ago

There isn't really a default value for "timeout client". Most online docs about HAProxy just advise setting it to 30s to maximize security, see this or this. I guess the recommended short value is there to avoid open-connection exhaustion attacks (I remember something like Slowloris doing this).

I meant "timeout client 130" in HAProxy, so it waits for the meshagent client pinging every 2 minutes.

deajan commented 2 years ago

Just made a fresh HAProxy 2.4 install on RHEL9; the default "timeout client" value is 1m. So I guess the agentPong value should definitely be below that value, or it should be documented that the defaults produce timeouts with HAProxy.
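
For reference, the HAProxy side of this would look something like the following (a sketch, not a verified production config; with agentPong set to 50 the 1m default would also suffice, while 130s covers the default 120s keepalive):

```haproxy
defaults
    mode http
    # Must exceed the agent keepalive interval: 120s by default,
    # or the agentPong value if one is configured on the server.
    timeout client  130s
    timeout server  130s
    # Once a connection is upgraded to a WebSocket, HAProxy applies
    # "timeout tunnel" instead of the client/server timeouts.
    timeout tunnel  130s
```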