deajan opened this issue 2 years ago
Can you go to the /usr/local/mesh_services/meshagent
folder (or wherever your agent is installed), and run the following command:
sudo ./meshagent -state
That should return the list of open descriptors the agent thinks are active... Also from the console tab, run the command:
eval _debugGC()
and let me know if the resource consumption drops at all.
Is this excessive resource consumption only happening on RHEL9 for you, not on other distros?
Also, on the agents that are leaking, can you run the following command, just so I can see exactly which version is running:
Either from /usr/local/mesh_services/meshagent
run ./meshagent -info
or from the console tab, run the command: versions
Actually, for the list of descriptors, you can also just run fdsnapshot from the console tab...
Here are the outputs:
meshagent -state:
Querying Mesh Agent state...
Mesh Agent connected to: wss://remote.netpower.fr:443/agent.ashx
Chain Timeout: 119952 milliseconds
FD[12] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
FD[19] (R: 0, W: 0, E: 0) => (stderr) childProcess (pid=537993), Remote Terminal
FD[21] (R: 0, W: 0, E: 0) => (stdout) childProcess (pid=537993), Remote Terminal
FD[13] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
FD[20] (R: 0, W: 0, E: 0) => net.ipcServer.ipcSocketConnection
FD[15] (R: 0, W: 0, E: 0) => net.ipcServer
FD[16] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
FD[14] (R: 0, W: 0, E: 0) => http.WebSocketStream, MeshAgent_relayTunnel
FD[17] (R: 0, W: 0, E: 0) => net.socket
FD[18] (R: 0, W: 0, E: 0) => http.WebSocketStream, MeshAgent_relayTunnel, Remote Terminal
Timer: 19.2 minutes (0x21d3db00) [setInterval(), meshcore (InfoUpdate Timer)]
meshagent -info:
Compiled on: 10:49:01, Aug 29 2022
Commit Hash: f221183413f39aba155c75dbb85ff72777fc8244
Commit Date: 2022-Aug-25 00:17:18-0700
Using OpenSSL 1.1.1q 5 Jul 2022
Agent ARCHID: 6
Detected OS: AlmaLinux 9.0 (Emerald Puma) - x64
On the console tab:
> eval _debugGC()
> versions
{
"openssl": "1.1.1q",
"duktape": "v2.6.0",
"commitDate": "2022-08-25T07:17:18.000Z",
"commitHash": "f221183413f39aba155c75dbb85ff72777fc8244",
"compileTime": "10:49:00, Aug 29 2022"
}
> fdsnapshot
Chain Timeout: 117752 milliseconds
FD[12] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
FD[13] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
FD[15] (R: 0, W: 0, E: 0) => net.ipcServer
FD[16] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
FD[14] (R: 0, W: 0, E: 0) => http.WebSocketStream, MeshAgent_relayTunnel
FD[17] (R: 0, W: 0, E: 0) => net.socket
At the moment, I don't have an agent set up on earlier RHELs, but I'll set one up on Debian 10 and RHEL 8 right now for sanity's sake.
I'm downloading AlmaLinux 9, to do some testing...
I set up two other meshagent instances on AlmaLinux 8.6 and Debian 10.12 a few minutes after the message above. Now, 12 hours later, those instances use 359M of memory on Debian and 355M on AlmaLinux 8.6, so I think this is not related to RHEL9 only.
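For anyone wanting to reproduce this comparison, a quick way to sample an agent's resident memory is to read VmRSS from /proc (the helper name here is just a sketch, not part of the agent):

```shell
# Hypothetical helper: print a process's resident set size in kB,
# read from /proc/<pid>/status (Linux only).
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# For the agent you would use: rss_kb "$(pidof meshagent)"
# Demo on the current shell's own PID:
rss_kb "$$"
```

Logging this every few minutes (e.g. from cron) makes the growth curve easy to compare across distros.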
For reference (this is probably not relevant), here's my meshcentral config. Note that I use "__commented__" key prefixes to emulate comments, since JSON does not accept any form of comments. My meshcentral instance is behind HAProxy. All node processes on the server take about 380M together.
{
  "settings": {
    "Cert": "remote.example.tld",
    "Port": 8443,
    "RedirPort": 0,
    "AliasPort": 443,
    "TlsOffLoad": "127.0.0.1"
  },
  "domains": {
    "": {
      "certUrl": "https://127.0.0.1:443/",
      "title": "Some title",
      "title2": "remote.example.tld",
      "userQuota": 1048576,
      "meshQuota": 248576,
      "newAccounts": 0
    },
    "Customer1": {
      "title": "Customer1",
      "title2": "Extra string",
      "newAccounts": 1
    }
  },
  "__commented__smtp": {
    "host": "mail.example.tld",
    "port": 25,
    "from": "infra@example.tldr",
    "__commented__user": "infra@example.tld",
    "__commented__pass": "mypass",
    "__commented__tls": false
  }
}
What types of things were you doing on the agents during those 12 hours? That way I can do more testing to try to replicate it. I was watching my two Ubuntu systems, doing terminal and KVM, but haven't seen memory usage go up after making several connections over and over.
On the console tab, if you select core, and clear core, does the memory usage go down?
I literally did nothing except installing them before going to bed yesterday. What do you mean by select and clear core?
> coreinfo
{
"action": "coreinfo",
"caps": 14,
"osdesc": "AlmaLinux 8.6 (Sky Tiger)",
"root": true,
"users": [],
"value": "Aug 29 2022, 294730201"
}
> coredump
> info
Current Core: Aug 29 2022, 294730201
Agent Time: 2022-09-03 07:47:41.523+02:00.
User Rights: 0xffffffff.
Platform: linux.
Capabilities: 14.
Server URL: wss://remote.example.tld:443/agent.ashx.
OS: AlmaLinux 8.6 (Sky Tiger).
Modules: amt-apfclient, amt-lme, amt-manage, amt-mei, linux-dhcp, monitor-border, smbios, sysinfo, util-agentlog, wifi-scanner-windows, wifi-scanner.
Server Connection: true, State: 1.
X11 support: false.
Okay, I got it (my meshcentral was configured in French, so I didn't understand what you meant).
I selected clear core, which worked, since it removed the terminal tab.
That didn't change anything for the memory consumption of the meshagent process, which is still really high (690M since yesterday).
Ok thanks. Since you said it was idle the whole time, that gives me a few ideas of where to start looking.
Keep me posted if I can help in some way, like running valgrind or similar.
Any news on this? For now, I have to restart meshagent via cron every night so it doesn't eat all the memory on my servers.
In 3 days, meshagent ate 1.7G of memory :(
systemctl status meshagent
● meshagent.service - meshagent background service
Loaded: loaded (/usr/lib/systemd/system/meshagent.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2022-09-06 14:28:52 CEST; 3 days ago
Main PID: 2642573 (meshagent)
Tasks: 1 (limit: 101088)
Memory: 1.7G
CPU: 3h 47min 20.941s
CGroup: /system.slice/meshagent.service
└─2642573 /usr/local/mesh_services/meshagent/meshagent --installedByUser=0
Sep 06 14:28:52 myserver.local systemd[1]: Started meshagent background service.
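For reference, the nightly restart workaround mentioned above can be a single root crontab entry; the 03:00 time is arbitrary, adjust to taste:

```
# Stopgap: restart the leaking agent every night at 03:00
0 3 * * * /usr/bin/systemctl restart meshagent
```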
Not yet, still trying to determine what is causing it, as I set aside two Ubuntu machines to track this down. They've been running for 2 days straight, but memory consumption is still sitting at 46mb.
Thank you for the update. I can provide you with an SSH access if needed.
This would actually be really helpful. I've been fiddling around with different scenarios that I thought would cause this, but I've been running into dead-ends... You can email me the details. I'd love to be able to get to the bottom of this.
Email sent.
@deajan thanks! By any chance are you able to email me your config.json? I setup a separate agent on your machine, and found that if I connected it to my server, it didn't leak... But if I connected it to your server it did... That means there has to be something in your server config that is causing the agent to do something differently when connected to your server versus mine.
Sent you the config.json
file contents.
If needed, I could perhaps grant you access to my server, after temporarily restricting access to other machines.
I figured out what was going on... There are two main issues. The first is that your HAProxy appears to be dropping idle connections after 60 seconds. Your server, however, has the default configuration, so the agent only sends keep-alives every 120 seconds, resulting in the control channel getting disconnected and reconnected every 60 seconds. I found that upon reconnection, the agent makes a "sysinfo" call, and this call would cause the memory usage to accumulate.
I fixed the agent so that the sysinfo call doesn't cause this problem, but that will require an agent update... However, in your case, I was also able to resolve the memory usage issue by configuring the agent's idle timeout to 50 seconds, so that it sends keep-alives every 50 seconds. I configured the agent to do this on your system, and its memory usage according to pmap has been holding steady at 22MB.
If you add the following lines to the "settings" section of your config.json:
"agentpong": 50,
"browserpong": 50
this should fix the main issue for you, as all the agents that connect will be configured with the proper keep-alives for your HAProxy setup... Alternatively, you could reconfigure your HAProxy to use a 2-minute idle disconnect interval instead of 1 minute.
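Spliced into the config.json posted earlier in this thread, the settings section would then look something like this (other keys unchanged; note the commas between entries):

```
{
  "settings": {
    "Cert": "remote.example.tld",
    "Port": 8443,
    "RedirPort": 0,
    "AliasPort": 443,
    "TlsOffLoad": "127.0.0.1",
    "agentpong": 50,
    "browserpong": 50
  }
}
```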
@krayon007 Fantastic news, thank you a lot. Now that the memory leak itself is fixed, I would like to ask whether a client keep-alive every 2 minutes is enough. HAProxy (as well as Red Hat in https://access.redhat.com/solutions/5357081) recommends setting client timeouts to 30 seconds for the proxy.
Should the meshagent keep-alive interval be lowered, or should it perhaps be added to the meshcentral documentation that HAProxy requires timeout client 130?
As a side question, I've updated my config.json file with your settings above so I don't have to extend my HAProxy timeout.
Is there an easy way to restart all the agents so I don't have to relaunch them one by one?
@deajan simply restarting your server should be sufficient for the agents to capture the new settings. As far as restarting the agent, you can do that server side on each agent, but let me check to see if there is a group action for that.
Regarding your question on the timeout value itself, more documentation is always a good thing. Is your HAProxy using its default timeout values? If so, and if 30 seconds is the recommended client value with the proxy, then we should add documentation reflecting that in the section that describes configuring a reverse proxy. I will also add documentation about this on the agent side.
On a side note, @deajan, I assume you meant you modified your config to set the client timeout to 30, not 130? 120 is the default, so 130 would be too long with your current HAProxy setup, as I had to lower it to 50 to prevent disconnects.
There is not really a default value for timeout client. Most online docs about HAProxy just advise setting it to 30s to maximize security, see this or this.
I guess the recommended short value is there to avoid open-connection attacks (I remember slowloris doing something like this).
I meant timeout client 130 in HAProxy, so it waits for the meshagent client pinging every 2 minutes.
I just made a fresh HAProxy 2.4 install on RHEL9; the default timeout client value is 1m. So I guess the agentPong value should definitely be below that value, or the default should be documented as producing timeouts with HAProxy.
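For the alternative HAProxy-side fix discussed above, a minimal sketch of the relevant haproxy.cfg lines might look like this (the values are illustrative; 130s leaves headroom over the agent's default 120-second keep-alive):

```
defaults
    mode http
    timeout connect 10s
    timeout client  130s
    timeout server  130s
```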
Hello,
I have a couple of meshagents running on Linux which obviously have a memory leak; after just one day I get about 500MB of memory usage per agent. After a couple of days, meshagent takes 2GB (maybe more, but by then there's no more system memory and services begin to fail).
I updated my meshcentral setup to the latest git master yesterday, v1.0.77 (previously v1.0.6x), and still have the same memory leaks. Windows agents don't seem affected.
I have no clue what to provide to help debug. My meshagent.log file seems pretty empty. Here's the output of pmap:
This is really bugging me, since I need to restart the agents every day so they don't hog all the memory.
System: AlmaLinux 9.0
Linux hyper-npf04.omni.local 5.14.0-70.22.1.el9_0.x86_64 #1 SMP PREEMPT Tue Aug 9 11:45:52 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
Happens both on physical and virtual servers.
Any insights, perhaps?