Closed: Matt-CyberGuy closed this issue 2 years ago.
On one of the agents that is consuming a lot of RAM, can you type fdsnapshot in the console tab and let me know what it shows? This should list all the open descriptors the agent thinks it has, along with the module/script that created them.
I had to re-install the agent; I don't know whether that will make a difference, but this is from one of the endpoints:
FD[12] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
FD[13] (R: 0, W: 0, E: 0) => net.ipcServer
FD[14] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
FD[16] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
What version of Debian are you running? I'll set up a test system and monitor it for a while to see if I can discover anything.
This is what pops up on the command line:
4.19.0-11-untangle-amd64 #1 SMP Debian 4.19.146-1+untangle1buster (2020-09-29) x86_64 GNU/Linux
But the specific product we have running is Untangle firewalls, version 16.1.1. It's not too difficult to get one up and running in a VM. The only issue you might have is that out of the dozen or so firewalls I had it installed on, only 2 were having this issue.
Do you know if there were any connectivity issues with those two firewalls? Perhaps the leak has something to do with how frequently the control channel disconnects/reconnects, etc...
We monitor both in 5-minute increments, and neither has been reporting any downtime besides the restarts caused by the memory leaks. The ISP on the one I sent the diagnostics for is Spectrum, which is the same for 90% of our clients; the other system we had to remove the agent from is located in a data center with a failover connection.
Hi, thanks again for being so active on this. Just wanted to give an update. I logged into the firewall and can see meshagent grabbing a chunk of RAM again:
I ran fdsnapshot again, but the results look the same.
How much time do you think elapsed between when the agent started and when this snapshot was taken? I tracked down a few minor leaks (kilobytes), so I'm trying to see if they could be contributing to your problem.
Morning, I'm looking at the memory usage graph now. At the plot points where it looks like the endpoint flushes its memory and meshagent restarts on its own, it varies, but it looks like it takes around 12-14 hours.
I've been running meshagent on my firewall at home and found it was using quite a bit of memory, like the other systems mentioned. I never noticed since I've got 8 GB of memory on my firewall. I was able to get an fdsnapshot this time. Hope this helps.
I'm seeing the same thing on a newish Ubuntu server, running the latest version of MC and the agent.
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.1 LTS
Release: 20.04
Codename: focal
Kernel Info
Linux co-k3s-ctrl 5.4.0-54-generic #60-Ubuntu SMP Fri Nov 6 10:37:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Process List
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.2 171164 9192 ? Ss Dec13 0:19 /sbin/init
root 2 0.0 0.0 0 0 ? S Dec13 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? I< Dec13 0:00 [rcu_gp]
root 4 0.0 0.0 0 0 ? I< Dec13 0:00 [rcu_par_gp]
root 6 0.0 0.0 0 0 ? I< Dec13 0:00 [kworker/0:0H-kblockd]
root 8 0.0 0.0 0 0 ? I< Dec13 0:00 [mm_percpu_wq]
root 9 0.0 0.0 0 0 ? S Dec13 1:23 [ksoftirqd/0]
root 10 0.0 0.0 0 0 ? I Dec13 5:53 [rcu_sched]
root 11 0.0 0.0 0 0 ? S Dec13 0:06 [migration/0]
root 12 0.0 0.0 0 0 ? S Dec13 0:00 [idle_inject/0]
root 14 0.0 0.0 0 0 ? S Dec13 0:00 [cpuhp/0]
root 15 0.0 0.0 0 0 ? S Dec13 0:00 [cpuhp/1]
root 16 0.0 0.0 0 0 ? S Dec13 0:00 [idle_inject/1]
root 17 0.0 0.0 0 0 ? S Dec13 0:06 [migration/1]
root 18 0.0 0.0 0 0 ? S Dec13 1:23 [ksoftirqd/1]
root 20 0.0 0.0 0 0 ? I< Dec13 0:00 [kworker/1:0H-kblockd]
root 21 0.0 0.0 0 0 ? S Dec13 0:00 [kdevtmpfs]
root 22 0.0 0.0 0 0 ? I< Dec13 0:00 [netns]
root 23 0.0 0.0 0 0 ? S Dec13 0:00 [rcu_tasks_kthre]
root 24 0.0 0.0 0 0 ? S Dec13 0:00 [kauditd]
root 26 0.0 0.0 0 0 ? S Dec13 0:00 [khungtaskd]
root 27 0.0 0.0 0 0 ? S Dec13 0:00 [oom_reaper]
root 28 0.0 0.0 0 0 ? I< Dec13 0:00 [writeback]
root 29 0.0 0.0 0 0 ? S Dec13 0:00 [kcompactd0]
root 30 0.0 0.0 0 0 ? SN Dec13 0:00 [ksmd]
root 31 0.0 0.0 0 0 ? SN Dec13 0:04 [khugepaged]
root 77 0.0 0.0 0 0 ? I< Dec13 0:00 [kintegrityd]
root 78 0.0 0.0 0 0 ? I< Dec13 0:00 [kblockd]
root 79 0.0 0.0 0 0 ? I< Dec13 0:00 [blkcg_punt_bio]
root 80 0.0 0.0 0 0 ? I< Dec13 0:00 [tpm_dev_wq]
root 81 0.0 0.0 0 0 ? I< Dec13 0:00 [ata_sff]
root 82 0.0 0.0 0 0 ? I< Dec13 0:00 [md]
root 83 0.0 0.0 0 0 ? I< Dec13 0:00 [edac-poller]
root 84 0.0 0.0 0 0 ? I< Dec13 0:00 [devfreq_wq]
root 85 0.0 0.0 0 0 ? S Dec13 0:00 [watchdogd]
root 88 0.0 0.0 0 0 ? S Dec13 4:27 [kswapd0]
root 89 0.0 0.0 0 0 ? S Dec13 0:00 [ecryptfs-kthrea]
root 91 0.0 0.0 0 0 ? I< Dec13 0:00 [kthrotld]
root 92 0.0 0.0 0 0 ? S Dec13 0:00 [irq/24-pciehp]
root 93 0.0 0.0 0 0 ? S Dec13 0:00 [irq/25-pciehp]
root 94 0.0 0.0 0 0 ? S Dec13 0:00 [irq/26-pciehp]
root 95 0.0 0.0 0 0 ? S Dec13 0:00 [irq/27-pciehp]
root 96 0.0 0.0 0 0 ? S Dec13 0:00 [irq/28-pciehp]
root 97 0.0 0.0 0 0 ? S Dec13 0:00 [irq/29-pciehp]
root 98 0.0 0.0 0 0 ? S Dec13 0:00 [irq/30-pciehp]
root 99 0.0 0.0 0 0 ? S Dec13 0:00 [irq/31-pciehp]
root 100 0.0 0.0 0 0 ? S Dec13 0:00 [irq/32-pciehp]
root 101 0.0 0.0 0 0 ? S Dec13 0:00 [irq/33-pciehp]
root 102 0.0 0.0 0 0 ? S Dec13 0:00 [irq/34-pciehp]
root 103 0.0 0.0 0 0 ? S Dec13 0:00 [irq/35-pciehp]
root 104 0.0 0.0 0 0 ? S Dec13 0:00 [irq/36-pciehp]
root 105 0.0 0.0 0 0 ? S Dec13 0:00 [irq/37-pciehp]
root 106 0.0 0.0 0 0 ? S Dec13 0:00 [irq/38-pciehp]
root 107 0.0 0.0 0 0 ? S Dec13 0:00 [irq/39-pciehp]
root 108 0.0 0.0 0 0 ? S Dec13 0:00 [irq/40-pciehp]
root 109 0.0 0.0 0 0 ? S Dec13 0:00 [irq/41-pciehp]
root 110 0.0 0.0 0 0 ? S Dec13 0:00 [irq/42-pciehp]
root 111 0.0 0.0 0 0 ? S Dec13 0:00 [irq/43-pciehp]
root 112 0.0 0.0 0 0 ? S Dec13 0:00 [irq/44-pciehp]
root 113 0.0 0.0 0 0 ? S Dec13 0:00 [irq/45-pciehp]
root 114 0.0 0.0 0 0 ? S Dec13 0:00 [irq/46-pciehp]
root 115 0.0 0.0 0 0 ? S Dec13 0:00 [irq/47-pciehp]
root 116 0.0 0.0 0 0 ? S Dec13 0:00 [irq/48-pciehp]
root 117 0.0 0.0 0 0 ? S Dec13 0:00 [irq/49-pciehp]
root 118 0.0 0.0 0 0 ? S Dec13 0:00 [irq/50-pciehp]
root 119 0.0 0.0 0 0 ? S Dec13 0:00 [irq/51-pciehp]
root 120 0.0 0.0 0 0 ? S Dec13 0:00 [irq/52-pciehp]
root 121 0.0 0.0 0 0 ? S Dec13 0:00 [irq/53-pciehp]
root 122 0.0 0.0 0 0 ? S Dec13 0:00 [irq/54-pciehp]
root 123 0.0 0.0 0 0 ? S Dec13 0:00 [irq/55-pciehp]
root 124 0.0 0.0 0 0 ? I< Dec13 0:00 [acpi_thermal_pm]
root 125 0.0 0.0 0 0 ? S Dec13 0:00 [scsi_eh_0]
root 126 0.0 0.0 0 0 ? I< Dec13 0:00 [scsi_tmf_0]
root 127 0.0 0.0 0 0 ? S Dec13 0:04 [scsi_eh_1]
root 128 0.0 0.0 0 0 ? I< Dec13 0:00 [scsi_tmf_1]
root 130 0.0 0.0 0 0 ? I< Dec13 0:00 [vfio-irqfd-clea]
root 131 0.0 0.0 0 0 ? I< Dec13 0:00 [ipv6_addrconf]
root 141 0.0 0.0 0 0 ? I< Dec13 0:00 [kstrp]
root 144 0.0 0.0 0 0 ? I< Dec13 0:00 [kworker/u5:0]
root 157 0.0 0.0 0 0 ? I< Dec13 0:00 [charger_manager]
root 158 0.0 0.0 0 0 ? I< Dec13 0:47 [kworker/1:1H-kblockd]
root 203 0.0 0.0 0 0 ? S Dec13 0:00 [scsi_eh_2]
root 204 0.0 0.0 0 0 ? I< Dec13 0:00 [scsi_tmf_2]
root 205 0.0 0.0 0 0 ? I< Dec13 0:00 [vmw_pvscsi_wq_2]
root 206 0.0 0.0 0 0 ? I< Dec13 0:00 [cryptd]
root 207 0.0 0.0 0 0 ? I< Dec13 0:44 [kworker/0:1H-kblockd]
root 218 0.0 0.0 0 0 ? S Dec13 1:24 [irq/16-vmwgfx]
root 220 0.0 0.0 0 0 ? I< Dec13 0:00 [ttm_swap]
root 265 0.0 0.0 0 0 ? I< Dec13 0:00 [kdmflush]
root 292 0.0 0.0 0 0 ? I< Dec13 0:00 [raid5wq]
root 332 0.0 0.0 0 0 ? S Dec13 10:31 [jbd2/dm-0-8]
root 333 0.0 0.0 0 0 ? I< Dec13 0:00 [ext4-rsv-conver]
root 404 0.0 0.1 486168 7008 ? S<s Dec13 9:00 /lib/systemd/systemd-journald
root 436 0.0 0.0 21896 4020 ? Ss Dec13 0:05 /lib/systemd/systemd-udevd
root 628 0.0 0.0 0 0 ? I< Dec13 0:00 [kaluad]
root 629 0.0 0.0 0 0 ? I< Dec13 0:00 [kmpath_rdacd]
root 630 0.0 0.0 0 0 ? I< Dec13 0:00 [kmpathd]
root 631 0.0 0.0 0 0 ? I< Dec13 0:00 [kmpath_handlerd]
root 632 0.0 0.4 345920 18328 ? SLsl Dec13 13:24 /sbin/multipathd -d -s
root 642 0.0 0.0 0 0 ? S Dec13 0:00 [jbd2/sda2-8]
root 643 0.0 0.0 0 0 ? I< Dec13 0:00 [ext4-rsv-conver]
root 647 0.0 0.0 0 0 ? S< Dec13 0:00 [loop0]
root 652 0.0 0.0 0 0 ? S< Dec13 0:00 [loop2]
root 653 0.0 0.0 0 0 ? S< Dec13 0:00 [loop3]
root 654 0.0 0.0 0 0 ? S< Dec13 0:00 [loop4]
root 655 0.0 0.0 0 0 ? S< Dec13 0:00 [loop5]
root 656 0.0 0.0 0 0 ? S< Dec13 0:02 [loop6]
root 657 0.0 0.0 0 0 ? S< Dec13 0:00 [loop7]
systemd+ 679 0.0 0.0 90424 3084 ? Ssl Dec13 0:02 /lib/systemd/systemd-timesyncd
root 687 0.0 0.0 46460 2036 ? Ss Dec13 0:00 /usr/bin/VGAuthService
root 688 0.0 0.0 235068 2848 ? Ssl Dec13 18:49 /usr/bin/vmtoolsd
systemd+ 763 0.0 0.1 26920 4344 ? Ss Dec13 0:03 /lib/systemd/systemd-networkd
systemd+ 765 0.0 0.1 24356 4244 ? Ss Dec13 0:05 /lib/systemd/systemd-resolved
root 779 0.0 0.0 238076 3184 ? Ssl Dec13 1:36 /usr/lib/accountsservice/accounts-daemon
root 782 0.0 0.0 5568 2264 ? Ss Dec13 0:01 /usr/sbin/cron -f
message+ 783 0.0 0.0 7544 3852 ? Ss Dec13 0:01 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root 791 0.0 0.0 81960 2772 ? Ssl Dec13 1:49 /usr/sbin/irqbalance --foreground
root 795 0.0 0.0 26300 3352 ? Ss Dec13 0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
syslog 796 0.0 0.0 224348 3520 ? Ssl Dec13 1:55 /usr/sbin/rsyslogd -n -iNONE
root 798 0.0 0.2 932740 9820 ? Ssl Dec13 1:32 /usr/lib/snapd/snapd
root 807 0.0 0.0 16764 2980 ? Ss Dec13 0:02 /lib/systemd/systemd-logind
root 809 0.2 0.5 1654356 21700 ? Ssl Dec13 51:31 /usr/local/bin/teleport start --roles=node --config=/etc/teleport.yaml --pid-file=/run/teleport.pid
daemon 812 0.0 0.0 3792 1840 ? Ss Dec13 0:00 /usr/sbin/atd -f
root 842 0.0 0.0 2860 1448 tty1 Ss+ Dec13 0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
root 848 0.0 0.0 12176 2268 ? Ss Dec13 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root 862 0.0 0.0 105120 3276 ? Ssl Dec13 0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root 912 0.0 0.0 236432 3568 ? Ssl Dec13 0:00 /usr/lib/policykit-1/polkitd --no-debug
mysql 919 10.6 4.6 1610736 185956 ? Sl Dec13 2294:41 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
root 2806 0.0 0.1 972708 4808 ? Ssl Dec13 2:42 /usr/bin/containerd
root 2982 0.0 0.2 1094216 11800 ? Ssl Dec13 3:00 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
root 3017 0.0 0.0 2488 520 ? S Dec13 0:00 bpfilter_umh
root 4241 0.0 0.0 0 0 ? S Dec13 0:03 [jbd2/sdb-8]
root 4242 0.0 0.0 0 0 ? I< Dec13 0:00 [ext4-rsv-conver]
root 4725 0.0 0.0 549044 764 ? Sl Dec13 0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9000 -container-ip 172.17.0.2 -container-port 9000
root 4739 0.0 0.0 111972 588 ? Sl Dec13 0:49 /usr/bin/containerd-shim-runc-v2 -namespace moby -id ccb3a7cfd700e3825b7894f839d47c266e067b9d18e33a6721a208edf90045b5 -address /run/containerd/containerd.soc
root 4761 0.3 1.3 892680 52996 ? Ssl Dec13 72:40 minio server /data
root 24915 0.0 0.0 0 0 ? I< Dec13 0:00 [xfsalloc]
root 24916 0.0 0.0 0 0 ? I< Dec13 0:00 [xfs_mru_cache]
root 24919 0.0 0.0 0 0 ? S Dec13 0:00 [jfsIO]
root 24920 0.0 0.0 0 0 ? S Dec13 0:00 [jfsCommit]
root 24921 0.0 0.0 0 0 ? S Dec13 0:00 [jfsCommit]
root 24922 0.0 0.0 0 0 ? S Dec13 0:00 [jfsSync]
root 236478 0.0 0.0 0 0 ? S< Dec16 0:00 [loop8]
root 2996823 0.6 77.5 3213920 3126564 ? Ss Dec27 1:28 /usr/local/mesh/meshagent
root 3210168 0.0 0.0 0 0 ? I 00:00 0:00 [kworker/0:1-events]
root 3210169 0.0 0.0 0 0 ? I 00:00 0:01 [kworker/0:3-events]
root 3210182 0.0 0.0 0 0 ? I 00:00 0:00 [kworker/1:0-events]
root 3210183 0.0 0.0 0 0 ? I 00:00 0:01 [kworker/1:3-events]
root 3257149 0.0 0.0 0 0 ? I 00:36 0:00 [kworker/u4:3-events_power_efficient]
root 3267720 0.0 0.0 0 0 ? I 00:44 0:00 [kworker/u4:1-events_unbound]
root 3290162 0.0 0.0 0 0 ? I 01:02 0:00 [kworker/u4:0-scsi_tmf_1]
root 3298184 0.0 0.0 0 0 ? I 01:09 0:00 [kworker/u4:2-events_power_efficient]
root 3300887 0.0 0.0 5992 3952 pts/0 Ss 01:12 0:00 bash
root 3302235 0.0 0.0 7648 3260 pts/0 R+ 01:13 0:00 ps waux
When your agent is in this state, what do fdsnapshot and timerinfo return from the console tab?
Here is what is there right now, but I just rebooted it, so I will post again. It usually takes about an hour to grow large.
> fdsnapshot
Chain Timeout: 120045 milliseconds
FD[13] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
FD[12] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
FD[14] (R: 0, W: 0, E: 0) => net.ipcServer
FD[15] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
> timerinfo
Timer: 19.5 minutes (0x14fd6910) [setInterval(), meshcore (InfoUpdate Timer)]
Current Status
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 3322186 0.5 8.2 344696 334256 ? Ss 01:30 0:08 /usr/local/mesh/meshagent
I've got a different agent running that looks to be showing the same thing; here is that info.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 7557 0.1 34.6 743600 708044 ? Ssl Dec27 1:50 /usr/local/mesh/meshagent
Server Info
root@pnap-k8s-utility-01:/usr/local/mesh# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
root@pnap-k8s-utility-01:/usr/local/mesh# uname -a
Linux pnap-k8s-utility-01.pnap.aimitservices.com 4.15.0-123-generic #126-Ubuntu SMP Wed Oct 21 09:40:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Console
> fdsnapshot
Chain Timeout: 91474 milliseconds
FD[12] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
FD[13] (R: 0, W: 0, E: 0) => net.ipcServer
FD[17] (R: 0, W: 0, E: 0) => (stderr) childProcess (pid=26415), Remote Terminal
FD[19] (R: 0, W: 0, E: 0) => (stdout) childProcess (pid=26415), Remote Terminal
FD[14] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
FD[15] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
FD[16] (R: 0, W: 0, E: 0) => https.WebSocketStream, MeshAgent_relayTunnel, Remote Terminal
> timerinfo
Timer: 19.2 minutes (0x2c971000) [setInterval(), meshcore (InfoUpdate Timer)]
If it matters, I have MC set up inside of Kubernetes.
I'm seeing this happen again on the Debian install I mentioned at the beginning of this thread. Oddly, there are also two meshagent sessions running.
fdsnapshot
Huh, interesting. I don't know if this is helpful or not, but I just tried to connect to the endpoint over MC and it looks like meshagent crashed or restarted. It took a minute, but it came back up, and memory usage is now back to normal. Below are the full fdsnapshot and timerinfo from before and after.
There's definitely some kind of loop happening while the agent is inactive.
Interesting... Your agent didn't create a dump file, did it? Normally (on Linux, anyway), it's configured so that if the agent crashes it restarts immediately. I'll see if I can do some testing with sleep states, to see if anything screwy happens with the Linux agent if the platform goes to standby and such.
Found it... maybe not the dump, but the main log file at least.
[2020-12-24 12:28:45 AM] Info: No certificate was found in db
[2020-12-26 03:00:54 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7faf23a37840]
/usr/local/mesh/meshagent() [0x426bec]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7faf23a2409b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-26 04:15:57 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7fd1a04fc840]
/usr/local/mesh/meshagent() [0x426bec]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fd1a04e909b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-26 08:55:50 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7ff4d1b8d840]
/usr/local/mesh/meshagent() [0x426bec]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7ff4d1b7a09b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-26 02:22:08 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f9ffe542840]
/usr/local/mesh/meshagent() [0x426bec]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f9ffe52f09b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-26 03:26:48 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7ff4bb219840]
/usr/local/mesh/meshagent() [0x426be9]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7ff4bb20609b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-26 04:13:31 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7fb3fa07e840]
/usr/local/mesh/meshagent() [0x426be9]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fb3fa06b09b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-26 06:34:35 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f9027487840]
/usr/local/mesh/meshagent() [0x426bf0]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f902747409b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-26 09:00:23 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f553cfc6840]
/usr/local/mesh/meshagent() [0x426be9]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f553cfb309b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-26 11:05:10 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f1dee07d840]
/usr/local/mesh/meshagent() [0x426bf0]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f1dee06a09b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-27 12:16:01 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f5844d1c840]
/usr/local/mesh/meshagent() [0x426bf0]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f5844d0909b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-27 06:50:13 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f857c050840]
/usr/local/mesh/meshagent() [0x426be9]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f857c03d09b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-27 08:16:58 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7fc4197ce840]
/usr/local/mesh/meshagent() [0x426be9]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fc4197bb09b]
/usr/local/mesh/meshagent() [0x40daf1]
[2020-12-29 08:57:37 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7fe244598840]
/usr/local/mesh/meshagent() [0x43d310]
/usr/local/mesh/meshagent() [0x4cfd10]
/usr/local/mesh/meshagent() [0x446371]
/usr/local/mesh/meshagent() [0x446d92]
/usr/local/mesh/meshagent() [0x44d96e]
/usr/local/mesh/meshagent() [0x44dcf5]
/usr/local/mesh/meshagent() [0x49950f]
/usr/local/mesh/meshagent() [0x446371]
/usr/local/mesh/meshagent() [0x47689a]
/usr/local/mesh/meshagent() [0x4456ef]
/usr/local/mesh/meshagent() [0x445fbf]
/usr/local/mesh/meshagent() [0x446d92]
/usr/local/mesh/meshagent() [0x44d96e]
/usr/local/mesh/meshagent() [0x44dcf5]
/usr/local/mesh/meshagent() [0x49950f]
/usr/local/mesh/meshagent() [0x446371]
/usr/local/mesh/meshagent() [0x446d92]
/usr/local/mesh/meshagent() [0x44d96e]
/usr/local/mesh/meshagent() [0x44dcf5]
/usr/local/mesh/meshagent() [0x4918dd]
/usr/local/mesh/meshagent() [0x41e753]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fe24458509b]
/usr/local/mesh/meshagent() [0x40daf1]
Is there a command to run the agent in a debug mode?
I fixed an issue with agents connecting/disconnecting all the time (AgentPing set to 30 now). This seems to have fixed the RAM usage for now, but I will check back tomorrow. Could an agent connecting/disconnecting all the time cause this issue?
That sounds great... is this a setting for the config.plist? I have an AgentPong setting in there that's currently set to 800, but no AgentPing
Ya, it's not in the default configuration, but you can add it. I went off this post to fix the agent issue: https://github.com/Ylianst/MeshCentral/issues/2050
Looking good.
root 1463310 0.0 0.7 42016 30164 ? Ssl 00:00 0:00 /usr/local/mesh/meshagent
2 Hours Later
root 1463310 0.0 0.7 42148 28232 ? Ssl 00:00 0:00 /usr/local/mesh/meshagent
14 hours later, still looking good.
root 1463310 0.0 0.7 42420 29992 ? Ssl 00:00 0:01 /usr/local/mesh/meshagent
Here is my server config as well
{
  "$schema": "http://info.meshcentral.com/downloads/meshcentral-config-schema.json",
  "settings": {
    "mongodb": "mongodb://..../meshcentral",
    "cert": "$SITEURL",
    "WANonly": true,
    "Minify": 1,
    "agentIdleTimeout": 3600,
    "AllowHighQualityDesktop": true,
    "AgentPing": 30,
    "TlsOffload": true,
    "trustedProxy": true,
    "AliasPort": 443,
    "Port": 4430
  },
  "domains": {
    "": {
      "title": "AIM IT Services",
      "title2": "PNAP",
      "certUrl": "https://$SITEURL:443/",
      "agentConfig": [ "webSocketMaskOverride=1" ],
      "NewAccounts": 1,
      "authStrategies": {
        "saml": {
          "newAccounts": true,
          "callbackurl": "https://$SITEURL/auth-saml-callback",
          "entityid": "$SITEURL",
          "idpurl": "https://$IDURL/protocol/saml",
          "cert": "saml-aim.pem"
        }
      }
    }
  }
}
I made the changes to the config.json file and restarted the server, but 2 of the clients I had meshagent on still had a memory leak.
I also noticed, they both had 2 persistent instances of meshagent running, is that normal?
No. While it's showing two instances running, what does "fdsnapshot" return on the agent?
Luckily this firewall has 8 GB of memory, so the leak hasn't been causing any issues that I know of, but this is one I have 2 instances on.
It almost looks like they're the same instance, but the PIDs are different. Here's the fdsnapshot:
> fdsnapshot
Chain Timeout: 119659 milliseconds
FD[12] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
FD[13] (R: 0, W: 0, E: 0) => net.ipcServer
FD[15] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
FD[14] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
That's really weird. That snapshot shows that the agent doesn't think it has any child processes; it looks completely idle. Are you able to run the following command to see what the command line parameters of those two processes are?
ps -ef | grep meshagent
[root @ firewall] ~ # ps -ef | grep meshagent
root 60286 1 0 Jan12 ? 00:02:35 /usr/local/mesh/meshagent
root 117892 113721 0 09:25 pts/1 00:00:00 grep meshagent
[root @ firewall] ~ #
That looks like there's only a single instance running?
Right?... I don't absolutely need these endpoints in MeshCentral, but it is a nice convenience... I'm just wondering why some systems have these leaks and some don't; it feels kind of random.
You said your firewall is running inside Docker? Can you share your Dockerfile? Maybe I can reproduce it over here and try to figure it out.
Sorry, I may have stated our setup incorrectly... I have MeshCentral running in a Docker instance; it sits behind an NGINX reverse proxy. I'll go ahead and post the compose file I've been using and the MeshCentral config.
version: '3.5'
services:
  MeshCentral:
    image: uldiseihenbergs/meshcentral
    container_name: MeshCentral
    restart: always
    cap_add:
      - NET_ADMIN
    ports:
      - 4433:4433
    volumes:
      - /opt/MeshCentral/files/:/home/node/meshcentral/meshcentral-files
      - /opt/MeshCentral/data/:/home/node/meshcentral/meshcentral-data
      - /opt/MeshCentral/backup/:/home/node/meshcentral/meshcentral-backup
      - /opt/MeshCentral/web/:/home/node/meshcentral/meshcentral-web
    networks:
      Network-Bridge:
        ipv4_address: 172.0.0.150
networks:
  Network-Bridge:
    driver: bridge
    name: Network-Bridge
    ipam:
      config:
        - subnet: 172.0.0.0/16
{
  "$schema": "http://info.meshcentral.com/downloads/meshcentral-config-schema.json",
  "settings": {
    "cert": "xxxxxx",
    "WANonly": true,
    "selfupdate": true,
    "noagentupdate": true,
    "webrtc": true,
    "Port": 443,
    "redirPort": 80,
    "AgentPing": 30,
    "redirAliasPort": 80,
    "TlsOffload": "172.0.0.101"
  },
  "domains": {
    "": {
      "title": "Cylanda",
      "title2": "<br/>Cybersecurity | LAN | Data<h2>Remote Monitoring Portal</h2>",
      "welcomeText": "Please enter your email and given password above.<br/>If you are experiencing difficulty connecting to endpoints, need to add/remove users or systems, <br/>or you would like to have you...",
      "titlePicture": "title-cylanda.png",
      "minify": true,
      "newAccounts": false,
      "userNameIsEmail": true,
      "PasswordRequirements": { "min": 8, "max": 64, "upper": 1, "lower": 1, "numeric": 1, "nonalpha": 1, "hint": true, "reset": 90, "force2factor": true },
      "ManageAllDeviceGroups": [ "xxxxx@xxxxx.com" ],
      "Footer": "<a target=_blank href='https://xxxx.com'>xxxx</a>",
      "certUrl": "https://xxx.xxxx.com",
      "userAllowedIp": "xxx.xxx.xxx.xxx"
    }
  },
  "_letsencrypt": {
    "__comment__": "Requires NodeJS 8.x or better, Go to https://letsdebug.net/ first before trying Let's Encrypt.",
    "email": "xxx@xxxx.com",
    "names": "xxx.xxxx.com",
    "production": false
  }
}
Anything I can do to help here? It appears, for me at least, that stopping the agents from constantly disconnecting fixed my RAM issues.
Unfortunately, unless you see something in my config above that could change, I don't know what else to do. Even with the ping time set to 30, the same endpoints with the memory leak were still having issues.
What's the software/hardware configuration of the endpoints having the issue? i.e., what Linux distribution are they running? I can try to create a VM matching your client setup to see if I can reproduce the leaks you are seeing.
I've been running it on Untangle Linux-based firewalls. Most seem to run the agent fine, and for the most part, all of the endpoints are configured the same.
The easiest thing to do for downloading is to go to their wiki page. https://wiki.untangle.com/index.php/NG_Firewall_Downloads
Ok, I was actually able to reproduce the leak when I configured Untangle in a VM. Only my Untangle VM showed this leak... When I dug around, it turns out that the agent has a file watcher on /var/run/utmp. This file normally gets modified when someone logs in or out of the system, but on Untangle (at least the one I had configured in my VM) it behaved much differently: even when nothing was running, /var/run/utmp always showed the current time, meaning something was touching it every second. Because it was getting touched every second, my file watcher triggered every second, and since the normal logic assumes the trigger is a login/logoff event, a child process was spawned each time to query that information. The leak itself was actually a referencing issue with one of the JavaScript objects: it did not delete a table entry, it only set the value to null. That allowed the object to get GC'ed, but it left the table entry behind, causing the table to slowly grow over time.
We plan on fixing this in two different ways. One is to detect the utmp situation and prevent a child process from spawning every second; this can be accomplished with just a server-side update. The actual fix for the leak itself will require an agent update, so that will come separately. The server-side fix should take care of this "leak" in the short term, especially since it probably isn't a good idea to spawn a process every second for no reason, lol... But I have no idea why /var/run/utmp behaves this way on Untangle. I verified with several other distros, and none of them exhibit this behavior.
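The null-vs-delete pattern described above can be shown in a few lines of standalone JavaScript (a minimal sketch, not the actual meshagent source; the table and handler names here are made up for illustration):

```javascript
// Sketch of the leak pattern: each utmp change event adds an entry to a
// lookup table, and "cleanup" only nulls the value instead of deleting
// the key. The value can be garbage-collected, but the key remains, so
// the table gains one slot per event -- once per second on Untangle.
const table = {};
let nextId = 0;

function onUtmpChanged() {
    const id = nextId++;
    table[id] = { pid: id };   // track the spawned child process
    // ...child process exits, and the handler "cleans up":
    table[id] = null;          // BUG: value freed, key leaked
    // delete table[id];       // FIX: removes the entry entirely
}

// Simulate 1000 watcher events (under 20 minutes of once-per-second touches)
for (let i = 0; i < 1000; i++) onUtmpChanged();
console.log(Object.keys(table).length);  // table still holds 1000 keys
```

With `delete table[id]` in place of the null assignment, the table stays empty no matter how many events fire.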
Awesome! I'm so glad you were able to track it down!
I don't know if you had it on or not, but one of the best free features of Untangle is the intrusion detection system. It's not very interactive compared to a standalone dedicated IDS, but what it does is powerful. Unfortunately, it can also be aggressive and is not a set-it-and-forget-it type of deal, so it can cause random network issues that have to be smoothed out over time. I'm wondering whether it might have had anything to do with this, or whether some other security subsystem is to blame.
Wow, this is such a massive difference. Even while in use, the footprint of the meshagent is tiny! Thank you so much for this fix!!
I detected a memory leak too on my Debian servers with the latest mesh version, 0.8.36. It happens on my Proxmox server (Debian-based), on my Proxmox Backup Server (Debian-based), and inside LXC containers (Debian-based). It leaks around 1% of RAM (of 6 GB) every 24 hours. What debug info do I need to post here? P.S. All my servers are behind one router with hairpin NAT configured for Mesh, except one at a remote location with direct internet access to Mesh.
I do still notice this every now and then. It normally happens when I have issues on the server side. I'm running it under K8s behind the NGINX ingress controller. When there is a configuration change in the cluster, the controller process (nginx) can reload and move things around. When that happens, I always have a few agents that spike in RAM usage; I have since added a Zabbix alert, and I just restart them (the service).
While not a fix, I did update the systemd service file (/lib/systemd/system/meshagent.service) to set memory limits:
[Unit]
Description=MeshCentral Agent
[Service]
WorkingDirectory=/usr/local/mesh
ExecStart=/usr/local/mesh/meshagent
StandardOutput=null
Restart=always
RestartSec=3
MemoryAccounting=true
MemoryHigh=200M
MemoryMax=300M
[Install]
WantedBy=multi-user.target
Alias=meshagent.service
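One caveat about editing /lib/systemd/system/meshagent.service directly: a package or agent reinstall can overwrite it. The same limits can live in a drop-in override instead (a sketch, assuming a systemd recent enough to support MemoryHigh, i.e. v231 or later):

```ini
# /etc/systemd/system/meshagent.service.d/memory.conf
[Service]
MemoryAccounting=true
MemoryHigh=200M
MemoryMax=300M
```

After creating the drop-in, run systemctl daemon-reload and restart the agent; systemctl show meshagent -p MemoryHigh -p MemoryMax should confirm the limits were picked up.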
Here is the monitoring of one of the agents
Hmm, interesting, my MC is behind NGINX too. Thanks for the service config!
I was still having the issue, but I swapped from the NGINX ingress to an HAProxy ingress and it seems to have helped. An nginx reload triggers disconnects, which I think was causing the spikes in RAM; HAProxy can reload without disconnecting clients. In a Kubernetes cluster, any change in pods triggers a reload of the ingress controller, which happens all the time.
Red Arrow is where I swapped ingress
Hey All,
I've been loving this project... it's been a lifesaver for some projects I've had running. Recently, though, I discovered 2 Debian clients with a slow memory leak. I have the client installed on a number of other identical endpoints (all Linux firewalls) and haven't experienced this issue anywhere except on these 2 machines. When the issue is occurring, htop shows meshagent using around 2 GB of RAM. It will slowly creep up over time until the system halts and reboots (see the graphic for the past week)... and this happens even when there are no connections, sometimes when there hasn't been a connection for a week or more.
The last two dips in the graphic that don't reach 90-100% is where I uninstalled the agent and re-installed.
I have MC2 running in a docker instance and regularly keep it updated.