Ylianst / MeshCentral

A complete web-based remote monitoring and management web site. Once set up, you can install agents and perform remote desktop sessions to devices on the local network or over the Internet.
https://meshcentral.com
Apache License 2.0

Debian Agent Memory Leak #2040

Closed Matt-CyberGuy closed 2 years ago

Matt-CyberGuy commented 3 years ago

Hey All,

I've been loving this project... it's been a life saver in terms of some projects I've had running. Recently, though, I discovered 2 Debian clients with a slow memory leak. I have the client installed on a number of other identical endpoints (all Linux firewalls), and haven't experienced this issue anywhere except on these 2 machines. When the issue is occurring, htop shows Meshagent using around 2 GB of RAM. It will slowly creep up over time until the system halts and reboots (see the graphic for the past week)... this happens even though there are no connections, and sometimes there hasn't been a connection for a week or more.

The last two dips in the graphic that don't reach 90-100% are where I uninstalled and re-installed the agent.

[screenshot: memory usage graph for the past week]

I have MC2 running in a docker instance and regularly keep it updated.

krayon007 commented 3 years ago

On one of the agents that is consuming a lot of RAM, can you type fdsnapshot in the console tab and let me know what it shows? This should show all the open descriptors the agent thinks it has, and what module/script created them.

Matt-CyberGuy commented 3 years ago

I had to re-install the agent; I don't know if that will make a difference or not, but this is from one of the endpoints:

FD[12] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
FD[13] (R: 0, W: 0, E: 0) => net.ipcServer
FD[14] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
FD[16] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4

krayon007 commented 3 years ago

What version of Debian are you running? I'll set up a test system and monitor it for a while to see if I can discover anything.

Matt-CyberGuy commented 3 years ago

This is what pops up on the command line:

4.19.0-11-untangle-amd64 #1 SMP Debian 4.19.146-1+untangle1buster (2020-09-29) x86_64 GNU/Linux

But the specific product we have running is Untangle firewalls, version 16.1.1. It's not too difficult to get one up and running in a VM. The only issue you might have is that out of the dozen or so firewalls I had it installed on, only 2 were having this issue.

krayon007 commented 3 years ago

Do you know if there were any connectivity issues with those two firewalls? Perhaps the leak has something to do with how frequently the control channel disconnects/reconnects, etc...

Matt-CyberGuy commented 3 years ago

We monitor both in 5-minute increments and neither has been reporting any downtime besides the restarts that were occurring because of the memory leaks. The ISP on the one I sent the diagnostics from is Spectrum, which is the same for 90% of our clients; the other system we had to remove the agent from is located in a data center with a failover connection.

Matt-CyberGuy commented 3 years ago

Hi, thanks again for being so active on this. Just wanted to give an update. I logged into the firewall and can see meshagent is grabbing a chunk of ram again:

[screenshot]

I ran fdsnapshot again, but the results look the same.

krayon007 commented 3 years ago

How much time do you think elapsed between when the agent started and when this snapshot was taken? I tracked down a few minor leaks (kilobytes), so I'm trying to see if they could be contributing to your problem.

Matt-CyberGuy commented 3 years ago

Morning, I’m looking at the memory usage graph now. At the plot points where it looks like the endpoint flushes its memory and meshagent restarts on its own, it varies, but it looks like it takes around 12-14 hours.

Matt-CyberGuy commented 3 years ago

I’ve been running meshagent on my firewall at home and found it was using quite a bit of memory, like the other systems mentioned. I never noticed since I’ve got 8 GB of memory on my firewall. I was able to get an fdsnapshot this time. Hope this helps.

[screenshot: fdsnapshot output]

zanderson-aim commented 3 years ago

I'm seeing the same thing on a newish Ubuntu server, running the latest version of MC and the agent.

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.1 LTS
Release:        20.04
Codename:       focal

Kernel Info

Linux co-k3s-ctrl 5.4.0-54-generic #60-Ubuntu SMP Fri Nov 6 10:37:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Process List

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.2 171164  9192 ?        Ss   Dec13   0:19 /sbin/init
root           2  0.0  0.0      0     0 ?        S    Dec13   0:00 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<   Dec13   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<   Dec13   0:00 [rcu_par_gp]
root           6  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kworker/0:0H-kblockd]
root           8  0.0  0.0      0     0 ?        I<   Dec13   0:00 [mm_percpu_wq]
root           9  0.0  0.0      0     0 ?        S    Dec13   1:23 [ksoftirqd/0]
root          10  0.0  0.0      0     0 ?        I    Dec13   5:53 [rcu_sched]
root          11  0.0  0.0      0     0 ?        S    Dec13   0:06 [migration/0]
root          12  0.0  0.0      0     0 ?        S    Dec13   0:00 [idle_inject/0]
root          14  0.0  0.0      0     0 ?        S    Dec13   0:00 [cpuhp/0]
root          15  0.0  0.0      0     0 ?        S    Dec13   0:00 [cpuhp/1]
root          16  0.0  0.0      0     0 ?        S    Dec13   0:00 [idle_inject/1]
root          17  0.0  0.0      0     0 ?        S    Dec13   0:06 [migration/1]
root          18  0.0  0.0      0     0 ?        S    Dec13   1:23 [ksoftirqd/1]
root          20  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kworker/1:0H-kblockd]
root          21  0.0  0.0      0     0 ?        S    Dec13   0:00 [kdevtmpfs]
root          22  0.0  0.0      0     0 ?        I<   Dec13   0:00 [netns]
root          23  0.0  0.0      0     0 ?        S    Dec13   0:00 [rcu_tasks_kthre]
root          24  0.0  0.0      0     0 ?        S    Dec13   0:00 [kauditd]
root          26  0.0  0.0      0     0 ?        S    Dec13   0:00 [khungtaskd]
root          27  0.0  0.0      0     0 ?        S    Dec13   0:00 [oom_reaper]
root          28  0.0  0.0      0     0 ?        I<   Dec13   0:00 [writeback]
root          29  0.0  0.0      0     0 ?        S    Dec13   0:00 [kcompactd0]
root          30  0.0  0.0      0     0 ?        SN   Dec13   0:00 [ksmd]
root          31  0.0  0.0      0     0 ?        SN   Dec13   0:04 [khugepaged]
root          77  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kintegrityd]
root          78  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kblockd]
root          79  0.0  0.0      0     0 ?        I<   Dec13   0:00 [blkcg_punt_bio]
root          80  0.0  0.0      0     0 ?        I<   Dec13   0:00 [tpm_dev_wq]
root          81  0.0  0.0      0     0 ?        I<   Dec13   0:00 [ata_sff]
root          82  0.0  0.0      0     0 ?        I<   Dec13   0:00 [md]
root          83  0.0  0.0      0     0 ?        I<   Dec13   0:00 [edac-poller]
root          84  0.0  0.0      0     0 ?        I<   Dec13   0:00 [devfreq_wq]
root          85  0.0  0.0      0     0 ?        S    Dec13   0:00 [watchdogd]
root          88  0.0  0.0      0     0 ?        S    Dec13   4:27 [kswapd0]
root          89  0.0  0.0      0     0 ?        S    Dec13   0:00 [ecryptfs-kthrea]
root          91  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kthrotld]
root          92  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/24-pciehp]
root          93  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/25-pciehp]
root          94  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/26-pciehp]
root          95  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/27-pciehp]
root          96  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/28-pciehp]
root          97  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/29-pciehp]
root          98  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/30-pciehp]
root          99  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/31-pciehp]
root         100  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/32-pciehp]
root         101  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/33-pciehp]
root         102  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/34-pciehp]
root         103  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/35-pciehp]
root         104  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/36-pciehp]
root         105  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/37-pciehp]
root         106  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/38-pciehp]
root         107  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/39-pciehp]
root         108  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/40-pciehp]
root         109  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/41-pciehp]
root         110  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/42-pciehp]
root         111  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/43-pciehp]
root         112  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/44-pciehp]
root         113  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/45-pciehp]
root         114  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/46-pciehp]
root         115  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/47-pciehp]
root         116  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/48-pciehp]
root         117  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/49-pciehp]
root         118  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/50-pciehp]
root         119  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/51-pciehp]
root         120  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/52-pciehp]
root         121  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/53-pciehp]
root         122  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/54-pciehp]
root         123  0.0  0.0      0     0 ?        S    Dec13   0:00 [irq/55-pciehp]
root         124  0.0  0.0      0     0 ?        I<   Dec13   0:00 [acpi_thermal_pm]
root         125  0.0  0.0      0     0 ?        S    Dec13   0:00 [scsi_eh_0]
root         126  0.0  0.0      0     0 ?        I<   Dec13   0:00 [scsi_tmf_0]
root         127  0.0  0.0      0     0 ?        S    Dec13   0:04 [scsi_eh_1]
root         128  0.0  0.0      0     0 ?        I<   Dec13   0:00 [scsi_tmf_1]
root         130  0.0  0.0      0     0 ?        I<   Dec13   0:00 [vfio-irqfd-clea]
root         131  0.0  0.0      0     0 ?        I<   Dec13   0:00 [ipv6_addrconf]
root         141  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kstrp]
root         144  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kworker/u5:0]
root         157  0.0  0.0      0     0 ?        I<   Dec13   0:00 [charger_manager]
root         158  0.0  0.0      0     0 ?        I<   Dec13   0:47 [kworker/1:1H-kblockd]
root         203  0.0  0.0      0     0 ?        S    Dec13   0:00 [scsi_eh_2]
root         204  0.0  0.0      0     0 ?        I<   Dec13   0:00 [scsi_tmf_2]
root         205  0.0  0.0      0     0 ?        I<   Dec13   0:00 [vmw_pvscsi_wq_2]
root         206  0.0  0.0      0     0 ?        I<   Dec13   0:00 [cryptd]
root         207  0.0  0.0      0     0 ?        I<   Dec13   0:44 [kworker/0:1H-kblockd]
root         218  0.0  0.0      0     0 ?        S    Dec13   1:24 [irq/16-vmwgfx]
root         220  0.0  0.0      0     0 ?        I<   Dec13   0:00 [ttm_swap]
root         265  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kdmflush]
root         292  0.0  0.0      0     0 ?        I<   Dec13   0:00 [raid5wq]
root         332  0.0  0.0      0     0 ?        S    Dec13  10:31 [jbd2/dm-0-8]
root         333  0.0  0.0      0     0 ?        I<   Dec13   0:00 [ext4-rsv-conver]
root         404  0.0  0.1 486168  7008 ?        S<s  Dec13   9:00 /lib/systemd/systemd-journald
root         436  0.0  0.0  21896  4020 ?        Ss   Dec13   0:05 /lib/systemd/systemd-udevd
root         628  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kaluad]
root         629  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kmpath_rdacd]
root         630  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kmpathd]
root         631  0.0  0.0      0     0 ?        I<   Dec13   0:00 [kmpath_handlerd]
root         632  0.0  0.4 345920 18328 ?        SLsl Dec13  13:24 /sbin/multipathd -d -s
root         642  0.0  0.0      0     0 ?        S    Dec13   0:00 [jbd2/sda2-8]
root         643  0.0  0.0      0     0 ?        I<   Dec13   0:00 [ext4-rsv-conver]
root         647  0.0  0.0      0     0 ?        S<   Dec13   0:00 [loop0]
root         652  0.0  0.0      0     0 ?        S<   Dec13   0:00 [loop2]
root         653  0.0  0.0      0     0 ?        S<   Dec13   0:00 [loop3]
root         654  0.0  0.0      0     0 ?        S<   Dec13   0:00 [loop4]
root         655  0.0  0.0      0     0 ?        S<   Dec13   0:00 [loop5]
root         656  0.0  0.0      0     0 ?        S<   Dec13   0:02 [loop6]
root         657  0.0  0.0      0     0 ?        S<   Dec13   0:00 [loop7]
systemd+     679  0.0  0.0  90424  3084 ?        Ssl  Dec13   0:02 /lib/systemd/systemd-timesyncd
root         687  0.0  0.0  46460  2036 ?        Ss   Dec13   0:00 /usr/bin/VGAuthService
root         688  0.0  0.0 235068  2848 ?        Ssl  Dec13  18:49 /usr/bin/vmtoolsd
systemd+     763  0.0  0.1  26920  4344 ?        Ss   Dec13   0:03 /lib/systemd/systemd-networkd
systemd+     765  0.0  0.1  24356  4244 ?        Ss   Dec13   0:05 /lib/systemd/systemd-resolved
root         779  0.0  0.0 238076  3184 ?        Ssl  Dec13   1:36 /usr/lib/accountsservice/accounts-daemon
root         782  0.0  0.0   5568  2264 ?        Ss   Dec13   0:01 /usr/sbin/cron -f
message+     783  0.0  0.0   7544  3852 ?        Ss   Dec13   0:01 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root         791  0.0  0.0  81960  2772 ?        Ssl  Dec13   1:49 /usr/sbin/irqbalance --foreground
root         795  0.0  0.0  26300  3352 ?        Ss   Dec13   0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
syslog       796  0.0  0.0 224348  3520 ?        Ssl  Dec13   1:55 /usr/sbin/rsyslogd -n -iNONE
root         798  0.0  0.2 932740  9820 ?        Ssl  Dec13   1:32 /usr/lib/snapd/snapd
root         807  0.0  0.0  16764  2980 ?        Ss   Dec13   0:02 /lib/systemd/systemd-logind
root         809  0.2  0.5 1654356 21700 ?       Ssl  Dec13  51:31 /usr/local/bin/teleport start --roles=node --config=/etc/teleport.yaml --pid-file=/run/teleport.pid
daemon       812  0.0  0.0   3792  1840 ?        Ss   Dec13   0:00 /usr/sbin/atd -f
root         842  0.0  0.0   2860  1448 tty1     Ss+  Dec13   0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
root         848  0.0  0.0  12176  2268 ?        Ss   Dec13   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root         862  0.0  0.0 105120  3276 ?        Ssl  Dec13   0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root         912  0.0  0.0 236432  3568 ?        Ssl  Dec13   0:00 /usr/lib/policykit-1/polkitd --no-debug
mysql        919 10.6  4.6 1610736 185956 ?      Sl   Dec13 2294:41 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
root        2806  0.0  0.1 972708  4808 ?        Ssl  Dec13   2:42 /usr/bin/containerd
root        2982  0.0  0.2 1094216 11800 ?       Ssl  Dec13   3:00 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
root        3017  0.0  0.0   2488   520 ?        S    Dec13   0:00 bpfilter_umh
root        4241  0.0  0.0      0     0 ?        S    Dec13   0:03 [jbd2/sdb-8]
root        4242  0.0  0.0      0     0 ?        I<   Dec13   0:00 [ext4-rsv-conver]
root        4725  0.0  0.0 549044   764 ?        Sl   Dec13   0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9000 -container-ip 172.17.0.2 -container-port 9000
root        4739  0.0  0.0 111972   588 ?        Sl   Dec13   0:49 /usr/bin/containerd-shim-runc-v2 -namespace moby -id ccb3a7cfd700e3825b7894f839d47c266e067b9d18e33a6721a208edf90045b5 -address /run/containerd/containerd.soc
root        4761  0.3  1.3 892680 52996 ?        Ssl  Dec13  72:40 minio server /data
root       24915  0.0  0.0      0     0 ?        I<   Dec13   0:00 [xfsalloc]
root       24916  0.0  0.0      0     0 ?        I<   Dec13   0:00 [xfs_mru_cache]
root       24919  0.0  0.0      0     0 ?        S    Dec13   0:00 [jfsIO]
root       24920  0.0  0.0      0     0 ?        S    Dec13   0:00 [jfsCommit]
root       24921  0.0  0.0      0     0 ?        S    Dec13   0:00 [jfsCommit]
root       24922  0.0  0.0      0     0 ?        S    Dec13   0:00 [jfsSync]
root      236478  0.0  0.0      0     0 ?        S<   Dec16   0:00 [loop8]
root     2996823  0.6 77.5 3213920 3126564 ?     Ss   Dec27   1:28 /usr/local/mesh/meshagent
root     3210168  0.0  0.0      0     0 ?        I    00:00   0:00 [kworker/0:1-events]
root     3210169  0.0  0.0      0     0 ?        I    00:00   0:01 [kworker/0:3-events]
root     3210182  0.0  0.0      0     0 ?        I    00:00   0:00 [kworker/1:0-events]
root     3210183  0.0  0.0      0     0 ?        I    00:00   0:01 [kworker/1:3-events]
root     3257149  0.0  0.0      0     0 ?        I    00:36   0:00 [kworker/u4:3-events_power_efficient]
root     3267720  0.0  0.0      0     0 ?        I    00:44   0:00 [kworker/u4:1-events_unbound]
root     3290162  0.0  0.0      0     0 ?        I    01:02   0:00 [kworker/u4:0-scsi_tmf_1]
root     3298184  0.0  0.0      0     0 ?        I    01:09   0:00 [kworker/u4:2-events_power_efficient]
root     3300887  0.0  0.0   5992  3952 pts/0    Ss   01:12   0:00 bash
root     3302235  0.0  0.0   7648  3260 pts/0    R+   01:13   0:00 ps waux
krayon007 commented 3 years ago

When your agent is in this state, what does fdsnapshot and timerinfo return from the console tab?

zanderson-aim commented 3 years ago

Here is what is there right now, but I just rebooted it, so I will post again. It usually takes about an hour to grow large.

> fdsnapshot 
 Chain Timeout: 120045 milliseconds
 FD[13] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
 FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
 FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
 FD[12] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
 FD[14] (R: 0, W: 0, E: 0) => net.ipcServer
 FD[15] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
> timerinfo 
 Timer: 19.5 minutes  (0x14fd6910) [setInterval(), meshcore (InfoUpdate Timer)]

Current Status

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     3322186  0.5  8.2 344696 334256 ?       Ss   01:30   0:08 /usr/local/mesh/meshagent
zanderson-aim commented 3 years ago

I've got a different agent running that looks to be showing the same thing; here is that info:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      7557  0.1 34.6 743600 708044 ?       Ssl  Dec27   1:50 /usr/local/mesh/meshagent

Server Info

root@pnap-k8s-utility-01:/usr/local/mesh# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.5 LTS
Release:        18.04
Codename:       bionic
root@pnap-k8s-utility-01:/usr/local/mesh# uname -a
Linux pnap-k8s-utility-01.pnap.aimitservices.com 4.15.0-123-generic #126-Ubuntu SMP Wed Oct 21 09:40:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Console

> fdsnapshot
 Chain Timeout: 91474 milliseconds
 FD[12] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
 FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
 FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
 FD[13] (R: 0, W: 0, E: 0) => net.ipcServer
 FD[17] (R: 0, W: 0, E: 0) => (stderr) childProcess (pid=26415), Remote Terminal
 FD[19] (R: 0, W: 0, E: 0) => (stdout) childProcess (pid=26415), Remote Terminal
 FD[14] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
 FD[15] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
 FD[16] (R: 0, W: 0, E: 0) => https.WebSocketStream, MeshAgent_relayTunnel, Remote Terminal
> timerinfo 
 Timer: 19.2 minutes  (0x2c971000) [setInterval(), meshcore (InfoUpdate Timer)]
zanderson-aim commented 3 years ago

If it matters, I have MC set up inside of Kubernetes.

Matt-CyberGuy commented 3 years ago

I'm seeing this happen again on the Debian install I mentioned at the beginning of this thread. Oddly, there are also two meshagent sessions running. [screenshot]

fdsnapshot: [screenshot]

[screenshot]

Matt-CyberGuy commented 3 years ago

Huh, interesting, I don't know if this is helpful or not, but I just tried to connect to the endpoint over MC and it looks like meshagent crashed, or restarted. It took a minute, but it came back up, and memory usage is now back to normal. Below are the full fdsnapshot and timerinfo from before and after.

[screenshot: fdsnapshot and timerinfo, before and after]

There's definitely some kind of loop happening while the agent is inactive.

krayon007 commented 3 years ago

Interesting... Your agent didn't create a dump file, did it? Normally (on Linux anyway), it's configured so that if the agent crashes, it restarts immediately. I'll see if I can do some testing with sleep states to see if anything screwy happens with the Linux agent if the platform goes to standby and such.
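
For anyone trying to answer that question on their own box, here is a rough sketch of how to look for crash output next to the agent binary. It assumes the default /usr/local/mesh install path seen in the ps output above, and that crash logs end up alongside the binary; the exact file name may vary by install.

  # List whatever the agent has written next to its binary, then search those
  # files for the crash marker that appears in the log posted further down.
  ls -la /usr/local/mesh/
  grep -Fl '** CRASH **' /usr/local/mesh/* 2>/dev/null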

Matt-CyberGuy commented 3 years ago

Found it... maybe not the dump, but the main log file at least.

[2020-12-24 12:28:45 AM] Info: No certificate was found in db
[2020-12-26 03:00:54 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7faf23a37840]
/usr/local/mesh/meshagent() [0x426bec]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7faf23a2409b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-26 04:15:57 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7fd1a04fc840]
/usr/local/mesh/meshagent() [0x426bec]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fd1a04e909b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-26 08:55:50 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7ff4d1b8d840]
/usr/local/mesh/meshagent() [0x426bec]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7ff4d1b7a09b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-26 02:22:08 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f9ffe542840]
/usr/local/mesh/meshagent() [0x426bec]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f9ffe52f09b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-26 03:26:48 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7ff4bb219840]
/usr/local/mesh/meshagent() [0x426be9]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7ff4bb20609b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-26 04:13:31 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7fb3fa07e840]
/usr/local/mesh/meshagent() [0x426be9]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fb3fa06b09b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-26 06:34:35 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f9027487840]
/usr/local/mesh/meshagent() [0x426bf0]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f902747409b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-26 09:00:23 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f553cfc6840]
/usr/local/mesh/meshagent() [0x426be9]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f553cfb309b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-26 11:05:10 PM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f1dee07d840]
/usr/local/mesh/meshagent() [0x426bf0]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f1dee06a09b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-27 12:16:01 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f5844d1c840]
/usr/local/mesh/meshagent() [0x426bf0]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f5844d0909b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-27 06:50:13 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f857c050840]
/usr/local/mesh/meshagent() [0x426be9]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f857c03d09b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-27 08:16:58 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7fc4197ce840]
/usr/local/mesh/meshagent() [0x426be9]
/usr/local/mesh/meshagent() [0x4cce9b]
/usr/local/mesh/meshagent() [0x420023]
/usr/local/mesh/meshagent() [0x41e849]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fc4197bb09b]
/usr/local/mesh/meshagent() [0x40daf1]

[2020-12-29 08:57:37 AM] ** CRASH **
[/usr/local/mesh/meshagent_C257887C2B841C1F]
/lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7fe244598840]
/usr/local/mesh/meshagent() [0x43d310]
/usr/local/mesh/meshagent() [0x4cfd10]
/usr/local/mesh/meshagent() [0x446371]
/usr/local/mesh/meshagent() [0x446d92]
/usr/local/mesh/meshagent() [0x44d96e]
/usr/local/mesh/meshagent() [0x44dcf5]
/usr/local/mesh/meshagent() [0x49950f]
/usr/local/mesh/meshagent() [0x446371]
/usr/local/mesh/meshagent() [0x47689a]
/usr/local/mesh/meshagent() [0x4456ef]
/usr/local/mesh/meshagent() [0x445fbf]
/usr/local/mesh/meshagent() [0x446d92]
/usr/local/mesh/meshagent() [0x44d96e]
/usr/local/mesh/meshagent() [0x44dcf5]
/usr/local/mesh/meshagent() [0x49950f]
/usr/local/mesh/meshagent() [0x446371]
/usr/local/mesh/meshagent() [0x446d92]
/usr/local/mesh/meshagent() [0x44d96e]
/usr/local/mesh/meshagent() [0x44dcf5]
/usr/local/mesh/meshagent() [0x4918dd]
/usr/local/mesh/meshagent() [0x41e753]
/usr/local/mesh/meshagent() [0x4ce397]
/usr/local/mesh/meshagent() [0x4d56c1]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fe24458509b]
/usr/local/mesh/meshagent() [0x40daf1]
Matt-CyberGuy commented 3 years ago

Is there a command to run the agent in debug mode?

zanderson-aim commented 3 years ago

I fixed an issue with agents connecting/disconnecting all the time (AgentPing set to 30 now). This seems to have fixed the RAM usage for now, but I will check back tomorrow. Could an agent connecting/disconnecting all the time cause this issue?

Matt-CyberGuy commented 3 years ago

That sounds great... is this a setting for config.json? I have an AgentPong setting in there that's currently set to 800, but no AgentPing.

zanderson-aim commented 3 years ago

Ya, it's not in the default configuration, but you can add it. I went off this post to fix the agent issue: https://github.com/Ylianst/MeshCentral/issues/2050
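
For reference, a minimal sketch of where that setting lives: it goes in the settings section of the server's config.json (the value 30 is just the one used in this thread; the full config posted below shows it in context). The server needs a restart afterwards, as mentioned later in the thread.

      {
        "$schema": "http://info.meshcentral.com/downloads/meshcentral-config-schema.json",
        "settings": {
          "AgentPing": 30
        }
      }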

zanderson-aim commented 3 years ago

Looking good.

root     1463310  0.0  0.7  42016 30164 ?        Ssl  00:00   0:00 /usr/local/mesh/meshagent

2 hours later:

root     1463310  0.0  0.7  42148 28232 ?        Ssl  00:00   0:00 /usr/local/mesh/meshagent

zanderson-aim commented 3 years ago

14 hours later, still looking good.

root     1463310  0.0  0.7  42420 29992 ?        Ssl  00:00   0:01 /usr/local/mesh/meshagent

zanderson-aim commented 3 years ago

Here is my server config as well

      {
        "$schema": "http://info.meshcentral.com/downloads/meshcentral-config-schema.json",
        "settings": {
          "mongodb": "mongodb://..../meshcentral",          
          "cert": "$SITEURL",
          "WANonly": true,
          "Minify": 1,
          "agentIdleTimeout": 3600,
          "AllowHighQualityDesktop": true,
          "AgentPing": 30,
          "TlsOffload": true,
          "trustedProxy": true,
          "AliasPort": 443,
          "Port": 4430
        },
        "domains": {
          "": {
            "title": "AIM IT Services",
            "title2": "PNAP",
            "certUrl": "https://$SITEURL:443/",
            "agentConfig": [ "webSocketMaskOverride=1" ],
            "NewAccounts": 1,
            "authStrategies": {
              "saml": {
                "newAccounts": true,
                "callbackurl": "https://$SITEURL/auth-saml-callback",
                "entityid": "$SITEURL",
                "idpurl": "https://$IDURL/protocol/saml",
                "cert": "saml-aim.pem"
              }
            }
          }
        }
      }
Matt-CyberGuy commented 3 years ago

I made the changes to the config.json file and restarted the server, but 2 of the clients I had meshagent on still had a memory leak.

I also noticed, they both had 2 persistent instances of meshagent running, is that normal?

krayon007 commented 3 years ago

I made the changes to the config.json file and restarted the server, but 2 of the clients I had meshagent on still had a memory leak.

I also noticed, they both had 2 persistent instances of meshagent running, is that normal?

No. While it's showing two instances running, what does "fdsnapshot" return on the agent?

Matt-CyberGuy commented 3 years ago

Luckily this firewall has 8 GB of memory, so the leak hasn't been causing any issues that I know of, but this is one where I have 2 instances running.

[screenshot]

It almost looks like they're the same instance, but the PIDs are different. Here's the fdsnapshot

> fdsnapshot
 Chain Timeout: 119659 milliseconds
 FD[12] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
 FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
 FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
 FD[13] (R: 0, W: 0, E: 0) => net.ipcServer
 FD[15] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
 FD[14] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
krayon007 commented 3 years ago

That's really weird. That snapshot shows that the agent doesn't think it has any child processes. It looks completely idle. Are you able to run the following command to see what the command-line parameters of those two processes are?

ps -ef | grep meshagent

Matt-CyberGuy commented 3 years ago
[root @ firewall] ~ # ps -ef | grep meshagent

root      60286      1  0 Jan12 ?        00:02:35 /usr/local/mesh/meshagent
root     117892 113721  0 09:25 pts/1    00:00:00 grep meshagent

[root @ firewall] ~ # 
krayon007 commented 3 years ago
[root @ firewall] ~ # ps -ef | grep meshagent

root      60286      1  0 Jan12 ?        00:02:35 /usr/local/mesh/meshagent
root     117892 113721  0 09:25 pts/1    00:00:00 grep meshagent

[root @ firewall] ~ # 

That looks like there's only a single instance running?

Matt-CyberGuy commented 3 years ago

Right?... I don't absolutely need these endpoints in MeshCentral, but it is a nice convenience... I'm just wondering why some systems have these leaks and some don't, feels kind of random

krayon007 commented 3 years ago

You said your firewall is running inside Docker? Can you share your Dockerfile? Maybe I can reproduce it over here and try to figure it out?

Matt-CyberGuy commented 3 years ago

Sorry, I don't know if I stated our setup incorrectly... I have MeshCentral running in a Docker instance; it sits behind an NGINX reverse proxy. I'll go ahead and post the compose file I've been using and the MeshCentral config.

version: '3.5'

services:

  MeshCentral:
    image: uldiseihenbergs/meshcentral
    container_name: MeshCentral
    restart: always
    cap_add:
      - NET_ADMIN
    ports:
    - 4433:4433
    volumes:
    - /opt/MeshCentral/files/:/home/node/meshcentral/meshcentral-files
    - /opt/MeshCentral/data/:/home/node/meshcentral/meshcentral-data
    - /opt/MeshCentral/backup/:/home/node/meshcentral/meshcentral-backup
    - /opt/MeshCentral/web/:/home/node/meshcentral/meshcentral-web
    networks:
      Network-Bridge:
        ipv4_address: 172.0.0.150

networks:
  Network-Bridge:
    driver: bridge
    name: Network-Bridge
    ipam:
     config:
       - subnet: 172.0.0.0/16
{
  "$schema": "http://info.meshcentral.com/downloads/meshcentral-config-schema.json",
  "settings": {
    "cert": "xxxxxx",
    "WANonly": true,
    "selfupdate": true,
    "noagentupdate": true,
    "webrtc": true,
    "Port": 443,
    "redirPort": 80,
    "AgentPing": 30,
    "redirAliasPort": 80,
    "TlsOffload": "172.0.0.101"
  },
  "domains": {
    "": {
      "title": "Cylanda",
      "title2": "<br/>Cybersecurity | LAN | Data<h2>Remote Monitoring Portal</h2>",
      "welcomeText": "Please enter your email and given password above.<br/>If you are experiencing difficulty connecting to endpoints, need to add/remove users or systems, <br/>or you would like to have you...",
      "titlePicture": "title-cylanda.png",
      "minify": true,
      "newAccounts": false,
      "userNameIsEmail": true,
      "PasswordRequirements": { "min": 8, "max": 64, "upper": 1, "lower": 1, "numeric": 1, "nonalpha": 1, "hint": true, "reset": 90, "force2factor": true },
      "ManageAllDeviceGroups":["xxxxx@xxxxx.com"],
      "Footer": "<a target=_blank href='https://xxxx.com'>xxxx</a>",
      "certUrl": "https://xxx.xxxx.com",
      "userAllowedIp" : "xxx.xxx.xxx.xxx",
    }
  },
  "_letsencrypt": {
    "__comment__": "Requires NodeJS 8.x or better, Go to https://letsdebug.net/ first before trying Let's Encrypt.",
    "email": "xxx@xxxx.com",
    "names": "xxx.xxxx.com",
    "production": false
  }
}
zanderson-aim commented 3 years ago

Anything I can do to help here? It appears, for me at least, that fixing the agents constantly disconnecting fixed my RAM issues.

Matt-CyberGuy commented 3 years ago

Unfortunately, unless you see something in my config above that could change, I don't know what else to do. Even with the ping time set to 30, the same endpoints with the memory leak were still having issues.

krayon007 commented 3 years ago

Unfortunately, unless you see something in my config above that could change, I don't know what else to do. Even with the ping time set to 30, the same endpoints with the memory leak were still having issues.

What's the software/hardware configuration of your endpoints having issues? i.e., what Linux distribution are they running? I can try to create a VM that recreates your client setup to see if I can reproduce the leaks you are seeing.

Matt-CyberGuy commented 3 years ago

I've been running it on Untangle linux based firewalls. Most seem to run the agent fine, and for the most part, all of the endpoints are configured the same.

The easiest thing to do for downloading is to go to their wiki page. https://wiki.untangle.com/index.php/NG_Firewall_Downloads

krayon007 commented 3 years ago

I've been running it on Untangle linux based firewalls. Most seem to run the agent fine, and for the most part, all of the endpoints are configured the same.

The easiest thing to do for downloading is to go to their wiki page. https://wiki.untangle.com/index.php/NG_Firewall_Downloads

Ok, I was actually able to reproduce the leak when I configured Untangle in a VM. I found that only my Untangle VM showed this leak... When I dug around, it turns out that I have a file watcher on /var/run/utmp. This normally gets modified when someone logs in or out of the system... I found that on Untangle (at least the one I had configured in my VM), behaved much differently... Even when nothing was running, /var/run/utmp always showed the current time, meaning something was touching it every second... Because it was getting touched every second, this caused my file watcher to trigger every second... Because the normal logic is that this is triggered on login/logoff events, a child process was spawned to query that information....The leak itself, was actually a referencing issue with one of the javascript objects, that did not delete a table entry, it only set the value to null... This cauased the object to get GC'ed, but left the table entry, causing the table to slowly grow over time...

We plan on fixing this in two different ways... One is to detect the utmp situation and prevent a child process from spawning every second... This can be accomplished with just a server-side update. The actual fix to the leak itself will require an agent update, so that will come separately... The server-side fix should take care of this "leak" in the short term, especially since it probably isn't a good idea to spawn a process every second for no reason, lol... But I have no idea why /var/run/utmp is behaving this way on Untangle. I verified with several other distros, and none of them exhibit this behavior.
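
If anyone wants to check whether their own box shows the utmp behavior described above, a rough sketch (plain coreutils, nothing MeshCentral-specific) is to watch the file's modification time while nobody is logging in or out; on an affected Untangle box it should change every second.

  # Print /var/run/utmp's mtime once per second; a timestamp that keeps
  # changing with no login/logoff activity matches the behavior described above.
  while true; do
    stat -c '%y' /var/run/utmp
    sleep 1
  done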

Matt-CyberGuy commented 3 years ago

Awesome! I'm so glad you were able to track it down!

I don't know if you had it on or not, but one of the best free features of Untangle is the intrusion detection system. It's not very interactive compared to a standalone dedicated IDS, but what it does is powerful. It unfortunately can also be aggressive and is not a set-it-and-forget-it type of deal, so it can cause random network issues that have to be smoothed out over time. I'm wondering if it might have had anything to do with it, or if some other security subsystem might be to blame.

Matt-CyberGuy commented 3 years ago

Wow, this is such a massive difference. Even while in use, the footprint of the meshagent is tiny! Thank you so much for this fix!!

hellofaduck commented 3 years ago

I detected a memory leak too on my Debian servers with the latest mesh version, 0.8.36. It happens on my Proxmox server (Debian based), on my Proxmox Backup Server (Debian based), and inside LXC containers (Debian based). It leaks around 1% of RAM (6 GB) every 24 hours. What debug info do I need to post here for debugging? P.S. All my servers are behind one router with hairpin NAT configured for Mesh, except one at a remote location with direct access to mesh through the internet.

zanderson-aim commented 3 years ago

I do still notice this every now and then. It normally happens when I have issues on the server side. I'm running it under K8s behind the NGINX ingress controller. When there is a configuration change in the cluster, this can cause the controller process (nginx) to reload and move things around. When that happens I always have a few agents that spike in RAM usage; I have since added a Zabbix alert and I just restart them (the service).

While not a fix, I did update the systemd service file (/lib/systemd/system/meshagent.service) to set memory limits:

[Unit]
Description=MeshCentral Agent
[Service]
WorkingDirectory=/usr/local/mesh
ExecStart=/usr/local/mesh/meshagent
StandardOutput=null
Restart=always
RestartSec=3
MemoryAccounting=true
MemoryHigh=200M
MemoryMax=300M
[Install]
WantedBy=multi-user.target
Alias=meshagent.service
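
If you borrow that unit file, note that the changes only take effect after a daemon reload and a service restart; roughly the following, assuming the unit is installed as meshagent.service as shown above:

  # Reload systemd's view of the unit, restart the agent, then confirm
  # the memory limits were picked up.
  systemctl daemon-reload
  systemctl restart meshagent
  systemctl show meshagent -p MemoryHigh -p MemoryMax -p MemoryCurrent
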
zanderson-aim commented 3 years ago

Here is the monitoring of one of the agents: [screenshot]

hellofaduck commented 3 years ago

Hmm, interesting, my MC is behind NGINX too, and thanks for the service config!

zanderson-aim commented 3 years ago

Still having the issue, but I swapped from the NGINX ingress to an HAProxy ingress and it seems to have helped. An NGINX reload will trigger disconnects, which I think was causing the spikes in RAM; HAProxy can reload without disconnecting clients. In the Kubernetes cluster, any change in pods triggers a reload of the ingress controller, which happens all the time.

The red arrow is where I swapped the ingress.

[screenshot]