Ylianst / MeshCentral

A complete web-based remote monitoring and management site. Once set up, you can install agents and run remote desktop sessions on devices on the local network or over the Internet.
https://meshcentral.com
Apache License 2.0

linux meshagent CPU sometimes running 100% #988

Open edlins opened 4 years ago

edlins commented 4 years ago

Hello,

I have mesh 0.4.9-k with some armv7 Linux devices running meshagent that sometimes run at a sustained 100% CPU. Normally they run at about 17-22%, but when they hit 100% the devices heat up and crash (no fans or heatsinks). Is there something I can check when I see an agent consuming 100%? IIRC systemctl restart meshagent returns it to normal. I suppose I could just cron that command or run monit, but I wanted to see if there's anything else I can check. Thanks!!

Ylianst commented 4 years ago

This is one for Bryan, but a few questions since he will want to know.

Are you doing remote desktop on this device? Is remote desktop possible or is there no X-Windows installed? When you see the CPU going up, are there two "meshagent" processes and one of them is taking the CPU? Or is there only one MeshAgent process? Is this 32 or 64bit agent? Is this a RaspberryPi?

Thanks, Ylian

edlins commented 4 years ago

Are you doing remote desktop on this device?

Yes, but normally it's only about 20% with the remote desktop.

When you see the CPU going up, are there two "meshagent" processes and one of them is taking the CPU? Or is there only one MeshAgent process?

One. This was captured a few seconds prior to system crash due to overheat:

Feb 29 15:01:02 firefly root: top - 15:01:02 up 2 days,  3:08,  2 users,  load average: 1.20, 1.22, 1.19
Feb 29 15:01:02 firefly root: Tasks: 199 total,   2 running, 197 sleeping,   0 stopped,   0 zombie
Feb 29 15:01:02 firefly root: %Cpu(s): 14.0 us,  2.5 sy,  0.0 ni, 83.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
Feb 29 15:01:02 firefly root: KiB Mem :  2045124 total,   688464 free,   367800 used,   988860 buff/cache
Feb 29 15:01:02 firefly root: KiB Swap:        0 total,        0 free,        0 used.  1549636 avail Mem
Feb 29 15:01:02 firefly root:
Feb 29 15:01:02 firefly root:   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
Feb 29 15:01:02 firefly root:   481 root      20   0   18904   8028   4100 R 100.0  0.4   1651:25 meshagent
Feb 29 15:01:02 firefly root:   689 root      20   0  250964 135824  82908 S   5.9  6.6  62:29.58 Xorg
Feb 29 15:01:02 firefly root: 15480 root      20   0    6708   2636   2212 R   5.9  0.1   0:00.03 top

Is this 32 or 64bit agent?

$ file /usr/local/mesh/meshagent
/usr/local/mesh/meshagent: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=12fc23e1c18af7bab2ea9155f4fc9039a623352e, stripped

Is this a RaspberryPi?

Nope, it's a RockChip board. Pis at least have heatsinks.

EDIT: FWIW it's a quad-core armv7 rev 1 (v7l). That might make the top output easier to read: meshagent at 100% is a single core, so the system as a whole still shows 83.4% idle.

Ylianst commented 4 years ago

Let me know if your RockChip board is commonly available and what OS you are running. Maybe I can order the same.

edlins commented 4 years ago

The vendor claims it's based on an RK3288 SoC (https://en.wikipedia.org/wiki/Rockchip_RK3288), but it's not a dev board. It's in a production integrated unit (https://www.prodvx.com/products/appc-10slb). The OS is firefly linux:

Linux firefly 4.4.154 #1 SMP Thu Dec 12 16:03:05 CST 2019 armv7l armv7l armv7l GNU/Linux

I only see this sporadically, maybe once a week or so. Is there additional logging or a verbose mode I can enable? IIRC I was able to resolve it using $ sudo systemctl restart meshagent.

edlins commented 4 years ago

Just a thought: I could run meshagent through strace or ptrace if that would be useful. The logs would have to be rotated rapidly, though.

edlins commented 4 years ago

I've got the problem case live right now. This is what I have so far. Most recent prior reboot at 11:23:

Feb 27 11:23:33 firefly kernel: [    0.000000] Initializing cgroup subsys cpuset

meshagent initially running as PID 3528, switches to PID 12776 (fine), switches to PID 467 (hot):

Feb 27 11:46:02 firefly root:  3528 root      20   0   24784  10908   2472 S  23.5  0.5   0:01.46 meshagent
Feb 27 12:46:01 firefly root:  3528 root      20   0   27356   9480   2472 R  11.8  0.5  13:17.18 meshagent
Feb 27 13:29:02 firefly root: 12776 root      20   0   28940  11876   3196 R  23.5  0.6   0:16.10 meshagent
Feb 27 14:00:01 firefly root: 12776 root      20   0   23688   9740   2408 S  29.4  0.5   9:25.06 meshagent
Feb 27 14:07:02 firefly root:   467 root      20   0   19152   8056   4092 R 100.0  0.4   0:51.45 meshagent

The board has heated up enough now that mmcblk2 is throwing errors and the fs has gone read-only. So I'm not sure of the validity of any testing I can do right now. But I'm investigating.

EDIT: Interesting side point: with a read-only root fs I can't ssh in and I can't mesh the desktop, but I CAN mesh a terminal, so that's my only working remote access.

Ylianst commented 4 years ago

Seems like Windows 32bit is doing the same: https://github.com/Ylianst/MeshCentral/issues/1002

edlins commented 4 years ago

For now I'm using ps-watcher to systemctl restart meshagent when the process lifetime %CPU goes over 25 (it typically runs around 22). That's not ideal, because ps reports pcpu averaged over the whole process lifetime, not pcpu since the last check (which is what top reports), so a process that has been running for a long time can spin at 100% for quite a while before its lifetime average crosses the threshold. monit doesn't work because it does a first-match on process name, but it's the child process (usually second-match) that runs hot.
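A minimal sketch of the "pcpu since last check" measurement, assuming a Linux /proc filesystem with the field layout documented in proc(5). It is not part of ps-watcher, monit, or MeshCentral, and the function names (parse_cpu_ticks, interval_cpu_percent, sample_cpu_percent) are made up for illustration. It reads utime and stime from /proc/<pid>/stat twice and divides the difference by the elapsed time:

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime, in clock ticks, from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are
    # fields 14 and 15, which land at indexes 11 and 12 here.
    return int(fields[11]) + int(fields[12])


def interval_cpu_percent(ticks_before, ticks_after, seconds, ticks_per_second):
    """CPU use over the interval, as a percentage of one core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_second * seconds)


def sample_cpu_percent(pid, seconds=5.0):
    """Sample one process twice and return its CPU percentage in between."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        before = parse_cpu_ticks(f.read())
    started = time.monotonic()
    time.sleep(seconds)
    with open(path) as f:
        after = parse_cpu_ticks(f.read())
    elapsed = time.monotonic() - started
    return interval_cpu_percent(before, after, elapsed, ticks_per_second)
```

A watcher built on this could call sample_cpu_percent(pid) for each meshagent PID and restart the service only after several consecutive samples exceed a threshold. This avoids reacting to a single short spike while still catching a process that stays pinned.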

If there is anything else you want me to check, let me know. I imagine a debug flag for meshagent would be useful here. systemd captures all stdout and stderr and routes it to syslog (actually the journal...). Also, our server and the agents are quickly upgradeable without jumping through a lot of hoops.

Ylianst commented 4 years ago

Are you using plugins on your MeshCentral server? Just checking, as there are reports of that causing 100% CPU in #1002. Thanks.

edlins commented 4 years ago

Nope, no plugins. Sorry for the delay. Lots of notifications. Still running ps-watcher but it hasn't been triggered for a week on either of two devices.

krayon007 commented 4 years ago

Is the agent idle when the CPU is at 100%? The original post mentioned remote desktop, so I was curious whether the behavior you are seeing only occurs when kvm is started, or whether it happens by itself with no interaction.

edlins commented 4 years ago

It's a little hard for me to know as we have multiple employees logging in and out. Does mesh log new desktop connections anywhere so I can correlate with my monitoring logs? I would think that would be a great thing to send to syslog. EDIT: just saw that I can review the event log tab, will do that.

Also wanted to correct my earlier statement of one vs two processes: I'm not sure. My monitor was only pulling the top processes. Only one process was at 100% but there may have been a second process. I have already modified my monitor to search for all meshagent processes.
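For reference, one way a monitor might enumerate every process with a given name, rather than only the top entries, is to walk /proc directly. This is a rough sketch assuming a Linux /proc layout. The helper names parse_stat and find_processes are hypothetical, and this is not how the monitor in this thread is actually implemented:

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, ppid) from one /proc/<pid>/stat line."""
    # comm sits between the first '(' and the last ')' and may contain spaces.
    left, right = stat_line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    fields = right.split()
    # fields[0] is the state, fields[1] is the parent PID.
    return int(pid_text), comm, int(fields[1])


def find_processes(name, proc_root="/proc"):
    """Return sorted (pid, ppid) pairs for every process whose comm is name."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, ppid = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            # The process may have exited between listdir() and open().
            continue
        if comm == name:
            matches.append((pid, ppid))
    return sorted(matches)
```

Calling find_processes("meshagent") returns each matching PID together with its parent PID, which makes it possible to tell whether the hot process is the service's main process or a child. Note that the kernel truncates comm to 15 characters, which does not affect a name as short as meshagent.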

Lastly, I am having issues with the hardware where the mmc block device is throwing errors and the kernel has remounted the root filesystem read-only. This breaks everything, but amazingly mesh desktop, terminal, and file xfer all still work. So, nice work not depending on a read-write root filesystem. Anyway, I can't eliminate the hardware as a source of this problem for now. And I have a workaround with ps-watcher so I consider this relatively low priority.

EDIT: Just now the fs was read-only and meshagent was working but running 100%. I did a systemctl restart meshagent which immediately booted me (expected) but the agent didn't come back online. :( I was connected via terminal. Not sure if anyone was connected via desktop.

krayon007 commented 4 years ago

If you click on the event log tab in the browser, it should show all of the times when remote desktop, or terminal, or files were connected, and by which user.

mjakovina commented 4 years ago

We have the same issue with Meshagent running at 100% of one CPU core. A few notes:

  1. It happens on both Linux and Windows (64-bit, not sure about 32-bit).
  2. When it happens I notice 2 Meshagent processes, one of them with high CPU usage.
  3. We do not use any plugins.

krayon007 commented 4 years ago

When you look at the processes with ps, what parameters were the second meshagent process started with?
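On Linux, the start parameters of a running process can also be captured from /proc/<pid>/cmdline, where the arguments are stored separated by NUL bytes. The sketch below assumes that layout. The helper names split_cmdline and read_cmdline are illustrative only, and it makes no claim about which arguments a meshagent child process actually receives:

```python
import os


def split_cmdline(raw):
    """Split the raw bytes of /proc/<pid>/cmdline into a list of arguments."""
    stripped = raw.rstrip(b"\0")
    if not stripped:
        # Kernel threads and zombies expose an empty cmdline.
        return []
    return [part.decode("utf-8", errors="replace")
            for part in stripped.split(b"\0")]


def read_cmdline(pid, proc_root="/proc"):
    """Return the argument list a running process was started with."""
    with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as f:
        return split_cmdline(f.read())
```

If a monitoring script logs read_cmdline(pid) for every meshagent PID at the moment the CPU spike is detected, the parameters are preserved even if the process is restarted before anyone can look.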

mjakovina commented 4 years ago

I can document it the next time I notice the problem.