Open edlins opened 4 years ago
This is one for Bryan, but a few questions since he will want to know.
Are you doing remote desktop on this device? Is remote desktop possible or is there no X-Windows installed? When you see the CPU going up, are there two "meshagent" processes and one of them is taking the CPU? Or is there only one MeshAgent process? Is this 32 or 64bit agent? Is this a RaspberryPi?
Thanks, Ylian
Are you doing remote desktop on this device?
Yes, but normally it's only about 20% with the remote desktop.
When you see the CPU going up, are there two "meshagent" processes and one of them is taking the CPU? Or is there only one MeshAgent process?
One. This was captured a few seconds prior to system crash due to overheat:
Feb 29 15:01:02 firefly root: top - 15:01:02 up 2 days, 3:08, 2 users, load average: 1.20, 1.22, 1.19
Feb 29 15:01:02 firefly root: Tasks: 199 total, 2 running, 197 sleeping, 0 stopped, 0 zombie
Feb 29 15:01:02 firefly root: %Cpu(s): 14.0 us, 2.5 sy, 0.0 ni, 83.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
Feb 29 15:01:02 firefly root: KiB Mem : 2045124 total, 688464 free, 367800 used, 988860 buff/cache
Feb 29 15:01:02 firefly root: KiB Swap: 0 total, 0 free, 0 used. 1549636 avail Mem
Feb 29 15:01:02 firefly root:
Feb 29 15:01:02 firefly root: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
Feb 29 15:01:02 firefly root: 481 root 20 0 18904 8028 4100 R 100.0 0.4 1651:25 meshagent
Feb 29 15:01:02 firefly root: 689 root 20 0 250964 135824 82908 S 5.9 6.6 62:29.58 Xorg
Feb 29 15:01:02 firefly root: 15480 root 20 0 6708 2636 2212 R 5.9 0.1 0:00.03 top
Is this 32 or 64bit agent?
$ file /usr/local/mesh/meshagent
/usr/local/mesh/meshagent: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0, BuildID[sha1]=12fc23e1c18af7bab2ea9155f4fc9039a623352e, stripped
Is this a RaspberryPi?
Nope, it's a RockChip board. Pi's at least have heatsinks..
EDIT: FWIW it's a quad-core armv7 rev 1 (v7l). Might make the top output more understandable (83.4% idle).
Let me know if your RockChip board is commonly available and what OS you are running. Maybe I can order the same.
The vendor claims it's based on an RK3288 SoC (https://en.wikipedia.org/wiki/Rockchip_RK3288) but it's not a dev board. It's in a production integrated unit: https://www.prodvx.com/products/appc-10slb. The OS is firefly linux: Linux firefly 4.4.154 #1 SMP Thu Dec 12 16:03:05 CST 2019 armv7l armv7l armv7l GNU/Linux
I only see this sporadically, like maybe once a week or so. Is there additional logging or a verbose mode I can enable? IIRC I was able to resolve using $ sudo systemctl restart meshagent.
Just a thought, I could run meshagent through strace or ptrace if that would be useful. The logs would have to be rapidly rotated..
I've got the problem case live right now. This is what I have so far.
Most recent prior reboot at 11:23:
Feb 27 11:23:33 firefly kernel: [ 0.000000] Initializing cgroup subsys cpuset
meshagent initially running as PID 3528, switches to PID 12776 (fine), switches to PID 467 (hot):
Feb 27 11:46:02 firefly root: 3528 root 20 0 24784 10908 2472 S 23.5 0.5 0:01.46 meshagent
Feb 27 12:46:01 firefly root: 3528 root 20 0 27356 9480 2472 R 11.8 0.5 13:17.18 meshagent
Feb 27 13:29:02 firefly root: 12776 root 20 0 28940 11876 3196 R 23.5 0.6 0:16.10 meshagent
Feb 27 14:00:01 firefly root: 12776 root 20 0 23688 9740 2408 S 29.4 0.5 9:25.06 meshagent
Feb 27 14:07:02 firefly root: 467 root 20 0 19152 8056 4092 R 100.0 0.4 0:51.45 meshagent
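For anyone who wants to pull the PID switches out of a longer log automatically, here is a rough sketch that parses lines in the format shown above. The names `TopSample`, `parse_top_line`, and `pid_switches` are made up for illustration, and it assumes exactly one meshagent line per logged snapshot; with several meshagent processes per snapshot it would report false switches.

```python
from typing import List, NamedTuple, Optional, Tuple


class TopSample(NamedTuple):
    timestamp: str  # syslog timestamp, e.g. "Feb 27 14:07:02"
    pid: int
    state: str      # top's S column (R, S, ...)
    pcpu: float     # top's %CPU column
    command: str


def parse_top_line(line: str) -> Optional[TopSample]:
    """Parse one syslog-wrapped top process row; return None for other lines."""
    parts = line.split()
    # Layout: month day time host tag PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    if len(parts) < 17 or not parts[5].isdigit():
        return None  # top summary lines, the column header, or blank entries
    return TopSample(" ".join(parts[:3]), int(parts[5]), parts[12],
                     float(parts[13]), parts[16])


def pid_switches(lines: List[str], command: str = "meshagent") -> List[Tuple[str, int, int]]:
    """Return (timestamp, old_pid, new_pid) each time the logged PID changes."""
    switches = []
    last_pid = None
    for line in lines:
        sample = parse_top_line(line)
        if sample is None or sample.command != command:
            continue
        if last_pid is not None and sample.pid != last_pid:
            switches.append((sample.timestamp, last_pid, sample.pid))
        last_pid = sample.pid
    return switches
```

Feeding it the five log lines above should report two switches, 3528 to 12776 and 12776 to 467, which can then be lined up against the mesh event log.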
The board has heated up enough now that mmcblk2 is throwing errors and the fs has gone read-only. So I'm not sure of the validity of any testing I can do right now. But I'm investigating.
EDIT: Interesting side point: with a read-only root fs I can't ssh in, I can't mesh the desktop, but I CAN mesh a terminal so that's my only working remote access..
Seems like Windows 32bit is doing the same: https://github.com/Ylianst/MeshCentral/issues/1002
For now I'm using ps-watcher to systemctl restart meshagent when process lifetime %CPU goes over 25 (typically runs around 22). But that's not ideal because it's lifetime pcpu (as reported by ps), not pcpu since last check (as reported by top). monit doesn't work because it does a first-match on process name but it's the child process (usually second-match) that runs hot.
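To illustrate the lifetime-vs-interval difference, here is a minimal sketch that measures %CPU since the last check for every meshagent process, not just the first match. It assumes a Linux /proc filesystem and that the process name in /proc/<pid>/comm is "meshagent". The function names (`cpu_ticks`, `interval_pcpu`, `find_pids`, `report`) are invented for this example and are not part of ps-watcher, monit, or MeshCentral.

```python
import os
import time
from typing import Dict, List


def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split after the last ')' instead of on whitespace alone.
    fields = stat_line[stat_line.rfind(")") + 2:].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def interval_pcpu(ticks_before: int, ticks_after: int, elapsed_s: float, hz: int) -> float:
    """CPU percentage of one core used between two samples."""
    return 100.0 * (ticks_after - ticks_before) / hz / elapsed_s


def find_pids(name: str = "meshagent") -> List[int]:
    """Return every PID whose /proc/<pid>/comm equals name (all matches)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited while we were scanning
    return pids


def report(interval_s: float = 10.0, name: str = "meshagent") -> Dict[int, float]:
    """Sample all matching processes twice and return {pid: %CPU over the interval}."""
    hz = os.sysconf("SC_CLK_TCK")
    before = {}
    for pid in find_pids(name):
        try:
            with open(f"/proc/{pid}/stat") as f:
                before[pid] = cpu_ticks(f.read())
        except OSError:
            continue
    time.sleep(interval_s)
    result = {}
    for pid, ticks in before.items():
        try:
            with open(f"/proc/{pid}/stat") as f:
                result[pid] = interval_pcpu(ticks, cpu_ticks(f.read()), interval_s, hz)
        except OSError:
            continue  # process went away during the interval
    return result
```

Calling `report(10.0)` from cron or a loop and comparing each value against a threshold would give a since-last-check trigger for the systemctl restart, per process, so a hot child is not hidden behind a quiet parent.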
If there is anything else you want me to check let me know. I imagine a debug flag for meshagent would be useful here. systemd captures all stdout and stderr and routes it to the syslog (actually the journal...). Also our server and the agents are quickly upgradeable without jumping through a lot of hoops.
Are you using plugins on your MeshCentral server? Just checking as there are reports of that causing 100% CPU in #1002. Thanks.
Nope, no plugins. Sorry for the delay. Lots of notifications. Still running ps-watcher but it hasn't been triggered for a week on either of two devices.
Is the agent idle when the CPU is at 100%? The original post mentioned remote desktop, so I was curious if this behavior you are seeing is only when kvm is started, or just happens by itself with no interaction..
It's a little hard for me to know as we have multiple employees logging in and out. Does mesh log new desktop connections anywhere so I can correlate with my monitoring logs? I would think that would be a great thing to send to syslog. EDIT: just saw that I can review the event log tab, will do that.
Also wanted to correct my earlier statement of one vs two processes: I'm not sure. My monitor was only pulling the top processes. Only one process was at 100% but there may have been a second process. I have already modified my monitor to search for all meshagent processes.
Lastly, I am having issues with the hardware where the mmc block device is throwing errors and the kernel has remounted the root filesystem read-only. This breaks everything, but amazingly mesh desktop, terminal, and file xfer all still work. So, nice work not depending on a read-write root filesystem. Anyway, I can't eliminate the hardware as a source of this problem for now. And I have a workaround with ps-watcher so I consider this relatively low priority.
EDIT: Just now the fs was read-only and meshagent was working but running 100%. I did a systemctl restart meshagent which immediately booted me (expected) but the agent didn't come back online. :( I was connected via terminal. Not sure if anyone was connected via desktop.
If you click on the event log tab in the browser, it should show all of the times when remote desktop, or terminal, or files were connected, and by which user.
We have same issue with Meshagent running 100% of 1 CPU core. Few notes:
When you look at the processes with ps, what parameters were the second meshagent process started with?
I can document it next time I notice the problem
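In case it helps with capturing that, here is a small sketch that lists every meshagent process with its parent PID and full argument list, read from /proc. It assumes Linux and that /proc/<pid>/comm reads "meshagent"; the helper names (`parse_cmdline`, `parent_pid`, `list_processes`) are hypothetical and only for this example.

```python
import os
from typing import List, Tuple


def parse_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments."""
    # Empty pieces (including the trailing one after the final NUL) are dropped.
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def parent_pid(stat_line: str) -> int:
    """Return the parent PID (field 4) from one /proc/<pid>/stat line."""
    # Split after the last ')' because the command name may contain spaces.
    return int(stat_line[stat_line.rfind(")") + 2:].split()[1])


def list_processes(name: str = "meshagent") -> List[Tuple[int, int, List[str]]]:
    """Return (pid, ppid, argv) for every process whose comm equals name."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() != name:
                    continue
            with open(f"/proc/{entry}/stat") as f:
                ppid = parent_pid(f.read())
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                argv = parse_cmdline(f.read())
        except OSError:
            continue  # process exited while being inspected
        found.append((int(entry), ppid, argv))
    return found
```

Logging the output of `list_processes()` next to the CPU samples would show whether a second meshagent process exists when the CPU is pegged, which process is its parent, and what parameters it was started with.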
Hello,
I have mesh 0.4.9-k with some armv7 linux devices running meshagent that sometimes run a sustained 100% of the CPU. Normally they run about 17-22% but when they run 100% the devices heat up and crash (no fans or heatsinks). Is there something I can check when I see an agent consuming 100%? IIRC systemctl restart meshagent returns it to normal. I suppose I could just cron that command or run monit but just wanted to see if there's anything else I can check. Thanks!!