Closed ryszard-suchocki closed 2 years ago
Where are your agents hosted?
Could you clarify in more simple words? The whole setup works in a simple environment, in LAN. Linux Agent works on a physical machine with "direct" access to TRMM. Other agents (Win) communicate fine (local and remote).
ok, I am having issues with amazon agents but fine for all others
Could you elaborate on how you register agents? My approach was:
you need to keep it running via systemd or something similar on your distro
Installed using the above script with code-signed agents. Working fine on an Ubuntu 20.04 test VM I made on my local VMware Workstation. Then I deployed it on some AWS and Azure VMs I have (a mix of Ubuntu 20.04 and CentOS 7), and I'm having the issue described in the OP: they go offline a few minutes after running their first checks. The agents are running in systemd as suggested, and systemctl restart tacticalagent.service will bring them back to "online" status in the dashboard, but they slowly go back to offline again. Curious what to try next.
Edit: Further information about some examples of agents below
Agent that's working fine: Ubuntu 20.04 x86_64 5.4.0-105-generic • Agent v2.0.0
AWS Ubuntu agent that's going offline: Ubuntu 20.04 x86_64 5.13.0-1017-aws • Agent v2.0.0
Azure Ubuntu agent that's going offline: Ubuntu 20.04 x86_64 5.13.0-1017-azure • Agent v2.0.0
Azure CentOS agent that's going offline: Centos 7.9.2009 x86_64 3.10.0-1160.53.1.el7.x86_64 • Agent v2.0.0
Happy to provide any other troubleshooting information as-needed.
@georgebarnick please enable debug logging so we can see where it's getting stuck
modify /etc/systemd/system/tacticalagent.service
and change
ExecStart=/usr/local/bin/tacticalagent -m svc
to
ExecStart=/usr/local/bin/tacticalagent -m svc -log debug
(add the -log debug)
then systemctl daemon-reload && systemctl restart tacticalagent
wait for the agent to go offline, then let's see what's in /var/log/tacticalagent.log
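The manual edit above could also be scripted. The following is a minimal sketch, assuming the unit file contains exactly the ExecStart line quoted above; the function name enable_debug_log is made up for illustration and is not part of the agent or its installer.

```shell
#!/bin/sh
# Hypothetical helper: append "-log debug" to the tacticalagent ExecStart
# line of the given unit file. Safe to run more than once.
# Usage: enable_debug_log /etc/systemd/system/tacticalagent.service
enable_debug_log() {
    unit_file="$1"
    # Nothing to do if the flag is already on the ExecStart line.
    if grep -q '^ExecStart=.*-log debug' "$unit_file"; then
        return 0
    fi
    # Keep a .bak copy, then append the flag to the matching line only.
    sed -i.bak 's|^ExecStart=\(.*tacticalagent -m svc\)[[:space:]]*$|ExecStart=\1 -log debug|' "$unit_file"
}
```

After running it against the real unit file (as root), systemctl daemon-reload && systemctl restart tacticalagent is still needed for the change to take effect.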
@wh1te909 So far the only things in the log after the agent service restarts and goes through its checks and everything the first time is:
time="2022-03-21T20:02:23Z" level=debug msg="Checkrunner sleeping for 120"
every few minutes and
time="2022-03-21T20:02:24Z" level=debug msg="{Status:{Cmd:/opt/tacticalmesh/meshagent PID:0 Complete:false Exit:-1 Error:fork/exec /opt/tacticalmesh/meshagent: no such file or directory StartTs:1647892944163829150 StopTs:1647892944164151273 Runtime:0 Stdout:[] Stderr:[]} Stdout: Stderr:}\n"
every second.
I installed with the --nomesh flag on most if not all of these VMs that are going offline. Not sure if that's related to the agent going offline or a separate issue, but maybe @ryszard-suchocki can chime in on whether he has the Mesh Agent on his affected install. The reason I used --nomesh was that the install seemed to get stuck on the "Getting mesh node id" step on one of them, so I just decided to omit it from all of them. I could try to reinstall with the mesh agent if you need and have an idea of why it might have gotten stuck there. I'm no expert with MeshCentral yet, so I haven't troubleshot that myself.
In my case, Mesh Agent had been installed before, separately from TRMM. I did not use the -nomesh parameter when "installing" TRMM. So I decided to remove my agent and "install" it again, passing the -nomesh and -log debug parameters. Despite the -nomesh parameter, the log file got filled with:
896886173760991 Runtime:0 Stdout:[] Stderr:[]} Stdout: Stderr:}\n"
time="2022-03-21T22:08:07+01:00" level=debug msg="{Status:{Cmd:/opt/tacticalmesh/meshagent PID:0 Complete:false Exit:-1 Error:fork/exec /opt/tacticalmesh/meshagent: no such file or directory StartTs:1647896887174316611
so I decided to manually copy the meshagent executable to the specified folder (which did not exist and had to be created manually). Now the log looks like the one below, the agent status is correct, and the last response time is updated correctly:
time="2022-03-21T22:08:08+01:00" level=debug
time="2022-03-21T22:08:08+01:00" level=debug msg="{Status:{Cmd:/opt/tacticalmesh/meshagent PID:249577 Complete:true Exit:0 Error:<nil> StartTs:1647896888175850463 StopTs:1647896888267528965 Runtime:0.091678527 Stdout:[] Stderr:[]} Stdout:\n Stderr:}\n"
time="2022-03-21T22:08:10+01:00" level=debug msg="Checking for windows updates"
time="2022-03-21T22:08:32+01:00" level=debug msg="Checkrunner sleeping for 120"
time="2022-03-21T22:09:06+01:00" level=debug msg="agent-hello {jX****************************cyixu 2.0.0}"
time="2022-03-21T22:10:02+01:00" level=debug msg="agent-hello {jX****************************cyixu 2.0.0}"
time="2022-03-21T22:10:32+01:00" level=debug msg="Checkrunner sleeping for 120"
time="2022-03-21T22:10:58+01:00" level=debug msg="agent-hello {jX****************************cyixu 2.0.0}"
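To compare a healthy log like the one above with a failing one, a small summary can help. This is only a sketch: it assumes the log line format shown in this thread and treats the agent-hello lines as the check-in, which is an inference from the logs rather than documented behavior. The function name summarize_agent_log is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count failed meshagent launches and show the
# timestamp of the most recent agent-hello line in a debug log.
# Usage: summarize_agent_log /var/log/tacticalagent.log
summarize_agent_log() {
    log_file="$1"
    # grep -c exits non-zero when the count is 0, so ignore its status.
    mesh_errors=$(grep -c 'meshagent: no such file or directory' "$log_file" || true)
    # Extract the time="..." value from the last agent-hello line, if any.
    last_hello=$(grep 'agent-hello' "$log_file" | tail -n 1 | sed 's/^time="\([^"]*\)".*/\1/')
    echo "mesh exec errors: $mesh_errors"
    echo "last agent-hello: ${last_hello:-none}"
}
```

If the error count keeps growing while the last agent-hello timestamp stops advancing, that matches the offline behavior described in this thread.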
thanks I will do some testing without mesh. The agent should still check in without mesh so that is probably a bug
so from my initial testing with --nomesh (been about 12 hours now on a few VMs) I get that error in the logs about not finding the executable, which is obviously expected, but the agent continues to check in and doesn't freeze, which is also expected, so I'm still not sure why your agents are going offline. I have not tested on AWS or Azure, though; I will do that today.
I have found that some of the agents that were dying stay online after installing mesh, but some aren't staying online long enough to install mesh, or get stuck on "Getting Mesh node ID.....". It doesn't seem to be just AWS; it seems to be random machines, across CentOS and Ubuntu.
A few moments ago I removed the "mesh agent" executable from "/opt/tacticalmesh" and the issue occurred again. Would someone like to try my "installation" steps? I can share my builds and generated config for analysis. It is worth noting that in my case "mesh agent" still runs in the background, as it was installed separately.
Edit: I have tried running the RMM Agent on my NAS (Asustor). Same behavior: without the "mesh agent" executable the status changes to offline; with the executable placed in "/opt/tacticalmesh" everything works fine.
@ryszard-suchocki yes please share your installation steps
I am still unable to reproduce, I have been testing for a few days now, with mesh, without mesh. On azure, AWS, hetzner etc. Not able to reproduce at all
My env: Proxmox 6.X, agent built in an Ubuntu 20.04 container (ubuntu-20.04-standard_20.04-1_amd64.tar.gz; Rel. 2021-04-05 13:09:49):
Go: https://go.dev/dl/go1.17.8.linux-amd64.tar.gz && tar -C /usr/local/ -xzf go1.17.8.linux-amd64.tar.gz
Source: wget https://github.com/amidaware/rmmagent/archive/refs/tags/v2.0.0.zip && apt install unzip
Build: env CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags "-s -w"
Copy the rmmagent executable to the destination host.
Install (first with -m install --api https://trmm.tld/ --client-id X --site-id X --agent-type server --auth a2c4e...XXXXXXXX, then with -nomesh added):
./rmmagent -m install --api https://trmm.tld/ --client-id X --site-id X --agent-type server --auth a2c4e...XXXXXXXX -nomesh
@ryszard-suchocki please use the installation script that I linked to in a previous comment and see how that installs it and uses systemd to keep it running
also can you try send command and send df -h and see if it works?
Mine goes offline but can still send commands
ok all, never mind, I found the bug: I forgot to spawn the function that attempts to sync the meshnodeid into its own goroutine, so it basically hangs forever when mesh is not installed LOL. Will push a fix shortly.
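The agent itself is written in Go, so the following is only a shell analogy of the failure mode described above, not the agent's code. A wait that never returns, placed ahead of the check-in loop, starves that loop; running it as a background job (standing in for a goroutine) lets the check-ins proceed. All function names and the polled path are hypothetical.

```shell
#!/bin/sh
# Stand-in for the mesh node id sync: polls for a file that never
# appears when mesh is not installed. Illustrative only.
sync_mesh_node_id() {
    while [ ! -e "$1" ]; do sleep 1; done
}

# Stand-in for the check-in loop: prints the given number of heartbeats.
heartbeat() {
    i=0
    while [ "$i" -lt "$2" ]; do
        echo "$1 $i"
        i=$((i + 1))
    done
}

# Fixed variant: the sync runs in the background, so the heartbeat
# loop starts immediately even if the sync never finishes.
# Calling sync_mesh_node_id without "&" here would block forever
# whenever the file is missing, which is the analogue of the bug.
run_agent_fixed() {
    sync_mesh_node_id "$1" >/dev/null 2>&1 &
    sync_pid=$!
    heartbeat "agent-hello" "$2"
    # Clean up the background job once the demo loop is done.
    kill "$sync_pid" 2>/dev/null || true
    wait "$sync_pid" 2>/dev/null || true
}
```

For example, run_agent_fixed /nonexistent/meshnodeid 3 prints three agent-hello lines even though the polled file never shows up.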
Fixed!
Hi, I'm testing the community beta Linux Agent for TRMM. I want to report that after a while the Linux agent goes offline (its status changes to offline), although the checks work fine. It is also possible to invoke remote commands, etc., so there is communication between agent and server. Could you verify on your side?
• Ubuntu 20.04 x86_64 5.4.0-104-generic • Agent v2.0.0
Temporarily I'm running the agent by invoking
./rmmagent -m svc
Best regards