amidaware / rmmagent

Tactical RMM Agent
https://github.com/amidaware/tacticalrmm

Linux Agent goes offline #1

Closed ryszard-suchocki closed 2 years ago

ryszard-suchocki commented 2 years ago

Hi, I'm testing the community beta Linux Agent for TRMM. I want to report that after a while the Linux agent goes offline (its status changes to offline), although the checks work fine. It is also possible to invoke remote commands, etc., so there is communication between the agent and the server. Could you verify on your side?

• Ubuntu 20.04 x86_64 5.4.0-104-generic
• Agent v2.0.0

Temporarily, I'm running the agent by invoking ./rmmagent -m svc

Best regards

dinger1986 commented 2 years ago

Where are your agents hosted?

ryszard-suchocki commented 2 years ago

Could you clarify in simpler words? The whole setup runs in a simple environment, on a LAN. The Linux Agent runs on a physical machine with "direct" access to TRMM. Other agents (Win) communicate fine (local and remote).

dinger1986 commented 2 years ago

ok, I am having issues with amazon agents but fine for all others

ryszard-suchocki commented 2 years ago

Could you elaborate on how you register agents? My approach was:

  1. Click Agents
  2. Install Agent -> Windows; I choose Client, Site
  3. Install Method -> Manual - copy the data required to register a new agent.
  4. On linux box -> ./rmmagent -m install --api https://trmm.tld --client-id X --site-id X --agent-type server --auth a2c4e...XXXXXXXX
  5. ./rmmagent -m svc

wh1te909 commented 2 years ago

https://github.com/amidaware/tacticalrmm/blob/develop/api/tacticalrmm/core/agent_linux.sh this should help

wh1te909 commented 2 years ago

you need to keep it running via systemd or something similar on your distro

georgebarnick commented 2 years ago

Installed using the above script with code-signed agents. Working fine on an Ubuntu 20.04 test VM I made on my local VMware Workstation, with no issues. Then I deployed it on some AWS and Azure VMs I have (a mix of Ubuntu 20.04 and CentOS 7), and I'm having the issue described in the OP, where they go offline a few minutes after running their first checks. The agents are running in systemd as suggested, and systemctl restart tacticalagent.service will bring them back to "online" status in the dashboard, but they slowly go back to offline again. Curious what to try next.

Edit: Further information about some examples of agents below

• Agent that's working fine: Ubuntu 20.04 x86_64 5.4.0-105-generic, Agent v2.0.0
• AWS Ubuntu agent that's going offline: Ubuntu 20.04 x86_64 5.13.0-1017-aws, Agent v2.0.0
• Azure Ubuntu agent that's going offline: Ubuntu 20.04 x86_64 5.13.0-1017-azure, Agent v2.0.0
• Azure CentOS agent that's going offline: Centos 7.9.2009 x86_64 3.10.0-1160.53.1.el7.x86_64, Agent v2.0.0

Happy to provide any other troubleshooting information as-needed.

wh1te909 commented 2 years ago

@georgebarnick please enable debug logging so we can see where it's getting stuck. Modify /etc/systemd/system/tacticalagent.service and change

ExecStart=/usr/local/bin/tacticalagent -m svc

to

ExecStart=/usr/local/bin/tacticalagent -m svc -log debug

(add the -log debug), then run systemctl daemon-reload && systemctl restart tacticalagent. Wait for the agent to go offline, then let's see what's in /var/log/tacticalagent.log

georgebarnick commented 2 years ago

@wh1te909 So far, the only things in the log after the agent service restarts and goes through its checks the first time are:

time="2022-03-21T20:02:23Z" level=debug msg="Checkrunner sleeping for 120"

every few minutes and

time="2022-03-21T20:02:24Z" level=debug msg="{Status:{Cmd:/opt/tacticalmesh/meshagent PID:0 Complete:false Exit:-1 Error:fork/exec /opt/tacticalmesh/meshagent: no such file or directory StartTs:1647892944163829150 StopTs:1647892944164151273 Runtime:0 Stdout:[] Stderr:[]} Stdout: Stderr:}\n"

every second.

I installed with the --nomesh flag on most if not all of these VMs that are going offline. Not sure if that's going to be related to the agent going offline or a separate issue, but maybe @ryszard-suchocki can chime in if he has the Mesh Agent with his affected install or not. The reason I did --nomesh was that the install seemed to get stuck on the "Getting mesh node id" step on one of them, so I just decided to omit it from all of them. I could try to reinstall with the mesh agent if you need and have an idea on why it might have gotten stuck there. I'm no expert with MeshCentral yet so haven't troubleshot that myself.
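For context on the repeating log line above: the "fork/exec ...: no such file or directory" text is the error Go's os/exec package returns when asked to start a binary that does not exist at the given path. A minimal sketch of that behavior, assuming a Unix-like system; runMesh is a hypothetical helper for illustration, not a function from rmmagent:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os/exec"
)

// runMesh (hypothetical name) tries to run the binary at path.
// It returns the raw error and whether that error means the file is missing.
func runMesh(path string) (missing bool, err error) {
	err = exec.Command(path).Run()
	return errors.Is(err, fs.ErrNotExist), err
}

func main() {
	// Path taken from the log lines in this thread.
	missing, err := runMesh("/opt/tacticalmesh/meshagent")
	if missing {
		fmt.Println("mesh agent binary not found:", err)
		return
	}
	fmt.Println("mesh agent ran, err =", err)
}
```

Checking with errors.Is(err, fs.ErrNotExist) lets a caller treat "mesh not installed" as an expected condition instead of logging it as a failure on every attempt.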

ryszard-suchocki commented 2 years ago

In my case, Mesh Agent had been installed before, separately from TRMM. I did not use the -nomesh parameter when "installing" TRMM. So I decided to remove my agent and "install" it again, passing the -nomesh and -log debug parameters. Despite the -nomesh parameter, the log file filled up with:

896886173760991 Runtime:0 Stdout:[] Stderr:[]} Stdout: Stderr:}\n"
time="2022-03-21T22:08:07+01:00" level=debug msg="{Status:{Cmd:/opt/tacticalmesh/meshagent PID:0 Complete:false Exit:-1 Error:fork/exec /opt/tacticalmesh/meshagent: no such file or directory StartTs:1647896887174316611 

so I decided to manually copy the meshagent executable to the specified folder (which did not exist and had to be created manually). Now the log looks like the one below, the agent status is correct, and the last response time is updated correctly:

time="2022-03-21T22:08:08+01:00" level=debug
time="2022-03-21T22:08:08+01:00" level=debug msg="{Status:{Cmd:/opt/tacticalmesh/meshagent PID:249577 Complete:true Exit:0 Error:<nil> StartTs:1647896888175850463 StopTs:1647896888267528965 Runtime:0.091678527 Stdout:[] Stderr:[]} Stdout:\n Stderr:}\n"
time="2022-03-21T22:08:10+01:00" level=debug msg="Checking for windows updates"
time="2022-03-21T22:08:32+01:00" level=debug msg="Checkrunner sleeping for 120"
time="2022-03-21T22:09:06+01:00" level=debug msg="agent-hello {jX****************************cyixu 2.0.0}"
time="2022-03-21T22:10:02+01:00" level=debug msg="agent-hello {jX****************************cyixu 2.0.0}"
time="2022-03-21T22:10:32+01:00" level=debug msg="Checkrunner sleeping for 120"
time="2022-03-21T22:10:58+01:00" level=debug msg="agent-hello {jX****************************cyixu 2.0.0}"

wh1te909 commented 2 years ago

thanks I will do some testing without mesh. The agent should still check in without mesh so that is probably a bug

wh1te909 commented 2 years ago

so from my initial testing with --nomesh (been about 12 hours now on a few VMs), I get that error in the logs about not finding the executable, which obviously is expected, but the agent continues to check in and doesn't freeze, which is also expected. So I'm still not sure why your agents are going offline. I have not tested on AWS or Azure though; I will do that today

dinger1986 commented 2 years ago

I have found that some of the agents that were dying stay online after installing mesh, but some aren't staying online long enough to install mesh, or get stuck on "Getting Mesh node ID". It doesn't seem to be just AWS; it seems to be random machines, across CentOS and Ubuntu

ryszard-suchocki commented 2 years ago

A few moments ago I removed the "mesh agent" executable from "/opt/tacticalmesh" and the issue occurred again. Would someone like to try my "installation" steps? I can share my builds and generated config for analysis. It is worth noting that in my case the "mesh agent" still runs in the background, as it was installed separately.

Edit: I have tried running the RMM Agent on my NAS (Asustor). Same behavior: without the "mesh agent" executable the status changes to offline; when the executable is placed in "/opt/tacticalmesh" everything works fine.

wh1te909 commented 2 years ago

@ryszard-suchocki yes please share your installation steps

I am still unable to reproduce, I have been testing for a few days now, with mesh, without mesh. On azure, AWS, hetzner etc. Not able to reproduce at all

ryszard-suchocki commented 2 years ago

My env: Proxmox 6.X, agent built in an Ubuntu 20.04 container (ubuntu-20.04-standard_20.04-1_amd64.tar.gz; Rel. 2021-04-05 13:09:49):

  1. Deploy container
  2. apt update && apt upgrade
  3. wget https://go.dev/dl/go1.17.8.linux-amd64.tar.gz && tar -C /usr/local/ -xzf go1.17.8.linux-amd64.tar.gz
  4. nano /etc/environment and add /usr/local/go/bin to PATH
  5. wget https://github.com/amidaware/rmmagent/archive/refs/tags/v2.0.0.zip && apt install unzip
  6. cd rmmagent2.0
  7. env CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags "-s -w"
  8. scp rmmagent executable to destination host

Install:

  1. TRMM UI -> click Agents
  2. Install Agent -> Windows; I choose Client, Site
  3. Install Method -> Manual - copy the data required to register a new agent (-m install --api https://trmm.tld/ --client-id X --site-id X --agent-type server --auth a2c4e...XXXXXXXX)
  4. On Linux box -> ./rmmagent -m install --api https://trmm.tld/ --client-id X --site-id X --agent-type server --auth a2c4e...XXXXXXXX **-nomesh**
  5. ./rmmagent -m svc -l debug

wh1te909 commented 2 years ago

@ryszard-suchocki please use the installation script that I linked to in a previous comment and see how that installs it and uses systemd to keep it running

dinger1986 commented 2 years ago

also, can you try Send Command with df -h and see if it works?

Mine goes offline but can still send commands

wh1te909 commented 2 years ago

ok all nevermind I found the bug. I forgot to spawn the function that attempts to sync the meshnodeid into its own goroutine, so it basically hangs forever when mesh is not installed LOL. will push a fix shortly

ryszard-suchocki commented 2 years ago

Fixed!