Open NiceGuyIT opened 1 year ago
That looks like a Windows service error. Windows service said startup RMM...and when it checked to see if it was running, after 30000 milliseconds it wasn't. You sure AV didn't kill it?
Check agent.log to see if there's an error in there.
I started TacticalAgent from Mesh and this server does not have Bitdefender. I did not find any other event logs relating to the service and there was nothing in the agent.log
.
It seems this message is coming from Windows when trying to start the service. This kb article explains how the registry key ServicesPipeTimeout
can be increased from the default 30 seconds to 60 seconds. I view that as a workaround, not solution.
The answers to this StackOverflow question have many other scenarios where this error may occur, including scenarios that do not relate to a timeout.
While some of this may be due to my environment, there are things the TacticalAgent can do. One is to add dependencies to the service so the service control manager starts it a little later in the boot process. I.e. after networking is available. Another option is to add a scheduled task to start the service if it's not running.
Relevant Grafana issue 2060:
Similar to windows_exporter (And a bunch of other Go applications on windows) Grafana Agent can fail to start as a service on windows following a windows update or other high CPU Event during Service startup.
I wrote a fairly detailed analysis on the issue here that explains the cause but in an attempt to be brief this comes down to the way go initialises packages vs the time in which windows expects a response from an application starting as a service within 30s or 60s depending on your version of windows as per this diagram:
There's a PR linked in that issue that may be of use.
Here's the relevant Go issue to fix the runtime: Windows service timeout during system startup.
...when I start on a computer, it's usually 1-2 seconds for the service to start and show as running by windows checks.
How long is it there, and how can we measure that "time to running" (I know there's a powershell measure command that might do it?)?
I'm thinking under normal conditions TRMM agent from start request to running is less than 5 seconds. Are you sure there's not other extenuating circumstances in your test there making it take longer than 30 seconds? Is it nebula network delays that might be causing TRMM to stutter?
This happens only with high CPU usage. This thread specifically talks about this error happening only after rebooting to apply patches. This does not happen all the time. This is the first time I encountered this scenario while running TRMM for more than a year.
How long is it there, and how can we measure that "time to running" (I know there's a powershell measure command that might do it?)?
You can't measure this externally. The above thread mentioned they added an event log as the first action in main()
but that was not getting called. This is because the timeout is in the actual startup of the program as explained in the Grafana issue.
'm thinking under normal conditions TRMM agent from start request to running is less than 5 seconds.
Under normal circumstances, that's true.
other extenuating circumstances in your test there making it take longer than 30 seconds?
Installing patches after a reboot could trigger this, but not all the time.
The link in the Grafana issue (here) explains how they are able to cause this to happen by limiting the CPU to 1/4 of a CPU in Hyper-V.
This issue is to address the Go runtime initialization slowness under high CPU load, as well as identify options that can be applied to alleviate the scenario.
One server was offline and after researching the cause, I discovered there was an event log stating "A timeout was reached (30000 milliseconds) while waiting for the tacticalrmm service to connect.". It would be nice if the service (all OS's) was configured to stay running as best it can. For connectivity issues, retry logic is preferable over exiting after an initial failure to connect. If there's a domain configured, doing a fresh DNS lookup (can the agent clear the DNS cache?) and ping'ing the API until it's able to connect would be nice. If there's no domain configured, or if the agent configuration is corrupt, of course generate a friendly error message and exit.
Note: It's possible this could happen if the agent was restarted (computer rebooted) while the server was being updated and the API unavailable.