Resiliency againt startup issues

NiceGuyIT commented 1 year ago

One server was offline and after researching the cause, I discovered there was an event log stating "A timeout was reached (30000 milliseconds) while waiting for the tacticalrmm service to connect.". It would be nice if the service (all OS's) was configured to stay running as best it can. For connectivity issues, retry logic is preferable over exiting after an initial failure to connect. If there's a domain configured, doing a fresh DNS lookup (can the agent clear the DNS cache?) and ping'ing the API until it's able to connect would be nice. If there's no domain configured, or if the agent configuration is corrupt, of course generate a friendly error message and exit.

Note: It's possible this could happen if the agent was restarted (computer rebooted) while the server was being updated and the API unavailable.

Log Name:      System
Source:        Service Control Manager
Date:          1/10/2023 5:22:50 PM
Event ID:      7009
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Win-Server04.****.****
Description:
A timeout was reached (30000 milliseconds) while waiting for the tacticalrmm service to connect.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Service Control Manager" Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}" EventSourceName="Service Control Manager" />
    <EventID Qualifiers="49152">7009</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8080000000000000</Keywords>
    <TimeCreated SystemTime="2023-01-11T01:22:50.657621300Z" />
    <EventRecordID>136367</EventRecordID>
    <Correlation />
    <Execution ProcessID="552" ThreadID="2404" />
    <Channel>System</Channel>
    <Computer>Win-Server04.****.****</Computer>
    <Security />
  </System>
  <EventData>
    <Data Name="param1">30000</Data>
    <Data Name="param2">tacticalrmm</Data>
    <Binary>74006100630074006900630061006C0072006D006D000000</Binary>
  </EventData>
</Event>

silversword411 commented 1 year ago

That looks like a Windows service error. Windows service said startup RMM...and when it checked to see if it was running, after 30000 milliseconds it wasn't. You sure AV didn't kill it?

Check agent.log to see if there's an error in there.

NiceGuyIT commented 1 year ago

I started TacticalAgent from Mesh and this server does not have Bitdefender. I did not find any other event logs relating to the service and there was nothing in the agent.log.

It seems this message is coming from Windows when trying to start the service. This kb article explains how the registry key ServicesPipeTimeout can be increased from the default 30 seconds to 60 seconds. I view that as a workaround, not solution.

The answers to this StackOverflow question have many other scenarios where this error may occur, including scenarios that do not relate to a timeout.

While some of this may be due to my environment, there are things the TacticalAgent can do. One is to add dependencies to the service so the service control manager starts it a little later in the boot process. I.e. after networking is available. Another option is to add a scheduled task to start the service if it's not running.

NiceGuyIT commented 1 year ago

Relevant Grafana issue 2060:

Similar to windows_exporter (And a bunch of other Go applications on windows) Grafana Agent can fail to start as a service on windows following a windows update or other high CPU Event during Service startup.

I wrote a fairly detailed analysis on the issue here that explains the cause but in an attempt to be brief this comes down to the way go initialises packages vs the time in which windows expects a response from an application starting as a service within 30s or 60s depending on your version of windows as per this diagram:

There's a PR linked in that issue that may be of use.

NiceGuyIT commented 1 year ago

Here's the relevant Go issue to fix the runtime: Windows service timeout during system startup.

silversword411 commented 1 year ago

...when I start on a computer, it's usually 1-2 seconds for the service to start and show as running by windows checks.

How long is it there, and how can we measure that "time to running" (I know there's a powershell measure command that might do it?)?

I'm thinking under normal conditions TRMM agent from start request to running is less than 5 seconds. Are you sure there's not other extenuating circumstances in your test there making it take longer than 30 seconds? Is it nebula network delays that might be causing TRMM to stutter?

NiceGuyIT commented 1 year ago

This happens only with high CPU usage. This thread specifically talks about this error happening only after rebooting to apply patches. This does not happen all the time. This is the first time I encountered this scenario while running TRMM for more than a year.

How long is it there, and how can we measure that "time to running" (I know there's a powershell measure command that might do it?)?

You can't measure this externally. The above thread mentioned they added an event log as the first action in main() but that was not getting called. This is because the timeout is in the actual startup of the program as explained in the Grafana issue.

'm thinking under normal conditions TRMM agent from start request to running is less than 5 seconds.

Under normal circumstances, that's true.

other extenuating circumstances in your test there making it take longer than 30 seconds?

Installing patches after a reboot could trigger this, but not all the time.

The link in the Grafana issue (here) explains how they are able to cause this to happen by limiting the CPU to 1/4 of a CPU in Hyper-V.

This issue is to address the Go runtime initialization slowness under high CPU load, as well as identify options that can be applied to alleviate the scenario.

amidaware / rmmagent

Resiliency againt startup issues #29