amidaware / tacticalrmm

A remote monitoring & management tool, built with Django, Vue and Go.
https://docs.tacticalrmm.com

services crash when installed along-side other software #262

Closed bbrendon closed 3 years ago

bbrendon commented 3 years ago

We've been having a problem for the last few weeks where services start crashing randomly on servers. The reason this got so much attention internally is because dhcpserver kept crashing. Some of the services that crash are: nxlog, dhcpserver, Wecsvc, Schedule, lmhosts, eventlog

The most popular service to crash is nxlog

EDIT: I should add that by "crashing", I mean it didn't recover. It seems that after digging further many services crash but only a few don't recover.

This has been seen on OS: 2008r2, 2016std, win10pro, sbs2011

All of these servers also have another program that collects event logs which doesn't crash.

Timeline from today's crashes:
1/27/2021 - 09:30 - updated Tactical from 0.4.0 to 0.4.1 (agent 1.3 to 1.4)
1/27/2021 - 09:47 - Monitoring triggered 4x servers, all different customers, all at about exactly the same time.

Since this seems related to nxlog + Tactical, my next step is to check which version of nxlog these machines have and see if I can predict future crashes based on nxlog being installed and its version.

bbrendon commented 3 years ago

After looking through the System event logs on h-dc2 a bunch of services are stopped immediately after the tactical agent stops which I'm presuming is because of the 1.3.0 to 1.4.0 agent update. Many are "stopped" and many are terminated unexpectedly. By "many", I mean about 20+ services. Somehow the server didn't totally melt-down.

After all the stops (error and informational events), tactical starts and other services begin starting again. But it seems not all services recover fully.

bbrendon commented 3 years ago

I inspected the System Event log on h-dc1 (does not have nxlog). It looks much better. No errors. There are a few services that start and stop around the time of the tactical update but they seem to be doing it by design. Services restarted: windows module installer service, appxsvc, diagnostic host service

No critical services were restarted.

rtwright68 commented 3 years ago

Funny, we have been seeing the same exact thing. The same services you mentioned above are what we are observing. I haven't updated yet to 0.4.1 since I have some agents that haven't updated from 1.12.

bbrendon commented 3 years ago

@rtwright68 Are you running nxlog as well? This has been going on for a few weeks. It's not specific to agent 1.4.0

bbrendon commented 3 years ago

I just saw someone post this in the chat. This looks exactly like it! Well, not exactly, but crazy similar.

image

tektrak commented 3 years ago

I've seen the same thing. Didn't really notice the details until last night when updating Tactical RMM Agents since the updates run in the background. However, the log file reveals this issue. One server, a Windows Remote Desktop Host, showed some files in use by a number of services. The updater tried to kill the service processes, but had problems, aborted the install, and rolled back. However, a number of services did suffer in the process. Attached is the contents of the tacticalrmm.txt file from C:\Windows\temp\tacticalrmmxxxxxxxxx (although the line endings of this file may have changed in the transfers between systems). tacticalrmm.txt

It seems like it would be preferable to delay completion of the upgrade until after a reboot rather than killing off things that might be using the files.

By the way, we are not running nxlog.

wh1te909 commented 3 years ago

ive changed the inno setup executable in agent 1.4.1 to restart applications that were closed during an update. as a test can you guys please try the following:

un-check auto agent update in Settings > Global Settings
update rmm to 0.4.2 if not done so already via update.sh script
download agent 1.4.1 from https://github.com/wh1te909/rmmagent/releases/download/v1.4.1/winagent-v1.4.1.exe and put it somewhere on your agents filesystem
then open cmd as admin, cd to the directory of the exe and call it like this please

winagent-v1.4.1.exe /VERYSILENT /SUPPRESSMSGBOXES /LOG=test123.txt

make note of the services that were being stopped but not restarting and see if now they restart, and also paste the test123.txt here for me to see. thx.

tektrak commented 3 years ago

On the client server running Windows Server 2012 R2 Standard that had problems last time, I ran the attempted upgrade from 1.4.0 to 1.4.1 manually as described above. The upgrade failed and afterwards some services were not restarted. Here is the status of the services after this attempt for the services indicated in the log file attached:

Status   Name               DisplayName
------   ----               -----------
Stopped  Audiosrv           Windows Audio
Running  Dhcp               DHCP Client
Running  EventLog           Windows Event Log
Running  lmhosts            TCP/IP NetBIOS Helper
Stopped  Wcmsvc             Windows Connection Manager
Stopped  iphlpsvc           IP Helper
Running  WinHttpAutoProx... WinHTTP Web Proxy Auto-Discovery Se...
Running  netprofm           Network List Service
Running  NlaSvc             Network Location Awareness

After noting the status, I manually restarted the stopped services without issue. test123.txt

wh1te909 commented 3 years ago

@tektrak thanks. can u try now same exe, call it like this plz

winagent-v1.4.1.exe /VERYSILENT /FORCECLOSEAPPLICATIONS /LOG=testforceclose.txt

tektrak commented 3 years ago

The agent seems to have updated now. Attached is the log file. testforceclose.txt

wh1te909 commented 3 years ago

thanks. seems this time there were no applications in use so can you keep trying until u can get the log to show RestartManager found an application using one of our files ... and then i want to see if it will restart them. the original log file says it was aborted, which is the default when using the /SUPPRESSMSGBOXES flag so was hoping that removing that flag would make it retry

Although i still dont understand why it's saying it found applications in use. the inno setup exe stops and kills all tacticalrmm.exe processes before it starts to do the update so not sure what's going on. tactical does not interact with any of those services, especially nxlog, never even heard of that. and i am not able to reproduce on any of my agents

tektrak commented 3 years ago

I'll work through other updates and note any issues.

I just tried a Windows 10 Pro workstation that seemed to be stuck on v1.1.11. So I tried running

winagent-v1.1.2.exe /VERYSILENT /SUPPRESSMSGBOXES /LOG=log-v1.1.2.txt

It failed. Attached is the log. log-v1.1.2.txt

wh1te909 commented 3 years ago

u need to use 1.1.12, not 1.1.2

tektrak commented 3 years ago

Oops. I'll retry that one.

But here is the output from a Windows Server 2016 Standard that is a Windows Remote Desktop Host being upgraded from agent v1.4.0 to v1.4.1

.\winagent-v1.4.1.exe /VERYSILENT /SUPPRESSMSGBOXES /LOG=log-v1.4.1.txt

Here are the services that got stopped:

diff *before2* *after2*
10c10
< Running  Appinfo            Application Information               
---
> Stopped  Appinfo            Application Information               
14c14
< Running  AppXSvc            AppX Deployment Service (AppXSVC)     
---
> Stopped  AppXSvc            AppX Deployment Service (AppXSVC)     
30c30
< Running  CertPropSvc        Certificate Propagation               
---
> Stopped  CertPropSvc        Certificate Propagation               
39c39
< Stopped  DeviceAssociati... Device Association Service            
---
> Running  DeviceAssociati... Device Association Service            
61c61
< Stopped  fdPHost            Function Discovery Provider Host      
---
> Running  fdPHost            Function Discovery Provider Host      
74c74
< Running  iphlpsvc           IP Helper                             
---
> Stopped  iphlpsvc           IP Helper                             
92c92
< Running  MSMQ               Message Queuing                       
---
> Stopped  MSMQ               Message Queuing                       
98c98
< Running  NetMsmqActivator   Net.Msmq Listener Adapter             
---
> Stopped  NetMsmqActivator   Net.Msmq Listener Adapter             
149c149
< Running  SENS               System Event Notification Service     
---
> Stopped  SENS               System Event Notification Service     
239c239
< Running  WinRM              Windows Remote Management (WS-Manag...
---
> Stopped  WinRM              Windows Remote Management (WS-Manag...
241c241
< Running  wlidsvc            Microsoft Account Sign-in Assistant   
---
> Stopped  wlidsvc            Microsoft Account Sign-in Assistant   
244c244
< Running  WpnService         Windows Push Notifications System S...
---
> Stopped  WpnService         Windows Push Notifications System S...

log-v1.4.1.txt
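
The before/after comparison above can also be scripted. The following is a minimal sketch, assuming both snapshots are saved text in the default `Get-Service` table layout shown in this thread (Status, Name, DisplayName). The function names `parse_services` and `newly_stopped` are hypothetical and are not part of Tactical RMM.

```python
import re

# Matches one data row of the table, e.g.
# "Running  iphlpsvc           IP Helper"
# Only Running/Stopped rows are considered; header and separator rows do not match.
ROW = re.compile(r"^(Running|Stopped)\s+(\S+)\s+(.*\S)\s*$")


def parse_services(text):
    """Return {service name: status} parsed from a saved service table."""
    services = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            status, name, _display_name = match.groups()
            services[name] = status
    return services


def newly_stopped(before_text, after_text):
    """List services that were Running before and are Stopped after."""
    before = parse_services(before_text)
    after = parse_services(after_text)
    return sorted(
        name
        for name, status in after.items()
        if status == "Stopped" and before.get(name) == "Running"
    )
```

Services that went from Stopped to Running (such as fdPHost in the diff above) are deliberately ignored, since only the ones that failed to come back matter here.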

wh1te909 commented 3 years ago

replace /SUPPRESSMSGBOXES with /FORCECLOSEAPPLICATIONS when u call the exe. suppress will by default abort. i want to see if without that flag it will restart them

bbrendon commented 3 years ago

Ran.... winagent-v1.4.1.exe /VERYSILENT /FORCECLOSEAPPLICATIONS /LOG=log-v1.4.1.txt

After a few seconds, this dialogue box appeared.

image

...I tried selecting "try again", but that didn't do anything so I selected ignore and continue.

After all was said and done, the list of automatic services that were running before and after the agent upgrade did not change. The list is below. It appears though that some of these should be running and were not. I think this was because the server was never rebooted after the last agent upgrade debacle.

clr_optimization_v4.0.30319_32
clr_optimization_v4.0.30319_64
lmhosts
Schedule
SolutionreachService
sppsvc
Wecsvc

Log below.

2021-01-29 21:30:30.897   Log opened. (Time zone: UTC-08:00)
2021-01-29 21:30:30.897   Setup version: Inno Setup version 6.1.2
2021-01-29 21:30:30.897   Original Setup EXE: C:\Users\user\Downloads\winagent-v1.4.1.exe
2021-01-29 21:30:30.897   Setup command line: /SL5="$4038A,3456399,824832,C:\Users\user\Downloads\winagent-v1.4.1.exe" /VERYSILENT /FORCECLOSEAPPLICATIONS /LOG=log-v1.4.1.txt
2021-01-29 21:30:30.897   Windows version: 6.1.7601 SP1  (NT platform: Yes)
2021-01-29 21:30:30.897   64-bit Windows: Yes
2021-01-29 21:30:30.897   Processor architecture: x64
2021-01-29 21:30:30.897   User privileges: Administrative
2021-01-29 21:30:30.900   Administrative install mode: Yes
2021-01-29 21:30:30.900   Install mode root key: HKEY_LOCAL_MACHINE
2021-01-29 21:30:30.900   64-bit install mode: No
2021-01-29 21:30:30.905   Created temporary directory: C:\Users\user\AppData\Local\Temp\is-FH4M2.tmp
2021-01-29 21:30:33.586   Stop tacticalagent: 0
2021-01-29 21:30:36.448   Stop tacticalrpc: 0
2021-01-29 21:30:36.656   taskkill: 128
2021-01-29 21:30:36.757   Found 2 files to register with RestartManager.
2021-01-29 21:30:36.757   Calling RestartManager's RmGetList.
2021-01-29 21:30:36.803   RmGetList finished successfully.
2021-01-29 21:30:36.803   RestartManager found an application using one of our files: DHCP Client
2021-01-29 21:30:36.803   RestartManager found an application using one of our files: Windows Event Log
2021-01-29 21:30:36.803   RestartManager found an application using one of our files: WinHTTP Web Proxy Auto-Discovery Service
2021-01-29 21:30:36.803   RestartManager found an application using one of our files: nxlog
2021-01-29 21:30:36.803   RestartManager found an application using one of our files: DHCP Server
2021-01-29 21:30:36.803   Can use RestartManager to avoid reboot? Yes (0)
2021-01-29 21:30:36.814   Starting the installation process.
2021-01-29 21:30:36.816   Shutting down applications using our files. (forced)
2021-01-29 21:30:49.358   Some applications could not be shut down.
2021-01-29 21:30:49.358   Message box (Abort/Retry/Ignore):
                          Setup was unable to automatically close all applications. It is recommended that you close all applications using files that need to be updated by Setup before continuing.
2021-01-29 21:31:11.630   User chose Retry.
2021-01-29 21:31:11.630   Retrying to shut down applications using our files. (forced)
2021-01-29 21:31:11.647   Some applications could not be shut down.
2021-01-29 21:31:11.647   Message box (Abort/Retry/Ignore):
                          Setup was unable to automatically close all applications. It is recommended that you close all applications using files that need to be updated by Setup before continuing.
2021-01-29 21:31:13.687   User chose Ignore.
2021-01-29 21:31:13.688   Directory for uninstall files: C:\Program Files\TacticalAgent
2021-01-29 21:31:13.688   Will append to existing uninstall log: C:\Program Files\TacticalAgent\unins000.dat
2021-01-29 21:31:13.688   -- File entry --
2021-01-29 21:31:13.688   Dest filename: C:\Program Files\TacticalAgent\unins000.exe
2021-01-29 21:31:13.695   Time stamp of our file: 2021-01-29 21:30:30.551
2021-01-29 21:31:13.695   Dest file exists.
2021-01-29 21:31:13.695   Time stamp of existing file: 2021-01-27 09:35:20.972
2021-01-29 21:31:13.695   Version of our file: 51.1052.0.0
2021-01-29 21:31:13.719   Version of existing file: 51.1052.0.0
2021-01-29 21:31:13.719   Installing the file.
2021-01-29 21:31:13.727   Leaving temporary file in place for now.
2021-01-29 21:31:13.728   -- File entry --
2021-01-29 21:31:13.728   Dest filename: C:\Program Files\TacticalAgent\tacticalrmm.exe
2021-01-29 21:31:13.728   Time stamp of our file: 2021-01-29 00:05:16.000
2021-01-29 21:31:13.728   Dest file exists.
2021-01-29 21:31:13.728   Time stamp of existing file: 2021-01-26 23:27:08.000
2021-01-29 21:31:13.728   Installing the file.
2021-01-29 21:31:14.492   Successfully installed the file.
2021-01-29 21:31:14.492   -- File entry --
2021-01-29 21:31:14.492   Dest filename: C:\Program Files\TacticalAgent\nssm.exe
2021-01-29 21:31:14.493   Time stamp of our file: 2020-12-15 01:27:00.000
2021-01-29 21:31:14.493   Dest file exists.
2021-01-29 21:31:14.493   Time stamp of existing file: 2020-12-15 01:27:00.000
2021-01-29 21:31:14.493   Installing the file.
2021-01-29 21:31:14.528   Successfully installed the file.
2021-01-29 21:31:14.528   -- Icon entry --
2021-01-29 21:31:14.528   Dest filename: C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Tactical RMM Agent.lnk
2021-01-29 21:31:14.529   Creating the icon.
2021-01-29 21:31:14.638   Successfully created the icon.
2021-01-29 21:31:14.641   Saving uninstall information.
2021-01-29 21:31:14.641   Renaming uninstaller.
2021-01-29 21:31:14.643   Deleting uninstall key left over from previous administrative 32-bit install.
2021-01-29 21:31:14.645   Creating new uninstall key: HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Uninstall\{0D34D278-5FAF-4159-A4A0-4E2D2C08139D}_is1
2021-01-29 21:31:14.645   Writing uninstall key values.
2021-01-29 21:31:14.652   Detected previous non administrative install? No
2021-01-29 21:31:14.652   Detected previous administrative 64-bit install? No
2021-01-29 21:31:14.779   Installation process succeeded.
2021-01-29 21:31:14.779   Attempting to restart applications.
2021-01-29 21:31:27.464   Need to restart Windows? No
2021-01-29 21:31:27.468   Deinitializing Setup.
2021-01-29 21:31:31.747   Start tacticalagent: 0
2021-01-29 21:31:33.943   Start tacticalrpc: 0
2021-01-29 21:31:33.949   Log closed.

tektrak commented 3 years ago

Here is a Windows Server 2012 R2 Standard that was upgraded successfully from v1.4.0 to v1.4.1.

.\winagent-v1.4.1.exe /VERYSILENT /FORCECLOSEAPPLICATIONS /LOG=log-v1.4.1.txt

All of the services mentioned:

2021-01-30 00:15:43.749   RestartManager found an application using one of our files: DHCP Client
2021-01-30 00:15:43.749   RestartManager found an application using one of our files: Windows Event Log
2021-01-30 00:15:43.749   RestartManager found an application using one of our files: TCP/IP NetBIOS Helper
2021-01-30 00:15:43.749   RestartManager found an application using one of our files: Windows Connection Manager
2021-01-30 00:15:43.749   RestartManager found an application using one of our files: IP Helper
2021-01-30 00:15:43.749   RestartManager found an application using one of our files: WinHTTP Web Proxy Auto-Discovery Service
2021-01-30 00:15:43.749   RestartManager found an application using one of our files: Network List Service
2021-01-30 00:15:43.749   RestartManager found an application using one of our files: Network Location Awareness
2021-01-30 00:15:43.749   RestartManager found an application using one of our files: Sage Service Host (v20.3)

were running after the upgrade. log-v1.4.1.txt
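
For anyone collecting many of these installer logs, the relevant lines can be pulled out automatically. This is a rough sketch that relies only on the message strings visible in the logs posted in this thread. The function name `summarize_log` and the shape of its return value are hypothetical.

```python
import re

# Message text copied from the Inno Setup logs posted in this thread.
IN_USE = re.compile(
    r"RestartManager found an application using one of our files: (.+)$"
)


def summarize_log(text):
    """Summarize which applications were reported in use and how the run ended."""
    in_use = []
    shutdown_failed = False
    succeeded = False
    for line in text.splitlines():
        match = IN_USE.search(line)
        if match:
            in_use.append(match.group(1).strip())
        elif "Some applications could not be shut down." in line:
            shutdown_failed = True
        elif "Installation process succeeded." in line:
            succeeded = True
    return {
        "in_use": in_use,
        "shutdown_failed": shutdown_failed,
        "succeeded": succeeded,
    }
```

Running this over the logs from several machines makes it easy to see which services show up repeatedly, for example DHCP Client and Windows Event Log in both logs above.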

rtwright68 commented 3 years ago

log-v141-1.txt v141-2.txt

Ran on a couple Windows 2019 VMs that were stuck on 1.1.12. Still showing the old version number in the agent dashboard.

rtwright68 commented 3 years ago

One other piece of info. Both of the agents I attempted the 1.4.1 update on are currently yellow. All services are up and running at this point, attempted a reboot on one of the agents.

tektrak commented 3 years ago

@rtwright68 I had some older agents on v1.1.11 and v1.1.12. The v1.1.11 agent I first updated to v1.1.12. Then updated the v1.1.12 agents to v1.2.0, then v1.3.0, then v1.4.1. I may not have needed to do all these intermediate steps, but I believe at least that you shouldn't skip v1.3.0 before going to v1.4.0 or v1.4.1.

wh1te909 commented 3 years ago

@tektrak yes that's correct, you always need to update incrementally based on the minor version number, so you did good

@rtwright68 you need to uninstall those agents, they are broken. a straight upgrade from 1.1.12 to 1.4.1 will break the agents
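
The incremental rule described above (do not skip a minor version) can be sketched as a small path calculation. This is an illustration, not how the agent or server actually picks updates. The function name `upgrade_path` is hypothetical, and choosing the newest patch release of each minor version is an assumption that happens to match the path tektrak followed.

```python
def parse(version):
    """Turn '1.4.1' or 'v1.4.1' into a comparable tuple like (1, 4, 1)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def upgrade_path(current, target, releases):
    """Return the versions to install, in order, stepping one minor at a time.

    Assumption: for each minor version between current and target, install
    the newest available patch release that does not exceed the target.
    """
    cur, tgt = parse(current), parse(target)
    newest_per_minor = {}
    for release in releases:
        parsed = parse(release)
        if cur < parsed <= tgt:
            key = parsed[:2]  # (major, minor)
            if key not in newest_per_minor or parsed > newest_per_minor[key]:
                newest_per_minor[key] = parsed
    ordered = sorted(newest_per_minor.values())
    return [".".join(str(n) for n in parsed) for parsed in ordered]


# Versions mentioned in this thread, used here only as example input.
KNOWN = ["1.1.11", "1.1.12", "1.2.0", "1.3.0", "1.4.0", "1.4.1"]
print(upgrade_path("1.1.11", "1.4.1", KNOWN))
```

With the example list, an agent on 1.1.11 is stepped through 1.1.12, 1.2.0 and 1.3.0 before reaching 1.4.1, rather than jumping straight to the target.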

wh1te909 commented 3 years ago

are you guys able to try this with this exe please?

winagent-v1.3.555.zip

please run it like this and then upload the txt file

winagent-v1.3.555.exe /VERYSILENT /LOG=13555.txt

then wait like 15 seconds and check if the exe was replaced by running

"C:\Program Files\TacticalAgent\tacticalrmm.exe" -version

and see if it shows version 1.3.555

ive changed this exe to not close and not restart any services, since my theory is that those services are not in use anyway by tactical so no point in trying to close them
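
The wait-then-check step above could be scripted roughly as follows. This is a sketch only: the exact text printed by `tacticalrmm.exe -version` is not shown in this thread, so the code assumes nothing more than that the version number appears somewhere in the output. The function names `reports_version` and `check_agent` are hypothetical.

```python
import re
import subprocess
import time

# Install path as given in the thread.
AGENT = r"C:\Program Files\TacticalAgent\tacticalrmm.exe"


def reports_version(output, expected):
    """True if the expected version appears as a whole dotted number.

    Matching whole tokens avoids treating '1.3.5' as a match for '1.3.555'.
    """
    return expected in re.findall(r"\d+(?:\.\d+)+", output)


def check_agent(expected, wait_seconds=15):
    """Wait for the installer to finish, then ask the agent for its version."""
    time.sleep(wait_seconds)
    result = subprocess.run(
        [AGENT, "-version"], capture_output=True, text=True
    )
    # Assumption: the version may be printed on stdout or stderr.
    return reports_version(result.stdout + result.stderr, expected)
```

On a test machine this would be called as `check_agent("1.3.555")` after launching the installer.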

tektrak commented 3 years ago

Here's the first upgrade I tried on the first Windows Server 2012 R2 Standard server I mentioned above. It upgraded fine to the new version and connected to the TRMM server. The log is attached.

I'll also try it on a few more previously troublesome systems. 13555.txt

tektrak commented 3 years ago

And here's another run on the Windows Server 2016 Standard server mentioned above. It also upgraded fine to the new version and connected to the TRMM server. The log is attached. 13555.txt

No services were harmed in the making of these upgrades. Thanks!

tektrak commented 3 years ago

Just ran the v1.3.555 agent upgrade test on 3 more servers including a Microsoft Windows Server Core 2016 without incident.

wh1te909 commented 3 years ago

@tektrak that's amazing wow! really hope this is the fix lol @bbrendon can u try plz? if it works for you i'll get this released asap

bbrendon commented 3 years ago

Seems like it worked. Log: https://pastebin.com/QpNs3m34
It just had an issue with nssm.exe

No service issues that I could see.

wh1te909 commented 3 years ago

@bbrendon thanks. I'll be getting rid of nssm eventually since it's no longer actively maintained, for now ive just changed the updater to not attempt to replace that file.

I'll be releasing an update to rmm shortly and with it agent v1.4.2. Since I changed the function inside the agent that handles agent updates, when your agents update to 1.4.2 they will probably still attempt to force close those services, since they will still be using the update code from 1.4.1. So the issue won't be fully resolved until the next agent after 1.4.2, and you might still need to manually update agents until they are all on 1.4.2.

tektrak commented 3 years ago

I have installed the rmm server update and more than half the agents (the half that are online now, including the servers) are now on the new v1.4.2. They seem to have updated without incident. I updated two servers via command line so that I could watch the process in more detail. The rest were manually updated via the web interface, as I currently have agent auto update disabled. Thanks for your efforts improving the update process!

bbrendon commented 3 years ago

Same here. No sirens went off. Looking good.