Investigate why Tentacle fails to start after SCCM updates/restart combo

vanessalove commented 8 years ago

I have two cases where Tentacle fails to start after SCCM is used to run windows updates and restart the server. The Tentacle service doesn't start and the logs are clear.

Concerned we will have more reports of this over time.

Current workaround to add a dependency.

Source: https://secure.helpscout.net/conversation/166828848/4665/?folderId=557083 https://secure.helpscout.net/conversation/170025070/4900/?folderId=557082

Update We have been discussing this within the team and have come up with a possible solution outlined below:

We'd most likely want to do this as a scheduled task as making it a .NET windows service would most likely experience the same issues as the Tentacle service currently does
An intermediary option could be to update the recovery options to retry if it fails to start (we don't currently do this automatically, perhaps we should)
We should add more diagnostics logging to the Tentacle to try and find the root cause of the issue with the Tentacle failing to start
- If we could determine why/what stopped the service
- Maybe auto capture a crash dump of the Tentacle service if it failed to start
We'll need to consider in this is how to handle the situation where the service is deliberately stopped.
- For example, the DSC allows you to set a state of Stopped, which will ensure the service is stopped. We don't want the watchdog service to come along and start it and then the DSC stops it and each of them tripping over each other...

Dalmirog-zz commented 8 years ago

I had a report of this exact issue about 2-3 months ago. SCCM reboot after updates, Tentacles were not coming back.

Cant find the ticket in tender, but customer was supposed to reply with some extra information.

Edit: Found it: http://help.octopusdeploy.com/discussions/problems/42446

On this ticket there are 2 users reporting this. One of them initiated the reboots from SCOM, and the other one had their machines automatically rebooted after updates/hardware failure. None of them rebooted the machines manually.

gurufordy commented 8 years ago

Just to help out as I've got the same issue on lots of servers when they reboot after an SCCM update.

Event Viewer: 1) A timeout was reached (30000 milliseconds) while waiting for the OctopusDeploy Tentacle service to connect. 2) The OctopusDeploy Tentacle service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.

Attached is the server log to help :)

OctopusTentacle.0.txt

Starting the service manually worked the next day.

Server Version = 3.2.16

Dalmirog-zz commented 8 years ago

Another report: https://secure.helpscout.net/conversation/182593110/5708

eugeneromero commented 8 years ago

+1 for this. We have an environment with almost 200 servers. Tentacles fail to come up after SCCM reboots. If started manually they work fine. Our logs basically look the same as the others already posted. Octopus and Tentacle version is 3.3.1. Let me know if there is any other info I can provide.

jeffpardy commented 8 years ago

Current workaround to add a dependency.

@vanessalove What is the dependent service for the workaround?

eugeneromero commented 8 years ago

Has there been any progress made on this issue?

vanessalove commented 8 years ago

@jlbrown the dependency I've been told a few customers have used successfully is the W3SCV service. ie: sc config "OctopusDeploy Tentacle" depend= Workstation/TCPIP/W3SVC

bobmein commented 8 years ago

@vanessalove Is there any more troubleshooting you need from us to identify the bug and work toward resolving it?

richjrader commented 8 years ago

+1 having agents down on multiple servers randomly

matsomahoney commented 8 years ago

+1 having agents go down and not come up after regular windows updates or reboots

matt-richardson commented 8 years ago

I'm just starting to dig into this issue and it looks a doozy (especially as we haven't been able to reproduce as yet).

@papa-gene, @bobmein & anyone else - if you could enable trace level logging on the tentacle and the server, and send me both logs to support@octopus.com - that should give me a bit more information to help diagnose.

There is a chance that it has been fixed in 3.3.21 (specifically by this fix), so if anyone is able to upgrade and see if the issue is resolved, that would be very helpful.

I will continue to dig into the issue in the meantime.

matt-richardson commented 8 years ago

A few questions that should help narrow it down:

Does this happen on polling tentacles? Or listening tentacles? Or both?
Does the service actually show as running and not respond, or is it stopped?
Has anyone faced this issue after a normal reboot? Or is it only after a windows update / sccm reboot?

Thanks in advance!

richjrader commented 8 years ago

For my organization:

We only use listening tentacles but we are experiencing it with them.
The service on the machine shows as enabled/automatic start up but just appears as though it never started up
I have only ever seen this after a restart due to windows update. We restart our sandbox pretty often and it doesn't run into this issue.

eugeneromero commented 8 years ago

In our case:

Our infrastructure is made up of listening tentacles.
The service shows as being stopped, and it must be manually started.
Have not seen this after normal reboots, only Windows Update reboots.

Thank you for looking into it :)

matt-richardson commented 8 years ago

Thanks - that helps narrow down the investigation.

I've never used SCCM before, but I've up a test lab to debug this issue. I've set up a package that has 'reboot after run' set, and so far it seems to be starting as expected - I'll keep trying though.

Is it only happening when windows updates are pushed by SCCM? Or is it when packages are pushed as well?

Also, if anyone has any more logs to share, especially at trace level, starting from the time from machine boots, that would be very useful.

thanks!

stajs commented 8 years ago

I'm wondering if SCCM/SCOM is incidental; it sounds like they are just pushing down a Windows Update, and it's Windows Update forcing a reboot...?

The only comment that I can see that doesn't mention Windows Update is this one from @papa-gene, but then he clarifies that it is Windows Update reboots.

richjrader commented 8 years ago

It is very likely in my opinion that windows update itself is forcing the reboot that is causing the issues.My company wrote some other software to manage windows updates but it was my understanding they orcestrated SCCM under the hood.

WillParry commented 8 years ago

We have listening tentacles on Windows IIS Servers running in Azure. We're having this issue on random servers on an almost daily basis. Our servers a powered down every week day evening and powered on each week day morning by an Azure scheduled PowerShell script.

The Tentacle service seems to be the root of the problem when it occurs, as it fails to start in less than 30,000 ms. One set of servers doesn't exhibit the problem and on these servers the Tentacle service generally starts within 10 - 20 seconds of the event logging service (event id 6009). In one environment we've overcome the problem by setting the tentacle service to automatic delayed start and the startup time of 2 - 3 minutes is less of an issue than failure to start up at all.

I think we're running version 3.2.2. The OS is Windows Server 2012 R2.

matt-richardson commented 8 years ago

Unfortunately, we are still having trouble reproducing this - I've had a machine in azure rebooting on a loop, and the Tentacle has come up consistently (it rebooted successfully 111 times in a row).

I've added a bunch of extra logging into the Tentacle startup process that is logged at Trace level, and this has gone out as 3.3.25. If someone is able to do an upgrade we can investigate further.

I know that doing an upgrade to try and investigate this is less than ideal, but at this point we are struggling to determine what is going on.

Thanks in advance.

eugeneromero commented 8 years ago

Thanks again for looking into this Matt. Again, one of the main problems with this issue is that it cannot be easily reproduced. As I said at the beginning of the thread, our environment consists of ~200 tentacles. We see this issue on different instances, but never the same ones. When updates are pushed to the servers, some of the tentacles fail, but it might only be 2-8 out of the 200. However, it does seem like the issue only comes up when servers have rebooted from a Windows update. If you have a WM just rebooting, I don't think you'll see this crop up. Good to hear there is an update, I hope some useful info can come from it.

joewaid commented 8 years ago

This issue happens fairly often for me as well. I see it most often with rebooting after Windows Updates. I run a custom powershell script across 30 - 40 VMs running on a Hyper-V failover cluster and as they complete a forced reboot(Restart-Computer -Force).

I also see this happen after shutdowns and subsequent powerups initiated through the Failover cluster manager. Though this isn't consistent unfortunately.

I've switched the OctopusDeploy service to delayed start in many cases and that hasn't seemed to have made a difference.

EDIT: I'm running 3.4.0 tentacles.

richjrader commented 8 years ago

Happened again over the weekend - two servers restarted with identical hardware. One runs windows server 08 and has a lot of applications. One runs windows server 12 and has less applications. The 08 server started up the tentacle fine. The 12 server experienced:

A timeout was reached (30000 milliseconds) while waiting for the OctopusDeploy Tentacle service to connect.

The OctopusDeploy Tentacle service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.

--EDIT-- Also the tentacle was updated to 3.3.26 prior to today in order to try and find the issue.

matt-richardson commented 8 years ago

Hi @joewaid, @hobodoompants - are you able to share the logs from the Tentacles that failed? (Either here, or to support@octopus.com.)

@joewaid if you've got a reasonably consistent repro, are you able to do some further investigation for me? If you could change the logging level on a tentacle to Trace (see instructions) and see if you can get the issue to happen again?

Thanks in advance!

richjrader commented 8 years ago

I pulled up the logs - but the tentacle never logged anything whatsoever during the timeframe it was down. I noticed it had not started two hours (11am) after the server had been started (9am) after windows updates over the weekend. The previous log entry was for the 8th and then today the first line in the log file is:

2016-10-10 11:00:50.0137 7 INFO Agent will trust Octopus servers with the thumbprint:

joewaid commented 8 years ago

@matt-richardson I have to do Windows Updates this week, I'll turn up the trace level and see what I get. I had some updates on the Hypervisor over the weekend that showed the problem, but I don't see anything in the Octopus.Log for the time frame it failed to start. I just see the following in my windows Event Log:

A timeout was reached (30000 milliseconds) while waiting for the OctopusDeploy Tentacle service to connect.

The OctopusDeploy Tentacle service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.

matt-richardson commented 8 years ago

Thanks. Let me know how it goes.

The really strange thing is that one of the first things we do is ask for a 2 minute timeout on service start - it appears that somehow its not even getting to that point in the code.

joewaid commented 8 years ago

I emailed you the logs that I captured, but unfortunately it doesn't seem to show anything happening. I also included the windows event logs before and after the errors showing that Octopus tentacle failed to start. I hope that helps.

I'm wondering if this issue is somehow limited to certain Hyper-V environments like this post http://stackoverflow.com/questions/1986292/a-timeout-was-reached-while-waiting-for-the-service-to-connect-error-after-reb

I'm tempted to try changing the servicepipetimeout on my VMs to see if that helps.

matt-richardson commented 8 years ago

Edit: Logs are at https://secure.helpscout.net/conversation/263227944/10409

Looking through the event logs you provided, @joewaid, I noticed that the Microsoft .NET Framework NGEN v4.0.30319_X64 service was during the time the tentacle attempted to start. Can you (or anyone else) take a look at some of the other servers that restarted okay, and see if that service was running on those as well? If it was not running, then we might have a suspect.

I've got a sneaking suspicion that the NGEN service is kicking in and trying to create native versions of the dll's, and preventing the service from actually starting up. (I've never come across this before, but there's a first time for everything.)

joewaid commented 8 years ago

Interestingly I looked at one of the servers where Octopus started and before the SCM told Octopus to start there are entries indicating that the NGEN entered a running state then the very next entry it entered the stopped state.

In the event logs I sent you for the machine where Octopus failed to start, the NGEN service entered the running state, a couple other services were started, Octopus failed to start, then the NGEN service entered a stopped state.

I'll check another couple machines to see if that pattern holds.

joewaid commented 8 years ago

I checked the ngen_service.log file and I am seeing some entries indicating that it is actively doing something during the time frame where the tentacle fails to start.

ngen_service.log > 10/12/2016 19:50:16.613 [496]: Service state is now SERVICE_START_PENDING 10/12/2016 19:50:16.628 [496]: Service state is now SERVICE_RUNNING 10/12/2016 19:50:16.628 [496]: SetServiceStartType(): Setting start type as 2 10/12/2016 19:56:17.972 [496]: Processing work for priority 3 10/12/2016 19:56:17.972 [496]: Assemblies to compile: 546

Octopus client had its failure to start at 7:50:49 PM .

matt-richardson commented 8 years ago

That's now looking like it may be the cause of the issue. Is anyone else on this thread able to cross-reference the service failure with the ngen_service.log file?

I will setup a test environment next week to see if I can reproduce.

matt-richardson commented 8 years ago

Unfortunately, ngen doesn't appear to be the culprit after all.

It appears that in certain circumstances where the load on the system is high at boot time, .net services can fail to start (StackOverflow has several mentions of this).

We have a few ideas on how we can handle it better, but for now the best bet is still to add a dependency to the service as mentioned above.

michaelnoonan commented 8 years ago

A customer commented on their approach: http://help.octopusdeploy.com/discussions/questions/9681

I like that the Tentacle install results in what I would consider to be a "default Windows Service", and I like that people can opt-in for an alternate configuration.

I would like to see if we can make an easier way for people to "enable" extra "robustness" features, like a default set of failure actions, or even a scheduled task that acts as a watchdog, ensuring the Tentacle Service is up and running ALL THE TIME! woof

gurufordy commented 8 years ago

Thanks will run this script against all our servers and see if it helps.

For some reason the dependency that Vanessa referenced doesn't work properly and so we had to take it off. Not sure of the reasons as we just removed it without doing any diagnosis. Will post back if we see what the problem was.

michaelnoonan commented 7 years ago

We've been putting some more thought into this based on this thread, and other customer reports in differing scenarios. It appears we won't be able to fix the root cause of the problem since the Tentacle startup is already highly optimized. And we cannot do anything about how the Windows Service Manager treats a service that fails to start in the first 30s.

If we can't fix the root cause of the problem, perhaps the best option we have is to offer a robust workaround.

For example, we could add a Scheduled Task which runs every 1 minute (configurable) where the logic in the task says:

Find all Windows Services that look like Octopus Tentacles
If a Tentacle is Running, perfecto!
If a Tentacle is Stopped, and it is set to Automatic Start or Automatic Start - Delayed, try to Start it
If a Tentacle is Stopped, and it is set to Disabled or Manual, leave it Stopped

In effect this would be like saying "I always want a Tentacle running on this machine: make it so." If Tentacle fails to start when Windows boots, the next run of the scheduled task would try to start it.

Some other complexities to consider are:

We need to handle automated Tentacle upgrades, so the Scheduled Task understands what is happening - a lock file would probably suffice here
We should make it easy to add/remove the Scheduled Task: Tentacle.exe add-watchdog --Interval 00:02:00
We should write watchdog logs to a sensible folder

I'd be interested in anyone's feedback to this approach. :)

stajs commented 7 years ago

FWIW, we ended up taking a similar approach for other service dependencies (MSMQ, SQL Server) that would occasionally not recover from a Windows Update: a Scheduled Task watchdog to check their status and start them if needed on a regular basis.

I think it's the best we can do... 😒

davidwallis commented 7 years ago

I'd rather you wrote a watchdog service rather than a crappy solution of a scheduled task.. What about setting it to delayed start?

PS this is randomly effecting us too.

davidwallis commented 7 years ago

Also the dependency option of w3svc is ok if you are deploying to web servers.. what about SQL..

Need to pick a different service that is common across all.. but off the top of my head I'm struggling to think of a common late starting service...

Having a google suggests that delayed start is 2mins after the last service to start, so might work.. Delayed Start

stajs commented 7 years ago

If your watchdog is a Windows Service, it can be affected by the same problem that causes this. A Scheduled Task has been effective in our testing.

BTW, the delayed start duration is configurable via the registry.

matt-richardson commented 7 years ago

One thing we'll need to consider in this is how to handle the situation where the service is deliberately stopped.

For example, the DSC allows you to set a state of Stopped, which will ensure the service is stopped. We don't want the watchdog service to come along and start it and then the DSC stops it and each of them tripping over each other...

hnrkndrssn commented 7 years ago

A few other notes from some discussions within the team:

We'd most likely want to do this as a scheduled task as making it a .NET windows service would most likely experience the same issues as the Tentacle service currently does
An intermediary option could be to update the recovery options to retry if it fails to start (we don't currently do this automatically, perhaps we should)
We should add more diagnostics logging to the Tentacle to try and find the root cause of the issue with the Tentacle failing to start
- If we could determine why/what stopped the service
- Maybe auto capture a crash dump of the Tentacle service if it failed to start

davidwallis commented 7 years ago

We'd most likely want to do this as a scheduled task as making it a .NET windows service would most likely experience the same issues as the Tentacle service currently does

I cant say I like the idea of a scheduled task, it's a bodge, simple as.... Ok lets call it a temporary workaround.. but probably one our security guys will frown opon. Also I've written a good few services and never had this issue. I would suggest getting the service to start quickly (IE less than 30 secs) and 'logically speaking - not technically!' then something like spawn a worker thread that carries on with the rest of the tasks.

An intermediary option could be to update the recovery options to retry if it fails to start (we don't currently do this automatically, perhaps we should)

Definitely a good first attempt, but I would also add try delayed start too, as these are built in and available options that require no dev work.

We should add more diagnostics logging to the Tentacle to try and find the root cause of the issue with the Tentacle failing to start

Definitely as other services are working fine on the boxes I see the problem with, leading me to believe/guess the octopus tentacle is waiting on other dependencies or doing something that takes longer than 30's -regardless of if the service is calling .RequestAdditionalTime()

David

stajs commented 7 years ago

I'm curious, why would a scheduled task make your security guys frown?

matt-richardson commented 7 years ago

Thanks for chiming in @davidwallis3101. We have done extensive analysis, and the delay is in the .net initialisation. Our code doesn't even get a chance to execute.

As it is, our code sets a service timeout of 2 minutes straight away, and then spawns a worker thread. The very first line of code is a Log.Write("Starting...") and that never executes.

We've got > 100k Tentacle installations and we are only seeing this in a very limited number of installations.

Its also worth pointing out that this does affect other services as well, as @stajs mentioned.

So, while we'd much prefer to have a nicer solution to this, we think the scheduled task solution will be the most effective.

That said, I'd like to hear why you think this might be a security concern?

slewis74 commented 7 years ago

Watchdog

fenneh commented 7 years ago

I've seen this crop up multiple times in the past and have now decided to go down the route of a scheduled task being deployed to servers as part of the tentacle deployment (via Octopus) which we do.

For anyone else looking to do this, I've done the following one liner.

schtasks.exe /create /tn "Octopus Tentacle Reviver" /sc minute /mo 30 /tr "NET START 'OctopusDeploy Tentacle'" /RU system /F

It'll effectively try to start the tentacle every 30 minutes, may not be suitable for some but definitely is working for us and getting us through patching windows.

Edit; worth noting that this may interfere with tentacle upgrades.

slewis74 commented 7 years ago

We've just released an update that includes a new command line option to set up a scheduled task, which is conceptually very similar to the one described above (physical mechanics are a bit different). Main difference to above is the command supports identifying instances (you can * or select specific instances if you want to). Our documentation has more detail.

gurufordy commented 7 years ago

Watchdog? Surely you could have come up with a better aquatic name instead!

Great job anyway guys as this has been on my todo list for a while and now can just do a simple upgrade to get it.

Thanks!

StephenShamakian commented 7 years ago

FWIW: Someone posted this (http://help.octopusdeploy.com/discussions/questions/9681) a few months ago and I commented as we were having the issues also! We have used the settings in the first post to help alleviate the problem. But just to be clear setting the on failure actions and/or the delayed auto-start does not fix the issue. @vanessalove suggested we turn on trace logging. We may still do that. But we are interested in trying this watchdog solution also!

MJRichardson commented 7 years ago

The recommended solution for anyone encountering this issue is to use: https://github.com/OctopusDeploy/Issues/issues/3067

OctopusDeploy / Issues

Investigate why Tentacle fails to start after SCCM updates/restart combo #2375