GoogleCloudPlatform / ops-agent

Apache License 2.0
142 stars 68 forks source link

Ops Agent Failed to restart - Intermittent #1598

Open abhigupta1207 opened 10 months ago

abhigupta1207 commented 10 months ago

NOTE: To get the best support experience for bug fixes, please go to https://cloud.google.com/support-hub and follow the instructions. In comparison, Bug reports filed in this repo only have best effort support, and do not have guaranteed response / resolution SLOs

Describe the bug A clear and concise description of what the bug is.

To Reproduce Steps to reproduce the behavior:

  1. Create a custom image with Ops Agent (2.45.0@1) baked with 3rd party application
  2. Create instance template with startup scripts which copy custom config files for Ops Agent configuration and restart ops agent
  3. Create a MIG with autoscaling enabled, when new VMs are created due to autoscaling, ops agent in some VMs fails to restart

Expected behavior Ops Agent should restart because our autoscaling is based on custom prometheus metrics fetched by ops-agent

Environment (please complete the following information):

INFO 2024-01-16T14:04:35.692183900Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: Restart-Service : Service 'Google Cloud Ops Agent (google-cloud-ops-agent)' cannot be stopped due to the following

INFO 2024-01-16T14:04:35.727365100Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: error: Cannot stop google-cloud-ops-agent-opentelemetry-collector service on computer '.'.

INFO 2024-01-16T14:04:35.743086300Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: At C:\UNICOM\Scripts\Specialize-IntelligenceGoogleOpsAgent.ps1:54 char:5

INFO 2024-01-16T14:04:35.774301700Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: + Restart-Service google-cloud-ops-agent -Force

INFO 2024-01-16T14:04:35.806250700Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: + ~~~~~~~~~

INFO 2024-01-16T14:04:35.837901700Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: + CategoryInfo : CloseError: (System.ServiceProcess.ServiceController:ServiceController) [Restart-Service

INFO 2024-01-16T14:04:35.870150600Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: ], ServiceCommandException

INFO 2024-01-16T14:04:35.896522Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1: + FullyQualifiedErrorId : CouldNotStopService,Microsoft.PowerShell.Commands.RestartServiceCommand

INFO 2024-01-16T14:04:35.931813400Z [resource.labels.instanceId: 2851173124434244865] windows-startup-script-ps1:

Additional context Add any other context about the problem here. - Google Case 49102632

jefferbrecht commented 10 months ago

We would need to see your startup scripts in order to troubleshoot this effectively.

That being said, Restart-Service is notoriously fragile if it races with a service that's already in the process of starting up, which may be the case here. You can try inserting the following line before any Restart-Service to let it complete startup before attempting to restart:

(Get-Service google-cloud-ops-agent*).WaitForStatus('Running', '00:03:00')

The '00:03:00' is a timeout so that it doesn't block forever (e.g. in case there's a bad config or some other problem preventing normal startup); adjust this timeout as you need.

Since you're already creating a custom image anyway, another thing you can try instead is change the service startup type for all Ops Agent services to Manual. This should prevent the startup race and give your script a chance to copy in the configs before it starts up. Then you would call Start-Service instead of Restart-Service after copying the configs.

abhigupta1207 commented 10 months ago

@jefferbrecht Thanks for getting back, we will try the below solution and will update you.

Since you're already creating a custom image anyway, another thing you can try instead is change the service startup type for all Ops Agent services to Manual. This should prevent the startup race and give your script a chance to copy in the configs before it starts up. Then you would call Start-Service instead of Restart-Service after copying the configs.

github-actions[bot] commented 1 week ago

This issue was marked stale due to lack of activity. It will be closed in 14 days.