Graylog2 / collector-sidecar

Manage log collectors through Graylog
https://www.graylog.org/
Other
268 stars 57 forks source link

Problem with Windows Service and Reboot #142

Closed planki closed 7 years ago

planki commented 7 years ago

Problem description

If you do a reboot from a Windows Server, the collector will not shudown cleanly and doesn't connect to to Graylog after the reboot. We can reproduce this on all our Windows Servers 2012 R2

Logfile:

time="2017-02-13T03:11:23+01:00" level=info msg="Stopping signal distributor" 
time="2017-02-13T03:11:23+01:00" level=info msg="[nxlog] Stopping" 
time="2017-02-13T03:11:23+01:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress." 
time="2017-02-13T03:11:23+01:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress." 
time="2017-02-13T03:11:48+01:00" level=info msg="Starting signal distributor" 
time="2017-02-13T03:11:48+01:00" level=info msg="[nxlog] Stopping" 
time="2017-02-13T03:11:49+01:00" level=error msg="[nxlog] Could not send stop control: The service has not been started." 
time="2017-02-13T03:12:38+01:00" level=info msg="[nxlog] Configuration change detected, rewriting configuration file." 
time="2017-02-13T03:12:39+01:00" level=info msg="[nxlog] Stopping" 
time="2017-02-13T03:12:39+01:00" level=error msg="[nxlog] Could not send stop control: The service has not been started." 
time="2017-02-13T03:13:52+01:00" level=info msg="[nxlog] Starting (svc driver)" 
time="2017-02-13T03:13:52+01:00" level=info msg="[nxlog] Starting (svc driver)" 
time="2017-02-13T03:13:52+01:00" level=error msg="[nxlog] Could not start service: An instance of the service is already running." 

Steps to reproduce the problem

  1. Reboot the Windows server.

Environment

mariussturm commented 7 years ago

Thanks for the feedback! During installation, did you run the NXLog uninstall command to prevent a standalone service of NXLog? See: http://docs.graylog.org/en/2.2/pages/collector_sidecar.html#id4

C:\Program Files (x86)\nxlog\nxlog -u

Otherwise two instances of NXLog are running which doesn't work.

planki commented 7 years ago

Hi,

Yes, I have uninstalled the nxlog service. I just have done now a more detailed test with one of our testservers.

I have installed sidecar 0.1.0-rc1 above an existing installation. Both service (sidecar and nxlog backend) services were stopped and now process were running. This are my results:

`Fresh install:

time="2017-02-13T15:37:06+01:00" level=info msg="Starting signal distributor" time="2017-02-13T15:37:06+01:00" level=info msg="[nxlog] Stopping" time="2017-02-13T15:37:06+01:00" level=error msg="[nxlog] Could not send stop control: The service has not been started." time="2017-02-13T15:37:07+01:00" level=info msg="[nxlog] Starting (svc driver)" time="2017-02-13T15:37:15+01:00" level=info msg="[nxlog] Configuration change detected, rewriting configuration file." time="2017-02-13T15:37:15+01:00" level=info msg="[nxlog] Stopping" time="2017-02-13T15:37:18+01:00" level=info msg="[nxlog] Starting (svc driver)"

Collector will registered at Graylog

Inialize Reboot:

time="2017-02-13T15:38:36+01:00" level=info msg="Stopping signal distributor" time="2017-02-13T15:38:36+01:00" level=info msg="[nxlog] Stopping" time="2017-02-13T15:38:36+01:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress." time="2017-02-13T15:38:36+01:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress."

After Reboot:

time="2017-02-13T15:38:55+01:00" level=info msg="Starting signal distributor" time="2017-02-13T15:38:55+01:00" level=info msg="[nxlog] Stopping" time="2017-02-13T15:38:55+01:00" level=error msg="[nxlog] Could not send stop control: The service has not been started." time="2017-02-13T15:39:20+01:00" level=info msg="[nxlog] Configuration change detected, rewriting configuration file." time="2017-02-13T15:39:20+01:00" level=info msg="[nxlog] Stopping" time="2017-02-13T15:39:20+01:00" level=error msg="[nxlog] Could not send stop control: The service has not been started." time="2017-02-13T15:40:42+01:00" level=info msg="[nxlog] Starting (svc driver)" time="2017-02-13T15:40:42+01:00" level=info msg="[nxlog] Starting (svc driver)" time="2017-02-13T15:40:42+01:00" level=error msg="[nxlog] Could not start service: An instance of the service is already running."

Collector will have status failed at Graylog. Although both services (sidecar and nxlog backend) are running?`

Maybe it's only a timing problem or you have to check, if the service is already running/stopped before you try to start/stop the service? Although I don't understand, why the collector will have status failed at Graylog, when both services are running.

Thanks for your support

mariussturm commented 7 years ago

Odd, it clearly says there is a second instance running. Somebody has to start it. Just to double check, could you share the output of the powershell command: Get-Service | Sort-Object status,displayname

planki commented 7 years ago

Please see attched file - thanks

services.txt

mariussturm commented 7 years ago

Ok thanks, I can reproduce this now. Working on a fix.

planki commented 7 years ago

Thanks, that sounds good :+1:

mariussturm commented 7 years ago

@planki could you please test this with 0.1.0-rc.2 again? https://github.com/Graylog2/collector-sidecar/releases/tag/0.1.0-rc.2 If all is fine thats the final 0.1.0 then.

planki commented 7 years ago

Hi,

I just checked it, and now I will get an error during the update, if the service is running - you have to manualy stop the service before?

Please see attached screenshot.

After the reboot from the server I will get unfortunately the same errors and the collector will be down in the Graylog page ...

time="2017-03-13T08:32:03+01:00" level=info msg="Starting signal distributor" time="2017-03-13T08:32:03+01:00" level=info msg="[nxlog] Stopping" time="2017-03-13T08:32:03+01:00" level=error msg="[nxlog] Could not send stop control: The service has not been started." time="2017-03-13T08:32:13+01:00" level=info msg="[nxlog] Configuration change detected, rewriting configuration file." time="2017-03-13T08:32:35+01:00" level=info msg="[nxlog] Starting (svc driver)" time="2017-03-13T08:32:36+01:00" level=info msg="[nxlog] Stopping" time="2017-03-13T08:32:36+01:00" level=error msg="[nxlog] Could not send stop control: The requested control is not valid for this service." time="2017-03-13T08:32:36+01:00" level=info msg="[nxlog] Starting (svc driver)" time="2017-03-13T08:32:37+01:00" level=error msg="[nxlog] Could not start service: An instance of the service is already running."

Thanks and regards,

Martin

2017-03-10 11:00 GMT+01:00 Marius Sturm notifications@github.com:

@planki https://github.com/planki could you please test this with 0.1.0-rc.2 again? https://github.com/Graylog2/collector-sidecar/releases/ tag/0.1.0-rc.2 If all is fine thats the final 0.1.0 then.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Graylog2/collector-sidecar/issues/142#issuecomment-285627531, or mute the thread https://github.com/notifications/unsubscribe-auth/ARzHRg_rVSlEvnpQENAsINtcgsLNnxGXks5rkR8rgaJpZM4L-1V_ .

mariussturm commented 7 years ago

Ok, I will check the update prodecure. Could you please stop and deregister the service before doing the update, like:

Could be that you still run the old version.

planki commented 7 years ago

Hi,

The Graylog-collector-sidecar.exe was replaced and date modified with 10.03.2017 10:52 .. so this should be the RC2 version. I am pretty sure, that with service stop and uninstall the update will work.

I can do further tests today afternoon.

Regards,

Martin

2017-03-13 9:19 GMT+01:00 Marius Sturm notifications@github.com:

Ok, I will check the update prodecure. Could you please stop and deregister the service before doing the update, like:

  • graylog-collector-sidecar -service stop
  • graylog-collector-sidecar -service uninstall
  • do the update

Could be that you still run the old version.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Graylog2/collector-sidecar/issues/142#issuecomment-286041184, or mute the thread https://github.com/notifications/unsubscribe-auth/ARzHRuecrjtWKMf4Pc0W0l4wVC0ZA-kFks5rlPv4gaJpZM4L-1V_ .

jlad00 commented 7 years ago

Hi,

I'm seeing this same issue in my environment, but on version 0.1.3 of the sidecar collector. Was this resolved at 0.1.0?

joschi commented 7 years ago

@jlad00 Please open a new bug report if you're seeing similar or identical errors with Graylog Collector Sidecar 0.1.3.

eyece commented 6 years ago

I'm encountering this exact problem with Sidecar 0.1.5 and NXLog-CE 2.9.1716. After a server is rebooted, its collector status changes to "Failing," but logs continue to be shipped to Graylog 2.4.3. I've confirmed that the default NXLog service is not running on any servers (it is disabled). I can reproduce the issue on Windows Server 2008 R2 simply by rebooting it.

time="2018-03-18T05:04:44-07:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress." time="2018-03-18T05:04:45-07:00" level=error msg="[nxlog] Backend finished unexpectedly, sending restart signal" time="2018-03-18T05:04:45-07:00" level=info msg="[nxlog] Stopping" time="2018-03-18T05:04:45-07:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress." time="2018-03-18T05:04:45-07:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress." time="2018-03-18T05:04:45-07:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress." time="2018-03-18T05:04:45-07:00" level=error msg="[nxlog] Failed to connect to service manager: [A system shutdown is in progress.]" time="2018-03-18T05:04:45-07:00" level=info msg="Stopping signal distributor" time="2018-03-18T05:04:45-07:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress." time="2018-03-18T05:04:45-07:00" level=info msg="[nxlog] Stopping" time="2018-03-18T05:04:45-07:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress." time="2018-03-18T05:05:16-07:00" level=info msg="Starting signal distributor" time="2018-03-18T05:05:16-07:00" level=info msg="[nxlog] Stopping" time="2018-03-18T05:05:16-07:00" level=error msg="[nxlog] Could not send stop control: The service has not been started." time="2018-03-18T05:05:27-07:00" level=info msg="[nxlog] Configuration change detected, rewriting configuration file." time="2018-03-18T05:05:32-07:00" level=info msg="[nxlog] Starting (svc driver)" time="2018-03-18T05:05:32-07:00" level=info msg="[nxlog] Stopping" time="2018-03-18T05:05:32-07:00" level=error msg="[nxlog] Could not send stop control: The requested control is not valid for this service." time="2018-03-18T05:05:32-07:00" level=info msg="[nxlog] Starting (svc driver)" time="2018-03-18T05:05:32-07:00" level=error msg="[nxlog] Could not start service: An instance of the service is already running." time="2018-03-18T05:47:51-07:00" level=info msg="Stopping signal distributor" time="2018-03-18T05:47:51-07:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress." time="2018-03-18T05:47:51-07:00" level=info msg="[nxlog] Stopping" time="2018-03-18T05:47:51-07:00" level=error msg="[nxlog] Failed to connect to service manager: A system shutdown is in progress." time="2018-03-18T05:48:36-07:00" level=info msg="Starting signal distributor" time="2018-03-18T05:48:36-07:00" level=info msg="[nxlog] Stopping" time="2018-03-18T05:48:36-07:00" level=error msg="[nxlog] Could not send stop control: The service has not been started." time="2018-03-18T05:48:48-07:00" level=info msg="[nxlog] Configuration change detected, rewriting configuration file." time="2018-03-18T05:48:55-07:00" level=info msg="[nxlog] Starting (svc driver)" time="2018-03-18T05:48:55-07:00" level=info msg="[nxlog] Stopping" time="2018-03-18T05:48:55-07:00" level=error msg="[nxlog] Could not send stop control: The requested control is not valid for this service." time="2018-03-18T05:48:55-07:00" level=info msg="[nxlog] Starting (svc driver)" time="2018-03-18T05:48:55-07:00" level=error msg="[nxlog] Could not start service: An instance of the service is already running."