2019-07-25T04:18:42.820Z DEBUG [service] service/service.go:55 Received sighup, stopping
Looks like something is sending a SIGHUP signal to the metricbeat process.
I'm seeing the same behavior on CentOS 6. This is happening in our automated testing environment, within Docker. metricbeat will start successfully, connect to kafka for output, but get SIGHUP'ed by the kernel after just a few seconds. If I manually start the service after this happens, it runs just fine.
I strace'd the process to see the source of the SIGHUP, and it's coming from the kernel:
--- SIGHUP {si_signo=SIGHUP, si_code=SI_KERNEL} ---
rt_sigprocmask(SIG_UNBLOCK, [HUP], NULL, 8) = 0
gettid() = 2332
tkill(2332, SIGHUP) = 0
--- SIGHUP {si_signo=SIGHUP, si_code=SI_TKILL, si_pid=2332, si_uid=0} ---
rt_sigaction(SIGHUP, {SIG_DFL, ~[], SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x7f3567f817e0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [HUP], NULL, 8) = 0
gettid() = 2332
tkill(2332, SIGHUP) = 0
--- SIGHUP {si_signo=SIGHUP, si_code=SI_TKILL, si_pid=2332, si_uid=0} ---
+++ killed by SIGHUP +++
However, nothing ran the initscript to stop metricbeat, which I was able to ascertain by logging into the Docker instance and adding the following line between when metricbeat was installed and when it was started:
echo "ran initscript with opts: $@" >>/tmp/mbinit.log
The result of this was the following:
ran initscript with opts: status
ran initscript with opts: start
ran initscript with opts: status
ran initscript with opts: status
ran initscript with opts: status
So for some reason the kernel is SIGHUP'ing metricbeat when it starts the first time, but not any subsequent time, and I have no idea why. I tried several 6.x and 7.x releases of metricbeat and found the same behavior.
I just switched from Docker to Vagrant and I get the same behavior on CentOS 6.10 in both. If anything, I would expect a SIGHUP to be sent to metricbeat-god instead of metricbeat.
I've ruled out selinux as it is disabled in our test environment for both Docker and Vagrant.
I should also note that I don't see this behavior at all in CentOS 7.
Some further info: I'm starting metricbeat via test-kitchen. It's making an SSH connection to run its commands, and when it runs the init script, it's starting as a foreground process (note that the below is captured in the several-second interval when the process is still running, before test-kitchen disconnects from SSH):
[kitchen@localhost ~]$ ps axc -o pid,ppid,tty,stat,time,command | grep metricbeat | grep -v grep
7731 1 pts/1 S+ 00:00:00 metricbeat-god
7732 7731 pts/1 Sl+ 00:00:00 metricbeat
Note the +, identifying it as a foreground process. Once test-kitchen disconnects from SSH, the TTY is closed and the process is killed. Note that when I SSH into the container and run the initscript, the process doesn't have the "+" next to it:
[kitchen@localhost ~]$ sudo /sbin/service metricbeat start
Starting metricbeat: [ OK ]
[kitchen@localhost ~]$ ps axc -o pid,ppid,tty,stat,time,command | grep metricbeat | grep -v grep
9719 1 pts/0 S 00:00:00 metricbeat-god
9720 9719 pts/0 Sl 00:00:00 metricbeat
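As a side note on that "+" flag: ps shows it when the process's group is the controlling terminal's foreground process group, which is exactly the group the kernel sends SIGHUP to when the TTY goes away. A hypothetical Go snippet (not part of metricbeat, just an illustration using golang.org/x/sys/unix) to check that condition from inside a process:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Open the controlling terminal, if there is one.
	tty, err := os.Open("/dev/tty")
	if err != nil {
		fmt.Println("no controlling terminal:", err)
		return
	}
	defer tty.Close()

	// Foreground process group of the controlling terminal.
	fgPgrp, err := unix.IoctlGetInt(int(tty.Fd()), unix.TIOCGPGRP)
	if err != nil {
		fmt.Println("TIOCGPGRP failed:", err)
		return
	}

	// A process whose group matches this is what ps marks with "+",
	// and it is the group that receives SIGHUP on terminal hangup.
	if fgPgrp == unix.Getpgrp() {
		fmt.Println("in the TTY's foreground process group (ps would show \"+\")")
	} else {
		fmt.Println("not in the TTY's foreground process group")
	}
}
```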
Looks like this might just be an upstream bug, at least in my case.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is still very much a problem.
Pinging @elastic/integrations-services (Team:Services)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Still a problem.
I have the same issue in k8s (using Prometheus remote write).
k8s container status:
terminated
Reason: Completed - exit code: 0
logs:
2021-08-10T13:58:55.726Z INFO [monitoring] log/log.go:144 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cgroup":{"cpu":{"stats":{"periods":301,"throttled":{"ns":31997241754,"periods":261}}},"cpuacct":{"total":{"ns":2949814152}},"memory":{"mem":{"usage":{"bytes":49082368}}}},"cpu":{"system":{"ticks":520,"time":{"ms":134}},"total":{"ticks":6460,"time":{"ms":2407},"value":6460},"user":{"ticks":5940,"time":{"ms":2273}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":211},"info":{"ephemeral_id":"fa8851e1-75dc-43c5-9cb5-ca1203e92c50","uptime":{"ms":91799}},"memstats":{"gc_next":167169808,"memory_alloc":135552736,"memory_sys":851968,"memory_total":845988144,"rss":479932416},"runtime":{"goroutines":309}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":350,"active":50,"batches":7,"total":350},"read":{"bytes":6758},"write":{"bytes":399078}},"pipeline":{"clients":1,"events":{"active":4117,"published":350,"total":350},"queue":{"acked":350}}},"metricbeat":{"prometheus":{"remote_write":{"events":350,"success":350}}},"system":{"load":{"1":2.78,"15":3.09,"5":2.54,"norm":{"1":1.39,"15":1.545,"5":1.27}}}}}}
2021-08-10T13:59:32.220Z INFO [monitoring] log/log.go:144 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cgroup":{"cpu":{"stats":{"periods":300,"throttled":{"ns":30203836939,"periods":254}}},"cpuacct":{"total":{"ns":2992549419}},"memory":{"mem":{"usage":{"bytes":39141376}}}},"cpu":{"system":{"ticks":600,"time":{"ms":166}},"total":{"ticks":8910,"time":{"ms":2956},"value":8910},"user":{"ticks":8310,"time":{"ms":2790}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":200},"info":{"ephemeral_id":"fa8851e1-75dc-43c5-9cb5-ca1203e92c50","uptime":{"ms":121798}},"memstats":{"gc_next":387447184,"memory_alloc":351301280,"memory_sys":69536776,"memory_total":1179790592,"rss":518823936},"runtime":{"goroutines":175}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":150,"active":50,"batches":3,"total":150},"read":{"bytes":2891},"write":{"bytes":154000}},"pipeline":{"clients":1,"events":{"active":4117,"published":150,"total":150},"queue":{"acked":150}}},"metricbeat":{"prometheus":{"remote_write":{"events":150,"success":150}}},"system":{"load":{"1":2.46,"15":3.04,"5":2.47,"norm":{"1":1.23,"15":1.52,"5":1.235}}}}}}
2021-08-10T13:59:34.023Z INFO [monitoring] log/log.go:152 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cgroup":{"cpu":{"cfs":{"period":{"us":100000},"quota":{"us":10000}},"id":"d33fb58de20a7da059e418dd4eba741dedf3d3e8f9445eb0de8ab6c5943b9a9b","stats":{"periods":1208,"throttled":{"ns":119198093406,"periods":1012}}},"cpuacct":{"id":"d33fb58de20a7da059e418dd4eba741dedf3d3e8f9445eb0de8ab6c5943b9a9b","total":{"ns":11980896782}},"memory":{"id":"d33fb58de20a7da059e418dd4eba741dedf3d3e8f9445eb0de8ab6c5943b9a9b","mem":{"limit":{"bytes":536870912},"usage":{"bytes":489992192}}}},"cpu":{"system":{"ticks":690,"time":{"ms":694}},"total":{"ticks":9600,"time":{"ms":9609},"value":9600},"user":{"ticks":8910,"time":{"ms":8915}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":84},"info":{"ephemeral_id":"fa8851e1-75dc-43c5-9cb5-ca1203e92c50","uptime":{"ms":130299}},"memstats":{"gc_next":487526256,"memory_alloc":369537568,"memory_sys":430970920,"memory_total":1251564552,"rss":562049024},"runtime":{"goroutines":116}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":1100,"active":50,"batches":23,"total":1150},"read":{"bytes":28325},"type":"elasticsearch","write":{"bytes":1149018}},"pipeline":{"clients":0,"events":{"active":4116,"failed":4,"published":5216,"retry":50,"total":5220},"queue":{"acked":1100,"max_events":4096}}},"system":{"cpu":{"cores":2},"load":{"1":2.26,"15":3.03,"5":2.43,"norm":{"1":1.13,"15":1.515,"5":1.215}}}}}}
2021-08-10T13:59:34.023Z INFO [monitoring] log/log.go:153 Uptime: 2m10.304298888s
2021-08-10T13:59:34.023Z INFO [monitoring] log/log.go:130 Stopping metrics logging.
2021-08-10T13:59:34.023Z INFO [api] api/server.go:66 Stats endpoint (127.0.0.1:5066) finished: accept tcp 127.0.0.1:5066: use of closed network connection
2021-08-10T13:59:34.023Z INFO instance/beat.go:479 metricbeat stopped.
@Sebor do you still face this issue with the latest metricbeat binary?
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
@jlind23 I can't answer the question because I gave up metricbeat :)
Backlog grooming: Closing for now until further update.
haha, whatever
@terminalmage reopen it as it seems that you are still facing it.
There are a few different reports in this issue, I'm not sure if they all have the same root cause.
Metricbeat (and I think all the Beats) explicitly handles SIGHUP and uses it as a signal to gracefully exit. See https://github.com/elastic/beats/commit/e9537aa779f1988a1f33ef6028a908615eb19e34
Metricbeat terminating when it receives SIGHUP, because the terminal it was running in as a foreground process has been closed, is the behaviour you would expect from this.
Metricbeat itself isn't in control of when it receives SIGHUP, so I'm not sure what fix we could provide here besides changing the signal handling behaviour.
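For anyone reading along, here is a minimal sketch (simplified, not the actual libbeat code from that commit) of what treating SIGHUP as a stop signal looks like in Go, and, in the comment, the kind of one-line change that ignoring it instead would amount to:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Treat SIGHUP the same as SIGINT/SIGTERM: any of them triggers a
	// graceful shutdown. Ignoring hangups instead would be roughly:
	//   signal.Ignore(syscall.SIGHUP)
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM)

	fmt.Println("running; send SIGHUP/SIGINT/SIGTERM to stop")
	sig := <-sigs
	fmt.Printf("received %v, stopping\n", sig)
	// ... flush queues, stop modules, then exit ...
}
```

With handling like this, a SIGHUP that arrives because the controlling terminal went away is indistinguishable from an operator requesting a graceful stop, which matches the "Received sighup, stopping" debug line quoted above.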
It's being run by sysvinit, not directly from the TTY that SSH'ed in. And I don't see this behavior with any other daemons in our test infra, so this isn't "expected behavior" IMO.
But to be honest, I worked around this in our test suite (quite literally) years ago. The problem doesn't appear under systemd, and CentOS 6 has reached EOL since I first reported the behavior. At this point, I probably care as much about this being fixed as you do. :man_shrugging:
Thanks for the follow up. We'll assume this was some quirk of sysvinit on CentOS 6 unless other reports of this on more recent versions show up.
Same issue in Ubuntu; Metricbeat stops automatically.
Metricbeat stops after some time on all my instances, with the same log output as shown below:
2019-07-25T04:18:40.303Z DEBUG [elasticsearch] elasticsearch/client.go:339 PublishEvents: 20 events have been published to elasticsearch in 25.030741ms.
2019-07-25T04:18:40.304Z DEBUG [publisher] memqueue/ackloop.go:160 ackloop: receive ack [792: 0, 20]
2019-07-25T04:18:40.304Z DEBUG [publisher] memqueue/eventloop.go:535 broker ACK events: count=13, start-seq=10287, end-seq=10299
2019-07-25T04:18:40.304Z DEBUG [publisher] memqueue/eventloop.go:535 broker ACK events: count=7, start-seq=925, end-seq=931
2019-07-25T04:18:40.304Z DEBUG [publisher] memqueue/ackloop.go:128 ackloop: return ack to broker loop:20
2019-07-25T04:18:40.304Z DEBUG [publisher] memqueue/ackloop.go:131 ackloop: done send ack
2019-07-25T04:18:42.820Z DEBUG [service] service/service.go:55 Received sighup, stopping
2019-07-25T04:18:42.820Z INFO cfgfile/reload.go:230 Dynamic config reloader stopped
2019-07-25T04:18:42.820Z INFO [reload] cfgfile/list.go:118 Stopping 3 runners ...
2019-07-25T04:18:42.820Z DEBUG [reload] cfgfile/list.go:129 Stopping runner: system [metricsets=2]
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:149 client: closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:151 client: done closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:155 client: cancelled 0 events
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=fsstat, host=]
2019-07-25T04:18:42.820Z DEBUG [reload] cfgfile/list.go:129 Stopping runner: system [metricsets=1]
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:149 client: closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:151 client: done closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:155 client: cancelled 0 events
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=uptime, host=]
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:145 Stopped Wrapper[name=system, len(metricSetWrappers)=1]
2019-07-25T04:18:42.820Z DEBUG [reload] cfgfile/list.go:131 Stopped runner: system [metricsets=1]
2019-07-25T04:18:42.820Z DEBUG [reload] cfgfile/list.go:129 Stopping runner: system [metricsets=7]
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:149 client: closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:151 client: done closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:155 client: cancelled 0 events
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=cpu, host=]
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=filesystem, host=]
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:145 Stopped Wrapper[name=system, len(metricSetWrappers)=2]
2019-07-25T04:18:42.820Z DEBUG [reload] cfgfile/list.go:131 Stopped runner: system [metricsets=2]
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=process, host=]
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=memory, host=]
2019-07-25T04:18:42.821Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=network, host=]
2019-07-25T04:18:42.821Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=process_summary, host=]
2019-07-25T04:18:42.821Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=socket_summary, host=]
2019-07-25T04:18:42.821Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=load, host=]
2019-07-25T04:18:42.821Z DEBUG [module] module/wrapper.go:145 Stopped Wrapper[name=system, len(metricSetWrappers)=7]
2019-07-25T04:18:42.821Z DEBUG [reload] cfgfile/list.go:131 Stopped runner: system [metricsets=7]
2019-07-25T04:18:42.822Z INFO [monitoring] log/log.go:153 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":17480,"time":{"ms":17486}},"total":{"ticks":45780,"time":{"ms":45795},"value":45780},"user":{"ticks":28300,"time":{"ms":28309}}},"handles":{"limit":{"hard":1048576,"soft":1024},"open":6},"info":{"ephemeral_id":"8f1d9c9b-509d-4878-8ba1-a818dc12fa34","uptime":{"ms":7923590}},"memstats":{"gc_next":12860064,"memory_alloc":6875352,"memory_total":7567558816,"rss":58175488},"runtime":{"goroutines":14}},"libbeat":{"config":{"module":{"running":0},"reloads":1},"output":{"events":{"acked":11239,"batches":793,"total":11239},"read":{"bytes":350161},"type":"elasticsearch","write":{"bytes":27570717}},"pipeline":{"clients":0,"events":{"active":0,"published":11239,"retry":25,"total":11239},"queue":{"acked":11239}}},"metricbeat":{"system":{"cpu":{"events":793,"success":793},"filesystem":{"events":798,"success":798},"fsstat":{"events":133,"success":133},"load":{"events":793,"success":793},"memory":{"events":793,"success":793},"network":{"events":1586,"success":1586},"process":{"events":4748,"success":4748},"process_summary":{"events":793,"success":793},"socket_summary":{"events":793,"success":793},"uptime":{"events":9,"success":9}}},"system":{"cpu":{"cores":2},"load":{"1":0,"15":0,"5":0.02,"norm":{"1":0,"15":0,"5":0.01}}}}}}
2019-07-25T04:18:42.822Z INFO [monitoring] log/log.go:154 Uptime: 2h12m3.591338628s
2019-07-25T04:18:42.822Z INFO [monitoring] log/log.go:131 Stopping metrics logging.
2019-07-25T04:18:42.822Z INFO instance/beat.go:431 metricbeat stopped.
What is the reason behind this? I am not able to figure out the issue from the above log output. Looking forward to some help. Thanks