elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Metricbeat stops after some time #13066

Closed: akmi closed this issue 2 years ago

akmi commented 4 years ago

Metricbeat stops after some time on all my instances, with the same log output shown below:

2019-07-25T04:18:40.303Z DEBUG [elasticsearch] elasticsearch/client.go:339 PublishEvents: 20 events have been published to elasticsearch in 25.030741ms.
2019-07-25T04:18:40.304Z DEBUG [publisher] memqueue/ackloop.go:160 ackloop: receive ack [792: 0, 20]
2019-07-25T04:18:40.304Z DEBUG [publisher] memqueue/eventloop.go:535 broker ACK events: count=13, start-seq=10287, end-seq=10299
2019-07-25T04:18:40.304Z DEBUG [publisher] memqueue/eventloop.go:535 broker ACK events: count=7, start-seq=925, end-seq=931
2019-07-25T04:18:40.304Z DEBUG [publisher] memqueue/ackloop.go:128 ackloop: return ack to broker loop:20
2019-07-25T04:18:40.304Z DEBUG [publisher] memqueue/ackloop.go:131 ackloop: done send ack
2019-07-25T04:18:42.820Z DEBUG [service] service/service.go:55 Received sighup, stopping
2019-07-25T04:18:42.820Z INFO cfgfile/reload.go:230 Dynamic config reloader stopped
2019-07-25T04:18:42.820Z INFO [reload] cfgfile/list.go:118 Stopping 3 runners ...
2019-07-25T04:18:42.820Z DEBUG [reload] cfgfile/list.go:129 Stopping runner: system [metricsets=2]
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:149 client: closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:151 client: done closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:155 client: cancelled 0 events
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=fsstat, host=]
2019-07-25T04:18:42.820Z DEBUG [reload] cfgfile/list.go:129 Stopping runner: system [metricsets=1]
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:149 client: closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:151 client: done closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:155 client: cancelled 0 events
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=uptime, host=]
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:145 Stopped Wrapper[name=system, len(metricSetWrappers)=1]
2019-07-25T04:18:42.820Z DEBUG [reload] cfgfile/list.go:131 Stopped runner: system [metricsets=1]
2019-07-25T04:18:42.820Z DEBUG [reload] cfgfile/list.go:129 Stopping runner: system [metricsets=7]
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:149 client: closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:151 client: done closing acker
2019-07-25T04:18:42.820Z DEBUG [publisher] pipeline/client.go:155 client: cancelled 0 events
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=cpu, host=]
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=filesystem, host=]
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:145 Stopped Wrapper[name=system, len(metricSetWrappers)=2]
2019-07-25T04:18:42.820Z DEBUG [reload] cfgfile/list.go:131 Stopped runner: system [metricsets=2]
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=process, host=]
2019-07-25T04:18:42.820Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=memory, host=]
2019-07-25T04:18:42.821Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=network, host=]
2019-07-25T04:18:42.821Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=process_summary, host=]
2019-07-25T04:18:42.821Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=socket_summary, host=]
2019-07-25T04:18:42.821Z DEBUG [module] module/wrapper.go:202 Stopped metricSetWrapper[module=system, name=load, host=]
2019-07-25T04:18:42.821Z DEBUG [module] module/wrapper.go:145 Stopped Wrapper[name=system, len(metricSetWrappers)=7]
2019-07-25T04:18:42.821Z DEBUG [reload] cfgfile/list.go:131 Stopped runner: system [metricsets=7]
2019-07-25T04:18:42.822Z INFO [monitoring] log/log.go:153 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":17480,"time":{"ms":17486}},"total":{"ticks":45780,"time":{"ms":45795},"value":45780},"user":{"ticks":28300,"time":{"ms":28309}}},"handles":{"limit":{"hard":1048576,"soft":1024},"open":6},"info":{"ephemeral_id":"8f1d9c9b-509d-4878-8ba1-a818dc12fa34","uptime":{"ms":7923590}},"memstats":{"gc_next":12860064,"memory_alloc":6875352,"memory_total":7567558816,"rss":58175488},"runtime":{"goroutines":14}},"libbeat":{"config":{"module":{"running":0},"reloads":1},"output":{"events":{"acked":11239,"batches":793,"total":11239},"read":{"bytes":350161},"type":"elasticsearch","write":{"bytes":27570717}},"pipeline":{"clients":0,"events":{"active":0,"published":11239,"retry":25,"total":11239},"queue":{"acked":11239}}},"metricbeat":{"system":{"cpu":{"events":793,"success":793},"filesystem":{"events":798,"success":798},"fsstat":{"events":133,"success":133},"load":{"events":793,"success":793},"memory":{"events":793,"success":793},"network":{"events":1586,"success":1586},"process":{"events":4748,"success":4748},"process_summary":{"events":793,"success":793},"socket_summary":{"events":793,"success":793},"uptime":{"events":9,"success":9}}},"system":{"cpu":{"cores":2},"load":{"1":0,"15":0,"5":0.02,"norm":{"1":0,"15":0,"5":0.01}}}}}}
2019-07-25T04:18:42.822Z INFO [monitoring] log/log.go:154 Uptime: 2h12m3.591338628s
2019-07-25T04:18:42.822Z INFO [monitoring] log/log.go:131 Stopping metrics logging.
2019-07-25T04:18:42.822Z INFO instance/beat.go:431 metricbeat stopped.

What is the reason behind this? I am not able to figure out the issue from the log output above. Looking forward to some help. Thanks

Pomyk commented 4 years ago

2019-07-25T04:18:42.820Z DEBUG [service] service/service.go:55 Received sighup, stopping

Looks like something is sending a SIGHUP signal to the metricbeat process.

terminalmage commented 4 years ago

I'm seeing the same behavior on CentOS 6. This is happening in our automated testing environment, within Docker. metricbeat starts successfully and connects to Kafka for output, but gets SIGHUP'd by the kernel after just a few seconds. If I manually start the service after this happens, it runs just fine.

I strace'd the process to see the source of the SIGHUP, and it's coming from the kernel:

--- SIGHUP {si_signo=SIGHUP, si_code=SI_KERNEL} ---
rt_sigprocmask(SIG_UNBLOCK, [HUP], NULL, 8) = 0
gettid()                                = 2332
tkill(2332, SIGHUP)                     = 0
--- SIGHUP {si_signo=SIGHUP, si_code=SI_TKILL, si_pid=2332, si_uid=0} ---
rt_sigaction(SIGHUP, {SIG_DFL, ~[], SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x7f3567f817e0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [HUP], NULL, 8) = 0
gettid()                                = 2332
tkill(2332, SIGHUP)                     = 0
--- SIGHUP {si_signo=SIGHUP, si_code=SI_TKILL, si_pid=2332, si_uid=0} ---
+++ killed by SIGHUP +++

However, nothing ran the initscript to stop metricbeat, which I was able to confirm by logging into the Docker instance and adding the following line to the initscript, in between when metricbeat was installed and when it was started:

echo "ran initscript with opts: $@" >>/tmp/mbinit.log

The result of this was the following:

ran initscript with opts: status
ran initscript with opts: start
ran initscript with opts: status
ran initscript with opts: status
ran initscript with opts: status

So for some reason the kernel is SIGHUP'ing metricbeat when it starts the first time, but not any subsequent time, and I have no idea why. I tried several 6.x and 7.x releases of metricbeat and found the same behavior.

terminalmage commented 4 years ago

I just switched from Docker to Vagrant and I get the same behavior on CentOS 6.10 in both. If anything, I would expect a SIGHUP to be sent to metricbeat-god instead of metricbeat.

I've ruled out SELinux, as it is disabled in our test environment for both Docker and Vagrant.

terminalmage commented 4 years ago

I should also note that I don't see this behavior at all in CentOS 7.

terminalmage commented 4 years ago

Some further info: I'm starting metricbeat via test-kitchen. It makes an SSH connection to run its commands, and when it runs the init script, metricbeat starts as a foreground process (note that the output below was captured in the several-second interval while the process was still running, before test-kitchen disconnected from SSH):

[kitchen@localhost ~]$ ps axc -o pid,ppid,tty,stat,time,command | grep metricbeat | grep -v grep
 7731     1 pts/1    S+   00:00:00 metricbeat-god
 7732  7731 pts/1    Sl+  00:00:00 metricbeat

Note the +, identifying it as a foreground process. Once test-kitchen disconnects from SSH, the TTY is closed and the process is killed. When I SSH into the container and run the initscript myself, the process doesn't have the "+" next to it:

[kitchen@localhost ~]$ sudo /sbin/service metricbeat start
Starting metricbeat:                                       [  OK  ]
[kitchen@localhost ~]$ ps axc -o pid,ppid,tty,stat,time,command | grep metricbeat | grep -v grep
 9719     1 pts/0    S    00:00:00 metricbeat-god
 9720  9719 pts/0    Sl   00:00:00 metricbeat

Looks like this might just be an upstream bug, at least in my case.
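For anyone who wants to confirm the TTY theory on their own setup, a throwaway probe like the sketch below can help. It is hypothetical and not part of Beats (the file name sigprobe.go is made up): start it the same way the init script starts metricbeat, then let the SSH session close, and it will print every signal it receives, so a SIGHUP delivered on TTY close is easy to spot.

```go
// sigprobe.go: hypothetical standalone diagnostic, not part of Beats.
// Run it the way the init script runs metricbeat, then close the SSH
// session / TTY and watch which signals arrive.
package main

import (
	"fmt"
	"os"
	"os/signal"
)

func main() {
	sigs := make(chan os.Signal, 8)
	// With no specific signals listed, Notify relays every incoming
	// signal the Go runtime can catch to the channel, so nothing
	// terminates the probe silently.
	signal.Notify(sigs)

	fmt.Println("sigprobe running, pid:", os.Getpid())
	for s := range sigs {
		fmt.Println("received signal:", s)
	}
}
```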

botelastic[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

terminalmage commented 3 years ago

This is still very much a problem.

elasticmachine commented 3 years ago

Pinging @elastic/integrations-services (Team:Services)

botelastic[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

terminalmage commented 2 years ago

Still a problem.

Sebor commented 2 years ago

I have the same issue in k8s (using Prometheus remote write).

k8s container status:

terminated
Reason: Completed - exit code: 0

logs:

2021-08-10T13:58:55.726Z    INFO    [monitoring]    log/log.go:144  Non-zero metrics in the last 30s    {"monitoring": {"metrics": {"beat":{"cgroup":{"cpu":{"stats":{"periods":301,"throttled":{"ns":31997241754,"periods":261}}},"cpuacct":{"total":{"ns":2949814152}},"memory":{"mem":{"usage":{"bytes":49082368}}}},"cpu":{"system":{"ticks":520,"time":{"ms":134}},"total":{"ticks":6460,"time":{"ms":2407},"value":6460},"user":{"ticks":5940,"time":{"ms":2273}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":211},"info":{"ephemeral_id":"fa8851e1-75dc-43c5-9cb5-ca1203e92c50","uptime":{"ms":91799}},"memstats":{"gc_next":167169808,"memory_alloc":135552736,"memory_sys":851968,"memory_total":845988144,"rss":479932416},"runtime":{"goroutines":309}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":350,"active":50,"batches":7,"total":350},"read":{"bytes":6758},"write":{"bytes":399078}},"pipeline":{"clients":1,"events":{"active":4117,"published":350,"total":350},"queue":{"acked":350}}},"metricbeat":{"prometheus":{"remote_write":{"events":350,"success":350}}},"system":{"load":{"1":2.78,"15":3.09,"5":2.54,"norm":{"1":1.39,"15":1.545,"5":1.27}}}}}}
2021-08-10T13:59:32.220Z    INFO    [monitoring]    log/log.go:144  Non-zero metrics in the last 30s    {"monitoring": {"metrics": {"beat":{"cgroup":{"cpu":{"stats":{"periods":300,"throttled":{"ns":30203836939,"periods":254}}},"cpuacct":{"total":{"ns":2992549419}},"memory":{"mem":{"usage":{"bytes":39141376}}}},"cpu":{"system":{"ticks":600,"time":{"ms":166}},"total":{"ticks":8910,"time":{"ms":2956},"value":8910},"user":{"ticks":8310,"time":{"ms":2790}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":200},"info":{"ephemeral_id":"fa8851e1-75dc-43c5-9cb5-ca1203e92c50","uptime":{"ms":121798}},"memstats":{"gc_next":387447184,"memory_alloc":351301280,"memory_sys":69536776,"memory_total":1179790592,"rss":518823936},"runtime":{"goroutines":175}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":150,"active":50,"batches":3,"total":150},"read":{"bytes":2891},"write":{"bytes":154000}},"pipeline":{"clients":1,"events":{"active":4117,"published":150,"total":150},"queue":{"acked":150}}},"metricbeat":{"prometheus":{"remote_write":{"events":150,"success":150}}},"system":{"load":{"1":2.46,"15":3.04,"5":2.47,"norm":{"1":1.23,"15":1.52,"5":1.235}}}}}}
2021-08-10T13:59:34.023Z    INFO    [monitoring]    log/log.go:152  Total non-zero metrics  {"monitoring": {"metrics": {"beat":{"cgroup":{"cpu":{"cfs":{"period":{"us":100000},"quota":{"us":10000}},"id":"d33fb58de20a7da059e418dd4eba741dedf3d3e8f9445eb0de8ab6c5943b9a9b","stats":{"periods":1208,"throttled":{"ns":119198093406,"periods":1012}}},"cpuacct":{"id":"d33fb58de20a7da059e418dd4eba741dedf3d3e8f9445eb0de8ab6c5943b9a9b","total":{"ns":11980896782}},"memory":{"id":"d33fb58de20a7da059e418dd4eba741dedf3d3e8f9445eb0de8ab6c5943b9a9b","mem":{"limit":{"bytes":536870912},"usage":{"bytes":489992192}}}},"cpu":{"system":{"ticks":690,"time":{"ms":694}},"total":{"ticks":9600,"time":{"ms":9609},"value":9600},"user":{"ticks":8910,"time":{"ms":8915}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":84},"info":{"ephemeral_id":"fa8851e1-75dc-43c5-9cb5-ca1203e92c50","uptime":{"ms":130299}},"memstats":{"gc_next":487526256,"memory_alloc":369537568,"memory_sys":430970920,"memory_total":1251564552,"rss":562049024},"runtime":{"goroutines":116}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":1100,"active":50,"batches":23,"total":1150},"read":{"bytes":28325},"type":"elasticsearch","write":{"bytes":1149018}},"pipeline":{"clients":0,"events":{"active":4116,"failed":4,"published":5216,"retry":50,"total":5220},"queue":{"acked":1100,"max_events":4096}}},"system":{"cpu":{"cores":2},"load":{"1":2.26,"15":3.03,"5":2.43,"norm":{"1":1.13,"15":1.515,"5":1.215}}}}}}
2021-08-10T13:59:34.023Z    INFO    [monitoring]    log/log.go:153  Uptime: 2m10.304298888s
2021-08-10T13:59:34.023Z    INFO    [monitoring]    log/log.go:130  Stopping metrics logging.
2021-08-10T13:59:34.023Z    INFO    [api]   api/server.go:66    Stats endpoint (127.0.0.1:5066) finished: accept tcp 127.0.0.1:5066: use of closed network connection
2021-08-10T13:59:34.023Z    INFO    instance/beat.go:479    metricbeat stopped.

jlind23 commented 2 years ago

@Sebor do you still face this issue with the latest metricbeat binary?

elasticmachine commented 2 years ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Sebor commented 2 years ago

@jlind23 I can't answer the question because I gave up on metricbeat :)

jlind23 commented 2 years ago

Backlog grooming: Closing for now until further update.

terminalmage commented 2 years ago

haha, whatever

jlind23 commented 2 years ago

@terminalmage reopen it as it seems that you are still facing it.

cmacknz commented 2 years ago

There are a few different reports in this issue, and I'm not sure they all have the same root cause.

Metricbeat (and I think all the Beats) explicitly handles SIGHUP and uses it as a signal to exit gracefully. See https://github.com/elastic/beats/commit/e9537aa779f1988a1f33ef6028a908615eb19e34

If Metricbeat was run as a foreground process and the terminal it was started from is closed, it receives SIGHUP and terminates; that is the behaviour you would expect given this handling.

Metricbeat itself isn't in control of when it receives SIGHUP, so I'm not sure what fix we could provide here besides changing the signal handling behaviour.
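To illustrate the handling described above, it looks roughly like the sketch below. This is a simplified, hypothetical illustration of treating SIGHUP the same as SIGINT/SIGTERM, not the actual code in service/service.go.

```go
// Simplified sketch of the behaviour described above: SIGHUP is registered
// alongside SIGINT and SIGTERM, and any of them triggers a graceful stop.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// handleSignals calls stop once any of the registered signals arrives.
func handleSignals(stop func()) {
	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, syscall.SIGINT, syscall.SIGTERM, syscall.SIGHUP)

	go func() {
		sig := <-sigc
		fmt.Printf("Received %v, stopping\n", sig)
		stop() // begin graceful shutdown, exactly as if SIGTERM had arrived
	}()
}

func main() {
	done := make(chan struct{})
	handleSignals(func() { close(done) })
	<-done // block until a signal asks us to stop
	fmt.Println("stopped")
}
```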

terminalmage commented 2 years ago

It's being run by sysvinit, not directly from the TTY of the SSH session. And I don't see this behavior with any other daemons in our test infra, so this isn't "expected behavior" IMO.

But to be honest, I worked around this in our test suite (quite literally) years ago. The problem doesn't appear under systemd, and CentOS 6 has reached EOL since I first reported the behavior. At this point, I probably care as much about this being fixed as you do. :man_shrugging:

cmacknz commented 2 years ago

Thanks for the follow up. We'll assume this was some quirk of sysvinit on CentOS 6 unless other reports of this on more recent versions show up.

42oach commented 5 months ago

Same issue on Ubuntu; Metricbeat stops automatically.