elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Metricbeat windows service metrics stops sending documents when a single service fails #40765

Open TheRiffRafi opened 2 months ago

TheRiffRafi commented 2 months ago

Multiple elastic-agent installations are failing to send the windows.service metricset for the Windows integration. The system integration continues to send data without issues. The problem happens at random and is resolved by restarting the Elastic Agent. ~The issue happens in different versions of 8.x for elastic-agent and it hasn't been confirmed as occurring on the latest version (as the user who has experienced this has not upgraded to the latest version yet).~ So far the issue has only been seen on 8.10.4.

The error reported by metricbeat is the following:

{"log.level":"error","@timestamp":"2024-07-29T20:49:33.157Z","message":"Error fetching data for metricset windows.service: OpenProcess failed for pid=1724: The parameter is incorrect.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"windows/metrics-default","type":"windows/metrics"},"log":{"source":"windows/metrics-default"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

So far the error points to a problem with only one particular Windows service. However, once that service gets into an unexpected state, the entire Metricbeat windows.service metricset stops reporting, so none of the other monitored services are reported either.

Because this happens at random, we are unable to set up debug logging to catch the failure, and the logger for this function does not provide any more information.

We need to address 2 items with this issue:

  1. Windows service monitoring stops sending stats for ANY service once a single service gets into an unexpected state (this fits a bug description).
  2. No log message specifies what that unexpected state was, or why service metrics for the other services stop being sent (this fits a feature request that may or may not be necessary to address point 1).
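
To illustrate the behavior point 1 asks for, here is a minimal Go sketch of a collection loop that records a per-service failure and keeps going instead of aborting the whole fetch. It is not Metricbeat's actual code: `serviceInfo`, `lookup`, `collect`, and `fakeQuery` are hypothetical names, and the failing pid and error text only mirror the log line above.

```go
package main

import (
	"errors"
	"fmt"
)

// serviceInfo is a hypothetical stand-in for the per-service data a metricset reports.
type serviceInfo struct {
	Name string
	PID  int
}

// lookup is a hypothetical per-service query. In a real metricset this
// would wrap the OS calls that can fail for a single process.
type lookup func(pid int) (string, error)

// collect queries every pid. A failure for one pid is recorded with
// enough context to identify it, and the loop continues with the rest.
func collect(pids []int, query lookup) ([]serviceInfo, error) {
	var out []serviceInfo
	var errs []error
	for _, pid := range pids {
		name, err := query(pid)
		if err != nil {
			errs = append(errs, fmt.Errorf("pid=%d: %w", pid, err))
			continue // skip only the failing service
		}
		out = append(out, serviceInfo{Name: name, PID: pid})
	}
	// errors.Join returns nil when errs is empty.
	return out, errors.Join(errs...)
}

// fakeQuery simulates one service in a bad state (assumed pid 1724).
func fakeQuery(pid int) (string, error) {
	if pid == 1724 {
		return "", errors.New("the parameter is incorrect")
	}
	return fmt.Sprintf("svc-%d", pid), nil
}

func main() {
	services, err := collect([]int{1000, 1724, 2000}, fakeQuery)
	fmt.Println(len(services), "services reported")
	if err != nil {
		fmt.Println("partial failure:", err)
	}
}
```

Returning the partial results together with a joined error lets the caller publish the healthy services and still log which one failed, which also touches on point 2.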

elasticmachine commented 2 months ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz commented 2 months ago

@VihasMakwana I think I saw that you had root-caused the "OpenProcess failed for pid=1724: The parameter is incorrect" error elsewhere? Or am I misremembering?

VihasMakwana commented 2 months ago

@cmacknz yes, that's correct.

On my personal desktop, metricbeat wasn't able to access the following processes, running as root:

This was for the system.process integration, though. The issue above is about the windows.service integration, but I believe the root cause is similar.


@TheRiffRafi do you see any warning related to SeDebugPrivilege at the beginning of the logs? Something like "Metricbeat is running without SeDebugPrivilege, a Windows privilege that allows it to collect metrics...", "Failure while attempting to enable SeDebugPrivilege", or "Metricbeat failed to enable the SeDebugPrivilege"? Can you attach the logs from the beginning, if possible?

TheRiffRafi commented 1 month ago

Hello @VihasMakwana!

Unfortunately I can't help with logs. In every instance of the failure I have, the logs begin after the problem had already started. We have never caught a system going from a working state to the failing state (the systems are going weeks without reporting the service).

Also, I have to correct the original description: we have only seen this on 8.10.4. We haven't tested a more recent version, as the user's entire stack is still on 8.10.4. It was a misunderstanding that we had seen this problem on a later version.