elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Metricbeat: The beat/stats module will frequently log errors about missing cluster UUIDs #34217

Open cmacknz opened 1 year ago

cmacknz commented 1 year ago

The Elastic Agent uses the Metricbeat beat stats module to collect metrics for the Beats it starts. Until those Beats connect to Elasticsearch, the agent logs fill up with errors like the one below, which aren't particularly helpful. The Beats only obtain a cluster UUID when they publish their first event, so if, for example, a log source never updates or changes slowly, this error can appear in the agent logs quite frequently.

{"log.level":"error","@timestamp":"2022-12-22T14:26:36.306Z","message":"Error fetching data for metricset beat.stats: monitored beat is using Elasticsearch output but cluster UUID cannot be determined","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"ecs.version":"1.6.0","log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"service.name":"metricbeat","ecs.version":"1.6.0"}

This error is coming from this code:

https://github.com/elastic/beats/blob/64f98cacc874e225ffb66c0f29741c66ff841636/metricbeat/module/beat/stats/stats.go#L79-L100

Why do we need an ES cluster UUID to collect beat stats? Is there a way to bypass this or suppress this warning?

elasticmachine commented 1 year ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

JAmorimNeon commented 1 year ago

I'm facing this problem too! Elastic version 8.6.0

herbc2 commented 1 year ago

Same here with 8.6.0

engarpe commented 1 year ago

I'm facing the same issues with 8.6.0 self managed

belimawr commented 1 year ago

@cmacknz I'm not quite sure, but there seems to be a related issue that leads to a panic: https://github.com/elastic/beats/issues/34384

klacabane commented 1 year ago

The cluster UUID is required for the Stack Monitoring application to properly tie a Beat to its Elasticsearch cluster. This is mainly driven by the business logic of SM, as without this information the application would show an incorrect state for the impacted Beat processes.

Given that this issue should be transient and disappear once the Beat successfully connects to ES, is there a need to suppress this warning? If the issue persists, it would surface a deeper problem in the monitored Beat process, and at that point it is valuable to get it logged. Should we consider a lower logging level? Should the Beats API not return a successful response unless it is consistent with its configuration?

cmacknz commented 1 year ago

I think the root cause here is that the Beats lazily connect to Elasticsearch when they have events to send. So Filebeat for example will not connect for the first time until there is data to send.

This can lead to valid situations where we are repeatedly seeing this log message because the file being monitored hasn't updated since the last time Filebeat was started.

@belimawr and I spoke, and a better solution to this problem is likely to make an initial connection attempt as soon as the Beat is initialized, so we can grab the cluster UUID and also detect problems in the output configuration much earlier.

cmacknz commented 1 year ago

Generally this log message is harmless and is just log spam, because if the Beat has tried and failed to connect to Elasticsearch there will be other more obvious errors related to that in the logs.

yevgenytrcloudzone commented 1 year ago

@cmacknz the importance of the message is not in question. The problem is the flood of error-severity messages in the agent log, which creates far too much noise.

klacabane commented 1 year ago

I'll look into reducing how often this message is logged and lowering its severity, considering that a failure to connect to the ES output would already be logged.
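
For illustration only, here is a minimal sketch of what throttling a repeated message at a lower severity could look like. It is not the change that was made in Metricbeat: the `throttle` type, the `countEmitted` helper, the one-minute interval, the 10-second fetch spacing, and the message text are all assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// throttle lets a repeated message through at most once per interval.
// Hypothetical helper for illustration; not part of Metricbeat.
type throttle struct {
	interval time.Duration
	last     time.Time
}

// allow reports whether the message may be logged at time now.
func (t *throttle) allow(now time.Time) bool {
	if !t.last.IsZero() && now.Sub(t.last) < t.interval {
		return false
	}
	t.last = now
	return true
}

// countEmitted simulates n failed fetches spaced stepSeconds apart and
// returns how many of them would be logged with the given interval.
func countEmitted(n, stepSeconds, intervalSeconds int) int {
	t := &throttle{interval: time.Duration(intervalSeconds) * time.Second}
	start := time.Date(2022, 12, 22, 14, 26, 36, 0, time.UTC)
	emitted := 0
	for i := 0; i < n; i++ {
		now := start.Add(time.Duration(i*stepSeconds) * time.Second)
		if t.allow(now) {
			// Printed as debug instead of error: an assumed lower severity.
			// The message text is a placeholder, not a real Metricbeat log.
			fmt.Printf("%s level=debug msg=%q\n",
				now.Format(time.RFC3339), "cluster UUID cannot be determined yet")
			emitted++
		}
	}
	return emitted
}

func main() {
	// Seven fetches, 10 seconds apart, with a 60-second throttle.
	fmt.Println("logged:", countEmitted(7, 10, 60))
}
```

The trade-off with this kind of throttle is that a persistent failure is still reported, only less often, so the signal klacabane wants to keep is not lost entirely.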

miltonhultgren commented 1 year ago

@cmacknz Is there some way to verify which Beat is still waiting to connect to Elasticsearch? And is there some Beat setup in the default Agent settings that would lazily connect like this? So that we can check that the error indeed goes away once that Beat has a reason to send its first document.

cmacknz commented 1 year ago

All the Beats lazily connect as far as I know; Metricbeat and Filebeat certainly do.

If you can modify the Beat code for this experiment, I would just add a log statement when the clusterUUIDFetchingCallback is registered and another one when it is actually executed.

https://github.com/elastic/beats/blob/0587bb0d175a7cf338e8280ea03ee74b7a2b3b96/libbeat/cmd/instance/beat.go#L1135
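
As a rough, standalone model of that experiment (it does not use the libbeat API), the sketch below records one log line when a callback is registered and another when it runs on first connection. The `lazyOutput` type, its methods, the `simulate` helper, and the placeholder UUID are hypothetical names invented for this example; only the idea of logging at registration and at execution comes from the suggestion above.

```go
package main

import "fmt"

// connectCallback stands in for a callback that receives the cluster UUID
// once a connection exists. Hypothetical type, not the libbeat signature.
type connectCallback func(clusterUUID string)

// lazyOutput models an output that connects only when the first event
// is published, as described in this thread.
type lazyOutput struct {
	callbacks []connectCallback
	connected bool
	logs      []string
}

// register stores the callback and logs that registration happened.
func (o *lazyOutput) register(cb connectCallback) {
	o.logs = append(o.logs, "cluster UUID callback registered")
	o.callbacks = append(o.callbacks, cb)
}

// publish connects on the first event and runs the callbacks exactly once.
func (o *lazyOutput) publish(event string) {
	if o.connected {
		return
	}
	o.connected = true
	for _, cb := range o.callbacks {
		o.logs = append(o.logs, "cluster UUID callback executing")
		cb("placeholder-cluster-uuid") // placeholder value, not a real UUID
	}
}

// simulate registers one callback, publishes the given number of events,
// and returns the log lines plus the UUID seen by the callback.
func simulate(events int) ([]string, string) {
	out := &lazyOutput{}
	uuid := ""
	out.register(func(id string) { uuid = id })
	for i := 0; i < events; i++ {
		out.publish(fmt.Sprintf("event-%d", i))
	}
	return out.logs, uuid
}

func main() {
	for _, n := range []int{0, 2} {
		logs, uuid := simulate(n)
		fmt.Printf("events=%d uuid=%q logs=%q\n", n, uuid, logs)
	}
}
```

With zero events only the registration line appears and the UUID stays empty, which mirrors the situation where the monitored Beat has nothing to send yet.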

Without modifying the Beat, in the agent logs you'll see something like the following when a Beat does eventually connect to Elasticsearch:

{"log.level":"info","@timestamp":"2023-03-22T08:54:21.468Z","message":"Connection to backoff(elasticsearch(https://$domain.europe-west1.gcp.cloud.es.io:443)) established","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"log-default","type":"log"},"log":{"source":"log-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"publisher_pipeline_output","log.origin":{"file.line":147,"file.name":"pipeline/client_worker.go"},"ecs.version":"1.6.0"}
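
To check which components have already connected, a small filter over the agent logs can help. The sketch below assumes the logs are newline-delimited JSON shaped like the entries quoted in this thread, and that the message always starts with "Connection to " and ends with " established"; the function name `connectedComponentIDs` and reading from standard input are choices made for this example.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// logEntry holds only the fields this filter needs from an agent log line.
type logEntry struct {
	Message   string `json:"message"`
	Component struct {
		ID string `json:"id"`
	} `json:"component"`
}

// connectedComponentIDs returns the distinct component IDs that logged a
// "Connection to ... established" message, in order of first appearance.
// Lines that are not valid JSON are skipped.
func connectedComponentIDs(lines []string) []string {
	seen := map[string]bool{}
	var ids []string
	for _, line := range lines {
		var e logEntry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			continue
		}
		established := strings.HasPrefix(e.Message, "Connection to ") &&
			strings.HasSuffix(e.Message, " established")
		if established && !seen[e.Component.ID] {
			seen[e.Component.ID] = true
			ids = append(ids, e.Component.ID)
		}
	}
	return ids
}

func main() {
	// Read log lines from standard input; allow long lines up to 1 MiB.
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	var lines []string
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	for _, id := range connectedComponentIDs(lines) {
		fmt.Println(id)
	}
}
```

Any component ID that keeps producing the cluster UUID error but never shows up in this output is a candidate for a Beat that has not yet sent its first event.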

botelastic[bot] commented 3 weeks ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!