Open cmacknz opened 1 year ago
I'm facing this problem too! Elastic version 8.6.0
Same here with 8.6.0
I'm facing the same issues with 8.6.0 self managed
@cmacknz I'm not quite sure, but there seems to be a related issue that leads to panic: https://github.com/elastic/beats/issues/34384
The cluster UUID is required for the Stack Monitoring application to properly tie a Beat to its Elasticsearch cluster. This is mainly driven by the business logic of SM: without this information, the application would show an incorrect state for the impacted Beat processes.
Given that this issue should be transient and disappear once Beats successfully connects to ES, is there a need to suppress this warning? If the issue persists, it would point to a deeper problem in the monitored Beat process, and at that point it is valuable to have it logged. Should we consider a lower logging level? Should the Beats API not return a successful response unless it is consistent with its configuration?
I think the root cause here is that the Beats lazily connect to Elasticsearch when they have events to send. So Filebeat for example will not connect for the first time until there is data to send.
This can lead to valid situations where we are repeatedly seeing this log message because the file being monitored hasn't updated since the last time Filebeat was started.
@belimawr and I spoke and a better solution to this problem is likely to make an initial connection attempt as soon as the Beat is initialized so we can grab the cluster UUID and also detect if something is wrong in the output configuration much earlier.
Generally this log message is harmless and is just log spam, because if the Beat has tried and failed to connect to Elasticsearch there will be other more obvious errors related to that in the logs.
@cmacknz the importance of the message is not questioned. The problem is the flood of error severity messages in the agent log that creates way too much noise.
I'll look into reducing how often the message is logged and lowering its severity, considering that a failure to connect to the ES output would already be logged.
@cmacknz Is there some way to verify which Beat is still waiting to connect to Elasticsearch? And is there some Beat setup in the default Agent settings that would lazily connect like this? So that we can check that the error indeed goes away once that Beat has a reason to send its first document.
All the Beats lazily connect as far as I know, Metricbeat and Filebeat certainly do.
If you can modify the Beat code for this experiment, I would just add a log statement when the clusterUUIDFetchingCallback is registered and another one when it is actually executed.
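As a rough illustration of that experiment, the Go sketch below logs once when a callback is registered and again when it runs. The `registry` type, its methods, and the `connectCallback` signature are hypothetical stand-ins, not the actual Beats types; only the name clusterUUIDFetchingCallback comes from the Beat code.

```go
package main

import "log"

// connectCallback is a hypothetical callback type invoked once a connection
// to Elasticsearch has been established.
type connectCallback func(clusterUUID string) error

// registry is a hypothetical holder for connection callbacks.
type registry struct {
	callbacks []connectCallback
	logf      func(format string, args ...interface{})
}

// register logs the registration and wraps the callback so that its
// execution is logged as well.
func (r *registry) register(name string, cb connectCallback) {
	r.logf("registered connect callback %q", name)
	r.callbacks = append(r.callbacks, func(uuid string) error {
		r.logf("executing connect callback %q", name)
		return cb(uuid)
	})
}

// onConnect runs every registered callback, stopping at the first error.
func (r *registry) onConnect(uuid string) error {
	for _, cb := range r.callbacks {
		if err := cb(uuid); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	r := &registry{logf: log.Printf}
	r.register("clusterUUIDFetchingCallback", func(uuid string) error {
		log.Printf("stored cluster UUID %s", uuid)
		return nil
	})
	// With lazy connection, this call only happens once the first event is
	// published; here it is simulated directly.
	if err := r.onConnect("example-uuid"); err != nil {
		log.Printf("callback failed: %v", err)
	}
}
```

If the "registered" line appears in the logs but the "executing" line does not, that Beat has not yet connected to Elasticsearch.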
Without modifying the Beat, in the agent logs you'll see something like the following when a Beat does eventually connect to Elasticsearch:
{"log.level":"info","@timestamp":"2023-03-22T08:54:21.468Z","message":"Connection to backoff(elasticsearch(https://$domain.europe-west1.gcp.cloud.es.io:443)) established","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"log-default","type":"log"},"log":{"source":"log-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"publisher_pipeline_output","log.origin":{"file.line":147,"file.name":"pipeline/client_worker.go"},"ecs.version":"1.6.0"}
The Elastic Agent uses the Metricbeat beat stats module to collect metrics for the Beats it starts. Until those Beats connect to Elasticsearch, the agent logs will be full of errors like the one below that aren't particularly helpful. The Beats only obtain a cluster UUID when they publish their first event, so if, for example, there is a log source that never updates or is slow to change, this error can appear in the agent logs quite frequently.
This error is coming from this code:
https://github.com/elastic/beats/blob/64f98cacc874e225ffb66c0f29741c66ff841636/metricbeat/module/beat/stats/stats.go#L79-L100
Why do we need an ES cluster UUID to collect beat stats? Is there a way to bypass this or suppress this warning?