elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

[Elastic Agent] The system/metrics input should report itself as degraded when it encounters a permissions error #39737

Open cmacknz opened 1 month ago

cmacknz commented 1 month ago

When the system/metrics input in the Elastic Agent is run as part of an unprivileged agent, it fails to collect metrics for some processes and fails to open some of the files it uses as data sources for certain metricsets. Today these problems are only visible in the Elastic Agent logs. An example from the diagnostics in https://github.com/elastic/elastic-agent/issues/4647 follows below.

{"log.level":"debug","@timestamp":"2024-05-02T05:49:00.137Z","message":"Error fetching PID info for 1216, skipping: GetInfoForPid: could not get all information for PID 1216: error fetching name: OpenProcess failed for pid=1216: Access is denied.\nerror fetching status: OpenProcess failed for pid=1216: Access is denied.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":173,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidIter"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

Use the work done in https://github.com/elastic/beats/issues/39736 to set the input to degraded when it encounters a permissions error like the one above while attempting to read data for a metricset.
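A minimal sketch of what this could look like, assuming the work in https://github.com/elastic/beats/issues/39736 exposes some way for a metricset to push a degraded state to the agent. The `StatusReporter` interface, `fetchProcessMetrics`, and `reportIfPermissionError` names below are hypothetical, not the actual API:

```go
// Package permstatus sketches reporting a degraded state on permission errors.
package permstatus

import (
	"errors"
	"fmt"
	"os"
)

// StatusReporter is a hypothetical stand-in for whatever status-reporting
// hook #39736 makes available to inputs/metricsets.
type StatusReporter interface {
	UpdateStatus(degraded bool, msg string)
}

// fetchProcessMetrics is a placeholder for a metricset Fetch implementation.
func fetchProcessMetrics(pid int) error {
	// Opening /proc/<pid>/stat (or OpenProcess on Windows) can fail with a
	// permission error when the agent runs unprivileged.
	f, err := os.Open(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return err
	}
	defer f.Close()
	// ... parse the file here ...
	return nil
}

// reportIfPermissionError marks the input degraded only for permission
// failures, leaving other errors to the normal error handling path.
func reportIfPermissionError(r StatusReporter, err error) {
	if err != nil && errors.Is(err, os.ErrPermission) {
		r.UpdateStatus(true, fmt.Sprintf("permissions error collecting metrics: %v", err))
	}
}
```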

elasticmachine commented 1 month ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz commented 1 month ago

One concern I have about this input is that we have seen permission-related read failures outside of the unprivileged agent use case; for example, we were unable to read data from endpoint-security because it runs as a protected process on Windows.

We need to be careful that we do not create a plague of degraded agents for benign or known errors that can't be fixed. We may need to make the reporting for this input optional, perhaps on a per-metricset basis.

nimarezainia commented 1 month ago

> One concern I have about this input is that we have seen permission-related read failures outside of the unprivileged agent use case; for example, we were unable to read data from endpoint-security because it runs as a protected process on Windows.
>
> We need to be careful that we do not create a plague of degraded agents for benign or known errors that can't be fixed. We may need to make the reporting for this input optional, perhaps on a per-metricset basis.

For benign errors I agree, but known errors should still be reported. Otherwise we show the agent as healthy when in fact there is an error.

cmacknz commented 1 month ago

Agree we should show the error, but I think we'll want to be able to disable certain types of errors to prevent the system integration from marking every agent as degraded by default for weeks or months, depending on where in the release schedule our fix lands.

In general, not being able to read a system metricset or access a particular PID is worth reporting, but once it is known I don't think the agent needs to be reported as unhealthy continuously, as this will make other, potentially more serious errors harder to notice.

For a recent example (now fixed), every agent with Defend and the System integration installed on Windows would have been reported as permanently degraded because the system integration failed to read information from Defend's PID. This is important to know, but once known it doesn't need to be continuously flagged to the user for every agent they have.

nimarezainia commented 1 month ago

Is there a way we could identify and then throttle these continuous errors? For example, after the 10th occurrence of an error we could flag it at the agent/Fleet level for investigation but revert the agent to healthy, on the understanding that persistent errors of this kind are not necessarily a reason for a degradation warning.
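A rough sketch of that throttling idea, assuming errors can be counted per metricset; the threshold value and all names below are illustrative only, not a proposed API:

```go
// Package permthrottle sketches escalating a permission error only once a
// threshold is crossed, instead of keeping the agent degraded indefinitely.
package permthrottle

import "sync"

const flagThreshold = 10 // e.g. flag for investigation after the 10th error

type errorThrottle struct {
	mu     sync.Mutex
	counts map[string]int // permission errors seen per metricset
}

func newErrorThrottle() *errorThrottle {
	return &errorThrottle{counts: make(map[string]int)}
}

// observe records one permission error and reports whether this occurrence
// should be escalated (exactly once, when the threshold is reached).
func (t *errorThrottle) observe(metricset string) (escalate bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.counts[metricset]++
	return t.counts[metricset] == flagThreshold
}
```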

cmacknz commented 1 month ago

There are a few ways to approach this. One would be a configuration option that keeps the errors in the logs but filters them from marking the agent as degraded in Fleet. I think this is reasonable and can be in scope for this issue.
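One possible shape for that option, purely as a sketch; the `degraded.ignore_for_status` key and the notion of an "error class" are assumptions, not an existing setting:

```go
// Package permfilter sketches a config option that keeps errors in the logs
// but excludes chosen error classes from flipping the unit to degraded.
package permfilter

type degradedConfig struct {
	// IgnoreForStatus lists error classes (e.g. "permissions") that should
	// stay in the logs but never mark the agent as degraded in Fleet.
	IgnoreForStatus []string `config:"degraded.ignore_for_status"`
}

// shouldDegrade reports whether an error of the given class should still
// change the reported health.
func (c degradedConfig) shouldDegrade(errClass string) bool {
	for _, ignored := range c.IgnoreForStatus {
		if ignored == errClass {
			return false
		}
	}
	return true
}
```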

We could additionally rate limit the error messages themselves, or log a periodic summary for all metricsets that encountered permissions errors in a given interval. Rather than 10 metricsets generating 10 individual permissions-error log lines, we would write one log line that lists the 10 affected metricsets. If we want this, I think it needs to be a separate implementation issue.
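A sketch of the summary approach, assuming permission errors can be recorded per metricset as they occur; the interval, logger wiring, and names are all assumptions for illustration:

```go
// Package permsummary sketches emitting one summary log line per interval
// covering every metricset that hit a permissions error, instead of one
// line per metricset.
package permsummary

import (
	"log"
	"sort"
	"strings"
	"sync"
	"time"
)

type summaryLogger struct {
	mu       sync.Mutex
	affected map[string]struct{}
}

func newSummaryLogger(interval time.Duration) *summaryLogger {
	s := &summaryLogger{affected: make(map[string]struct{})}
	go func() {
		for range time.Tick(interval) {
			s.flush()
		}
	}()
	return s
}

// record notes that a metricset hit a permissions error in this interval.
func (s *summaryLogger) record(metricset string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.affected[metricset] = struct{}{}
}

// flush writes a single summary line covering every affected metricset,
// then resets the set for the next interval.
func (s *summaryLogger) flush() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.affected) == 0 {
		return
	}
	names := make([]string, 0, len(s.affected))
	for name := range s.affected {
		names = append(names, name)
	}
	sort.Strings(names)
	log.Printf("permissions errors in %d metricsets this interval: %s",
		len(names), strings.Join(names, ", "))
	s.affected = make(map[string]struct{})
}
```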