The Elasticsearch output should not report itself as degraded based only on the time between events

cmacknz commented 1 year ago

The implementation in https://github.com/elastic/elastic-agent-shipper/issues/174 was incompletely specified. We currently consider the shipper Elasticsearch output degraded whenever it has not written events to Elasticsearch within the past 30 seconds: https://github.com/elastic/elastic-agent-shipper/pull/239

This is an OK proxy for an inability to connect to Elasticsearch, but it does not consider the impact on low-volume log sources. Users could tune the timeout, but this isn't something they've traditionally had to do, and the timeout may lead to false-positive degraded states.

Instead, we should only mark the shipper as degraded when we have not published events for 30 seconds and we have detected an explicit error while attempting to connect to Elasticsearch. For example, this would include connection refused errors, failed DNS lookups, or invalid credentials.

The most common reasons for failing to connect to Elasticsearch would be incorrect proxy configurations, connectivity outages, or invalidated API keys. We should address these cases specifically instead of using a catch-all timeout that makes assumptions about the steady-state event rate.

cmacknz commented 1 year ago

This may have been addressed by https://github.com/elastic/elastic-agent-shipper/pull/296/

@fearful-symmetry do we still consider the shipper ES output as unhealthy if no messages are sent in the absence of any errors during a time period?

fearful-symmetry commented 1 year ago

@cmacknz with the output as it stands now, it should only report unhealthy if it only receives errors over a given period. So, no messages over a given period will not change the health reporting.
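
As a rough illustration of the behavior described here (a sketch under assumptions, not the code from the linked PR), a per-period check could look like the following. `windowStats` and its fields are hypothetical names.

```go
package main

// windowStats (hypothetical) counts publish outcomes seen during one
// reporting period.
type windowStats struct {
	successes int
	failures  int
}

// healthy reports false only when the period contained errors and no
// successful publishes. An empty period (no messages at all) stays healthy.
func (w windowStats) healthy() bool {
	return w.failures == 0 || w.successes > 0
}
```

An empty `windowStats{}` is healthy, which is the case a low-volume log source hits.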

cmacknz commented 1 year ago

Perfect, thanks for noticing + fixing that. Closing this.

The Elasticsearch output should not report itself as degraded based only on the time between events #301