elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Fleet] Improve telemetry around number of Elastic Agent #184570

Open jlind23 opened 3 months ago

jlind23 commented 3 months ago

Today, our telemetry reports a field, agents_per_version.count, in the fleet-agents index. This count includes Elastic Agents in every status (healthy, inactive, offline, unenrolled, ...).

To build more meaningful telemetry, this field should only contain:

Inactive and unenrolled should be removed.
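
To make the proposal concrete, here is a minimal sketch of a per-version count that skips inactive and unenrolled agents. The `AgentRecord` type, the `countAgentsPerVersion` function, and the lowercase status strings are hypothetical names chosen for illustration; this is not how Fleet's telemetry collector is actually written, and the status list only covers the statuses named in this thread.

```typescript
// Hypothetical shapes for illustration only; not Fleet's actual types.
type AgentStatus = 'healthy' | 'unhealthy' | 'offline' | 'inactive' | 'unenrolled';

interface AgentRecord {
  version: string;
  status: AgentStatus;
}

// Statuses this issue proposes to drop from agents_per_version.count.
const EXCLUDED_STATUSES: ReadonlySet<AgentStatus> = new Set<AgentStatus>([
  'inactive',
  'unenrolled',
]);

// Count agents per version, skipping any agent whose status is excluded.
function countAgentsPerVersion(
  agents: AgentRecord[],
  excluded: ReadonlySet<AgentStatus> = EXCLUDED_STATUSES
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const agent of agents) {
    if (excluded.has(agent.status)) continue;
    counts[agent.version] = (counts[agent.version] ?? 0) + 1;
  }
  return counts;
}
```

Passing the excluded set as a parameter keeps the open question about offline agents separate from the counting logic: adding or removing `'offline'` is a one-line change at the call site.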

Question:

cc @kpollich @nimarezainia

elasticmachine commented 3 months ago

Pinging @elastic/fleet (Team:Fleet)

kpollich commented 3 months ago

What should we do with offline Elastic Agents?

Is it useful for us to know how many offline agents there are? If so, is it useful for us to know how many offline agents there are per version?

jlind23 commented 3 months ago

> Is it useful for us to know how many offline agents there are? If so, is it useful for us to know how many offline agents there are per version?

I don't think it is useful, as they will eventually become healthy/online. One thing that would be useful is the ability to see whether agents are staying offline without ever being considered inactive.

kpollich commented 3 months ago

Agree with the above. I think we can simply not include offline agents.

cmacknz commented 3 months ago

> Is it useful for us to know how many offline agents there are? If so, is it useful for us to know how many offline agents there are per version?

Agents are not always online. For agents deployed on end-user machines (like our InfoSec agents), it is typical for those machines to be turned off or hibernating outside the core work hours of the employee who uses the machine.

This will result in the total count of agents fluctuating with the standard work hours of various time zones.

A worst-case example would be if we monitored a 5000-employee customer who strictly worked 9-5 hours in one timezone. We would observe them having 5000 online agents for 8 hours a day and 0 online agents for the other 16 hours.

Excluding inactive agents makes sense, but excluding offline agents is going to be misleading.
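
A rough simulation of that worst case, as a sketch only: it assumes a single timezone, a strict 09:00 to 17:00 workday, and that every machine is either fully online or fully offline for a whole hour. `simulateDay` and `HourlySnapshot` are hypothetical names, not anything from Fleet.

```typescript
// Hypothetical worst case from this comment: 5000 agents, one timezone, strict 9-5.
const TOTAL_AGENTS = 5000;
const WORK_START_HOUR = 9; // 09:00, inclusive
const WORK_END_HOUR = 17; // 17:00, exclusive

interface HourlySnapshot {
  hour: number;
  online: number;
  offline: number;
}

// Produce one snapshot per hour of the day for a fleet that is only on during work hours.
function simulateDay(total: number, startHour: number, endHour: number): HourlySnapshot[] {
  const snapshots: HourlySnapshot[] = [];
  for (let hour = 0; hour < 24; hour++) {
    const atWork = hour >= startHour && hour < endHour;
    snapshots.push({
      hour,
      online: atWork ? total : 0,
      offline: atWork ? 0 : total,
    });
  }
  return snapshots;
}

const day = simulateDay(TOTAL_AGENTS, WORK_START_HOUR, WORK_END_HOUR);

// Counting online agents only: the reported total swings between 5000 and 0.
const onlineOnly = day.map((s) => s.online);
const hoursReportingZero = onlineOnly.filter((count) => count === 0).length; // 16

// Counting online and offline agents together: the reported total stays at 5000.
const onlinePlusOffline = day.map((s) => s.online + s.offline);
const isStable = onlinePlusOffline.every((count) => count === TOTAL_AGENTS); // true

console.log({ hoursReportingZero, isStable });
```

A telemetry snapshot taken during any of those 16 hours would report zero agents for this customer if offline agents were excluded.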

kpollich commented 3 months ago

> A worst-case example would be if we monitored a 5000-employee customer who strictly worked 9-5 hours in one timezone. We would observe them having 5000 online agents for 8 hours a day and 0 online agents for the other 16 hours.

Fair point, and I think this example illustrates well the potential problems with how we analyze agent counts.

If we're counting unhealthy agents we also need to count offline agents, because offline indicates that an unhealthy agent has been unhealthy for a long enough period of time. I think for our counting purposes these two states are equivalent.
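
One way to read "equivalent for counting purposes" is as a status-to-bucket mapping, sketched below. The bucket names, `toCountBucket`, and `countByBucket` are hypothetical and only cover the statuses mentioned in this thread; this is an illustration of the idea, not Fleet's implementation.

```typescript
// Hypothetical status and bucket names for illustration only.
type AgentStatus = 'healthy' | 'unhealthy' | 'offline' | 'inactive' | 'unenrolled';
type CountBucket = 'healthy' | 'unhealthy';

// Map a status to the bucket it is counted under, or null if it is not counted.
function toCountBucket(status: AgentStatus): CountBucket | null {
  switch (status) {
    case 'healthy':
      return 'healthy';
    case 'unhealthy':
    case 'offline': // treated the same as unhealthy for counting
      return 'unhealthy';
    case 'inactive':
    case 'unenrolled': // excluded from the count entirely
      return null;
  }
}

// Tally a list of agent statuses into the two counted buckets.
function countByBucket(statuses: AgentStatus[]): Record<CountBucket, number> {
  const counts: Record<CountBucket, number> = { healthy: 0, unhealthy: 0 };
  for (const status of statuses) {
    const bucket = toCountBucket(status);
    if (bucket !== null) counts[bucket] += 1;
  }
  return counts;
}
```

With this mapping, an agent that drops from unhealthy to offline stays in the same bucket, so the per-version total does not change when a machine is switched off.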