Open jlind23 opened 3 months ago
Pinging @elastic/fleet (Team:Fleet)
What should we do with offline Elastic Agents?
Is it useful for us to know how many offline agents there are? If so, is it useful for us to know how many offline agents there are per version?
I don't think it is useful, as they will eventually become healthy/online. One useful thing would be the ability to see whether agents are staying offline without ever being considered inactive.
Agree with the above. I think we can simply not include offline agents.
Agents are not always online. For agents deployed on end user machines (like our InfoSec agents) it is typical that those machines would be turned off or hibernating outside core work hours for the employee that uses the machine.
This will result in the total count of agents fluctuating with the standard work hours of various time zones.
A worst-case example would be if we monitored a 5000-employee customer who strictly worked 9-5 hours in one timezone. We would observe them having 5000 online agents for 8 hours a day and 0 online agents for the other 16 hours.
Excluding inactive agents makes sense, but excluding offline agents is going to be misleading.
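To make the worst case above concrete, here is a minimal Python sketch of it. The constants and the function name are assumptions made up for illustration; this is not Fleet or telemetry code, only the arithmetic of the example.

```python
# Hypothetical model of the worst case: every agent is online only during
# 9-5 work hours in a single timezone.
AGENT_COUNT = 5000
WORK_START, WORK_END = 9, 17  # 9:00 to 17:00 local time


def online_agents_at(hour: int) -> int:
    """Return how many agents would report as online at the given hour (0-23)."""
    return AGENT_COUNT if WORK_START <= hour < WORK_END else 0


# Sample the online count once per hour over a full day.
hourly_counts = [online_agents_at(hour) for hour in range(24)]
hours_online = sum(1 for count in hourly_counts if count > 0)
hours_offline = 24 - hours_online
```

A count that only includes online agents would read 5000 for `hours_online` hours and 0 for `hours_offline` hours, even though the deployment itself never changed.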
Fair point, and I think this example illustrates well the potential problems with how we analyze agent counts.
If we're counting unhealthy agents, we also need to count offline agents, because offline indicates that an unhealthy agent has been unhealthy for a long enough period of time. I think for our counting purposes these two states are equivalent.
Today in our telemetry we report a field agents_per_version.count in the fleet-agents index, which contains all Elastic Agent statuses (healthy, inactive, offline, unenrolled, ..). In order to build more meaningful telemetry, this field should only contain:
Inactive and unenrolled should be removed.
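As a rough sketch of what that filtering could look like, the Python below counts agents per version while skipping the two statuses named above. The record shape (`version` and `status` keys), the function name, and the sample data are assumptions for illustration only; they are not the actual telemetry collector or the fleet-agents mapping.

```python
from collections import Counter

# Statuses the proposal says should be removed from the count.
EXCLUDED_STATUSES = {"inactive", "unenrolled"}


def agents_per_version(agents):
    """Count agents per version, skipping excluded statuses.

    `agents` is a hypothetical list of dicts with "version" and "status" keys.
    """
    return Counter(
        agent["version"]
        for agent in agents
        if agent["status"].lower() not in EXCLUDED_STATUSES
    )


# Made-up sample data, only to exercise the function.
sample_agents = [
    {"version": "8.6.0", "status": "healthy"},
    {"version": "8.6.0", "status": "offline"},
    {"version": "8.6.0", "status": "inactive"},
    {"version": "8.5.3", "status": "unenrolled"},
    {"version": "8.5.3", "status": "unhealthy"},
]
counts = agents_per_version(sample_agents)
```

In this sketch offline agents stay in the count, in line with the argument earlier in the thread that dropping them would make the totals follow work hours.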
Question:
cc @kpollich @nimarezainia