elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[pkg/component] - A component's health should also depend on health of units #5386

Open VihasMakwana opened 2 months ago

VihasMakwana commented 2 months ago

Describe the enhancement:

Describe a specific use case for the enhancement or feature:

What is the definition of done?

elasticmachine commented 2 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

VihasMakwana commented 2 months ago

Please share your thoughts!

blakerouse commented 2 months ago

I think the original idea on the split health status of an overall component versus a unit was that we can see the component itself is healthy but the unit that is running is not.

I think in practice most users just review the component health and don't look at individual units. I think taking an aggregated approach of making the unit status reflect the component status does make sense.

I am +1 for this type of change, if done correctly. I don't want to lose the context of the component health, so maybe we need two status levels for a component: the overall health of the component (including the aggregation of the units), and a single health state for the component's communication with the Elastic Agent.
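To make the two-level idea concrete, here is a minimal sketch in Go. It is not the agent's actual code: the `State`, `ComponentStatus`, `Comm`, `Units`, and `Overall` names are all hypothetical, and the "worst state wins" aggregation rule is an assumption chosen for illustration.

```go
package main

import "fmt"

// State is a hypothetical health state, ordered so that a larger value is worse.
type State int

const (
	Healthy State = iota
	Degraded
	Failed
)

// String renders the state for display.
func (s State) String() string {
	switch s {
	case Healthy:
		return "HEALTHY"
	case Degraded:
		return "DEGRADED"
	case Failed:
		return "FAILED"
	}
	return "UNKNOWN"
}

// ComponentStatus keeps both levels separate (hypothetical type).
type ComponentStatus struct {
	Comm  State            // health of the component's communication with the agent
	Units map[string]State // per-unit health, keyed by an illustrative unit ID
}

// Overall aggregates the communication state and all unit states.
// Assumption: the worst state wins.
func (c ComponentStatus) Overall() State {
	worst := c.Comm
	for _, s := range c.Units {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	c := ComponentStatus{
		Comm:  Healthy,
		Units: map[string]State{"input-1": Healthy, "output-1": Failed},
	}
	// The component still talks to the agent, but one unit has failed.
	fmt.Printf("comm=%s overall=%s\n", c.Comm, c.Overall())
}
```

With this shape, a user looking only at the overall value sees the failed unit, while the communication state remains available as separate context.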

cmacknz commented 2 months ago

> I think in practice most users just review the component health and don't look at individual units. I think taking an aggregated approach of making the unit status reflect the component status does make sense.

From a user perspective I agree. The only case that worries me is the upgrade watcher:

https://github.com/elastic/elastic-agent/blob/d8bdd71429bf57ee60e55f927d94fed357b12516/internal/pkg/agent/application/upgrade/watcher.go#L222-L228

If we made this change today, and a single failed unit set the component state to failed, the agent would begin rolling back upgrades because of unit-level errors. We need to decide whether this is the behavior we want. My preference is to leave this unchanged, so that we ignore unit errors when deciding to roll back.
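As a rough illustration of that preference, the sketch below bases a rollback decision only on the component-level state and deliberately ignores unit states. It is not the code in `watcher.go`; the `Component` type, its fields, and `shouldRollback` are hypothetical names used only for this example.

```go
package main

import "fmt"

// State is a hypothetical health state.
type State int

const (
	Healthy State = iota
	Degraded
	Failed
)

// Component is an illustrative stand-in for a component's reported status.
type Component struct {
	ID    string
	State State            // component-level state, not aggregated from units
	Units map[string]State // unit states, intentionally unused for rollback
}

// shouldRollback reports whether any component itself has failed.
// Unit-level failures are ignored on purpose.
func shouldRollback(components []Component) bool {
	for _, c := range components {
		if c.State == Failed {
			return true
		}
	}
	return false
}

func main() {
	comps := []Component{
		{ID: "comp-a", State: Healthy, Units: map[string]State{"unit-1": Failed}},
	}
	// A failed unit alone does not trigger a rollback in this sketch.
	fmt.Println(shouldRollback(comps))
}
```

If the component state were later changed to aggregate unit health, a check like this would need to read the non-aggregated state to keep the same behavior.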