elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[Discuss] Ease debugging of Elastic Agent providers #5324

Open jlind23 opened 2 months ago

jlind23 commented 2 months ago

While working on some Kubernetes issues, we were stuck trying to figure out which agents were the leaders. As of today, the only option is to run the following command:

kubectl get leases.coordination.k8s.io -n kube-system | grep elastic-agent

To ease debugging, it would be great to surface this information somewhere in the Kibana UI in order to know:

This prompted a broader discussion about what information each provider should return and make available:

@nimarezainia @strawgate happy to get your thoughts on this.

cc @ycombinator @blakerouse as you recently worked on similar cases.

elasticmachine commented 2 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

nimarezainia commented 2 months ago

@jlind23 in your debugging, what was the workflow? As in, once you found which agent was the nominated leader, what were the next steps you were trying to perform? I'm trying to figure out where this information should reside. Perhaps it can just be part of the local metadata and not something that the UI needs to show.

jlind23 commented 2 months ago

We wanted to understand why some of the data collected exclusively by the leader wasn't being sent, which is why we were looking for the leader.

jlind23 commented 2 months ago

@blakerouse @ycombinator we need to report it through the Elastic Agent metadata. Would you happen to know how complex this would be? Once reported, we can do whatever we want with it in the UI.

strawgate commented 2 months ago

With the Helm Chart, do we actually use leader election?

jlind23 commented 2 months ago

According to @pkoutsovasilis' demo I don't think we do.

blakerouse commented 2 months ago

@jlind23 The leader election is its own provider and not something that has any connection to Fleet or to updating the overall core state of the Elastic Agent. It will be difficult to connect the two, so it's not a quick change.

It should be possible to see that the state_* units are only running on the Elastic Agent that won the leader election.

pkoutsovasilis commented 2 months ago

Hi 👋 So for the built-in kubernetes integration and standalone mode, the Helm chart disables leader election because it deploys multiple agents: a daemonset for node-scope metrics and container logs, a deployment for cluster-scope metrics, and a statefulset that runs a kube-state-metrics container alongside the agent to monitor kube-state-metrics. With that layout there is no need for leader election. By contrast, the same "topology" isn't possible for managed agents through Fleet, since the config is then controlled by Fleet, so in that scenario the Helm chart doesn't disable it.

Thinking out loud: since an agent instance knows whether it is the leader or not, and when it won/lost the election, can't this be propagated to Kibana?!

blakerouse commented 2 months ago

> Hi 👋 So for the built-in kubernetes integration and standalone mode, the Helm chart disables leader election because it deploys multiple agents: a daemonset for node-scope metrics and container logs, a deployment for cluster-scope metrics, and a statefulset that runs a kube-state-metrics container alongside the agent to monitor kube-state-metrics. With that layout there is no need for leader election. By contrast, the same "topology" isn't possible for managed agents through Fleet, since the config is then controlled by Fleet, so in that scenario the Helm chart doesn't disable it.

It is actually possible to have the Elastic Agent deployed as a deployment with kube-state-metrics and enrolled into Fleet if that Elastic Agent was enrolled into a custom policy that only enabled state_* metrics. Another option would be to set an ENV on the container and then add a condition on the integration for that ENV, so that only the container with that ENV variable would run the state_* metrics.

Just want to make it clear that it is possible, but with the way the integration and the manifests are currently designed, it doesn't operate that way.

This is not a limitation of the Elastic Agent; it's just a limitation of how the manifests and integrations have been designed.

> Thinking out loud: since an agent instance knows whether it is the leader or not, and when it won/lost the election, can't this be propagated to Kibana?!

It is absolutely possible, but not something that is directly wired into the Elastic Agent currently. If we wanted to add this information to Kibana, it might be better to add extra information from other providers as well. Possibly each provider could publish a status (just like components). That would also allow, say, the kubernetes provider in a non-kubernetes environment to report that it's not running because it's unable to connect.

I think that also brings about the ability to configure providers in Fleet. Possibly this just highlights that we should make providers a top-level thing in Fleet.
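A rough sketch of what per-provider status could look like, in Go. Everything here (the `State` values, `ProviderStatus`, `StatusRegistry`) is hypothetical and only illustrates the shape of the idea; it is not how the Elastic Agent currently models providers or components.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// State is a hypothetical provider health value, loosely modeled on the
// idea of component states.
type State string

const (
	StateHealthy State = "HEALTHY"
	StateFailed  State = "FAILED"
)

// ProviderStatus is what a provider would publish (hypothetical shape).
type ProviderStatus struct {
	State   State
	Message string
	Details map[string]string // free-form, e.g. "leader": "true"
}

// StatusRegistry (hypothetical) keeps the latest status per provider name.
type StatusRegistry struct {
	mu       sync.RWMutex
	statuses map[string]ProviderStatus
}

func NewStatusRegistry() *StatusRegistry {
	return &StatusRegistry{statuses: make(map[string]ProviderStatus)}
}

// Report stores the latest status for a provider, replacing any previous one.
func (r *StatusRegistry) Report(provider string, s ProviderStatus) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.statuses[provider] = s
}

// Snapshot returns a shallow copy of all statuses, so callers can iterate
// without holding the lock.
func (r *StatusRegistry) Snapshot() map[string]ProviderStatus {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make(map[string]ProviderStatus, len(r.statuses))
	for name, s := range r.statuses {
		out[name] = s
	}
	return out
}

func main() {
	reg := NewStatusRegistry()
	reg.Report("kubernetes", ProviderStatus{State: StateFailed, Message: "unable to connect to the cluster"})
	reg.Report("kubernetes_leaderelection", ProviderStatus{
		State:   StateHealthy,
		Details: map[string]string{"leader": "true"},
	})

	snap := reg.Snapshot()
	names := make([]string, 0, len(snap))
	for name := range snap {
		names = append(names, name)
	}
	sort.Strings(names) // stable output order
	for _, name := range names {
		fmt.Printf("%s: %s %s %v\n", name, snap[name].State, snap[name].Message, snap[name].Details)
	}
}
```

With something along these lines, leader information would just be one detail of the `kubernetes_leaderelection` provider's status rather than a special case.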

pkoutsovasilis commented 2 months ago

> It is actually possible to have the Elastic Agent deployed as a deployment with kube-state-metrics and enrolled into Fleet if that Elastic Agent was enrolled into a custom policy that only enabled state_* metrics. Another option would be to set an ENV on the container and then add a condition on the integration for that ENV, so that only the container with that ENV variable would run the state_* metrics.

yep, I have done such an enrollment, so it is possible; however, somebody can enable other metrics in the integration, which might result in undesired effects, and there is no way to limit that, at least as far as I can tell

> Just want to make it clear that it is possible, but with the way the integration and the manifests are currently designed, it doesn't operate that way.

yep 100% agree; the reason we took that decision with the Helm chart (not disabling leader election for managed mode) wasn't a limitation of Agent but rather how an integration, at least as of now, gets applied holistically

> It is absolutely possible, but not something that is directly wired into the Elastic Agent currently. If we wanted to add this information to Kibana, it might be better to add extra information from other providers as well. Possibly each provider could publish a status (just like components). That would also allow, say, the kubernetes provider in a non-kubernetes environment to report that it's not running because it's unable to connect.

> I think that also brings about the ability to configure providers in Fleet. Possibly this just highlights that we should make providers a top-level thing in Fleet.

yep being able to configure providers in Fleet and expose them like components with a status does sound like a good addition to explore that could be helpful

jlind23 commented 2 months ago

> It is absolutely possible, but not something that is directly wired into the Elastic Agent currently. If we wanted to add this information to Kibana, it might be better to add extra information from other providers as well. Possibly each provider could publish a status (just like components). That would also allow, say, the kubernetes provider in a non-kubernetes environment to report that it's not running because it's unable to connect. I think that also brings about the ability to configure providers in Fleet. Possibly this just highlights that we should make providers a top-level thing in Fleet.

I am leaning towards updating this issue to focus on each provider and make sure it returns the right set of information.

blakerouse commented 2 months ago

> > It is absolutely possible, but not something that is directly wired into the Elastic Agent currently. If we wanted to add this information to Kibana, it might be better to add extra information from other providers as well. Possibly each provider could publish a status (just like components). That would also allow, say, the kubernetes provider in a non-kubernetes environment to report that it's not running because it's unable to connect. I think that also brings about the ability to configure providers in Fleet. Possibly this just highlights that we should make providers a top-level thing in Fleet.

> I am leaning towards updating this issue to focus on each provider and make sure it returns the right set of information.

This would actually be more in line with OTel as well, as each extension can also report a status. This alignment will help the transition over time.