elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Other
124 stars 134 forks source link

Configuring agent.logging.level through agent policy do not work for managed agents #2851

Closed nchaulet closed 4 months ago

nchaulet commented 1 year ago

Issue

We now have a feature that allow to add arbitrary key to the agent policy sent to the agent, https://github.com/elastic/kibana/pull/159414#issuecomment-1587160242

I tried to set agent log level this way to Fleet managed agents, and it seems the agent is still logging at the default log level, when I run elastic-agent inspect I saw that my agent.logging.level: debug config is here, but it seems not used by the agent

Definition of done

cmacknz commented 1 year ago

Seems like the logging level in the policy is ignored in the policy change handler, we always persist the log level we started with.

https://github.com/elastic/elastic-agent/blob/409682e5efba99f6b4ff034a503b1d0827a67b0b/internal/pkg/agent/application/actions/handlers/handler_action_policy_change.go#L204

https://github.com/elastic/elastic-agent/blob/409682e5efba99f6b4ff034a503b1d0827a67b0b/internal/pkg/agent/application/actions/handlers/handler_action_policy_change.go#L274

We only support changing the log level through the SETTINGS action but I have no idea why we did it this way.

https://github.com/elastic/elastic-agent/blob/409682e5efba99f6b4ff034a503b1d0827a67b0b/internal/pkg/agent/application/actions/handlers/handler_action_settings.go#L53

jlind23 commented 1 year ago

But agent debug logs can still be turned out with the Ui change and not the api call Nicolas mentioned an I correct?

cmacknz commented 1 year ago

Yes, this is unintuitive but you can still change the log level from the UI with the SETTINGS action.

jlind23 commented 1 year ago

So at least we have a workaround for this..

pchila commented 1 year ago

@cmacknz I had a look at the code and I guess that we ignore the log level in the handler_policy_change_action due to an issue with deserialization: the log level is defined as an int8 within agent where 0 corresponds to info level and we use a map to convert the string we receive in the policy: when unmarshaling a policy that does not contain the log level (all of them unless we use the new fleet entry point), we are probably going to get and use the zero value (an empty string) which will likely throw an error (to be tested).

To solve this issue we have to optionally unpack the log level only when the key is specified and fall back on the configured value if it's not. Another possible solution would be to transform the Level into a *Level and then add the current log level value if the specified Level is nil (probably a bit cleaner) but we can't change it easily as it's part of elastic-agent-libs... :disappointed:

cmacknz commented 1 year ago

but we can't change it easily as it's part of elastic-agent-libs...

The two largest users of elastic-agent-libs are Agent and Beats (and things that implement a Beat outside of the Beats repository). As long as the change wouldn't be breaking for Beats we could do this if it makes more sense, and if it is breaking we could just make it an option or variant of the existing behavior.

michalpristas commented 1 year ago

bringing a bit of a history into this. the proposed change should not work in a way how agent is designed. i'm seeing this as a feature request not a bug.

the idea behind POLICY and SETTING was this POLICY contains information about Inputs, Output and Fleet connection info SETTINGS is agent specific setting

when we start thinking about supporting setting Log Level using settings I'd like us to stop for a while and think about how this should behave first.

At this time it' simple, default log level is info, we persist it as a setting. Once there's a need to switch to debug you will switch to debug and we persist it as a configuration option, this overrides default one. Once you're done you'll go back to Info.

Once you introduce setting log level using POLICY at the start we end up with a pair info:nil meaning default using settings is info, and it is not set using policy

then you change policy to use warn ending up with info:warn the question is which one has priority, we dont have information about the source of the first one.

Even if we consider not setting info at start, ending up with nil:warn. We will use warn.

Now setting from Debug SETTINGS will come, we have debug:warn using debug with a higher priority as it is per agent config.

The question is, how we reset per agent config. if we set whatever, we will end up with whatever:warn with whatever always taking priority even with future Policy changes.

When I take a look at diagnostics I should be deterministically say what's the log level based on Policy and Settings without worrying about order.

From my point of view we should either

Getting rid of SETTINGs does not make sense. In the need of debug, changing log level to Debug for potentially 100k agent would create a spike in processing requirement, memory and cost eventually.

@cmacknz offline this week, pinging @joshdover: do we have thought through use cases for this?

joshdover commented 1 year ago

Thanks for the history, @michalpristas. Makes sense why this is behaving this way.

I think this is somewhat clear already, but just to be concise the use cases we have are:

  1. Have a sensible default log level (info)
  2. Set the log level for a group of agents (policy)
    • Example use case: trying to find a infrequent bug that is happening in a pool of agents pulling from an SQS queue
  3. Increase the log verbosity on a single agent as a one-off to debug an issue, without affecting other agents on the policy

IMO the precedence order should be:

In general we should track which agents have a one-off setting applied via the SETTINGS action and should flag those to the user in the UI. This is essentially "drift detection" for agents that are deviating from their policy, and the user should be encouraged to reset those agents back to the policy.

add Inherit log level or Reset function on Fleet UI which would result in pair nil:whatever

Yeah I agree with this, there needs to be some way to reset (3) and start using (2) again.

So in conclusion, I agree this is not a bug, but a new feature and use case: controlling the log level via the agent policy. We've seen this request before and it should be prioritized, but as an enhancement and not a bug.

joshdover commented 1 year ago

@pierrehilbert If you agree, let's deprioritize this one for now.

pierrehilbert commented 1 year ago

Thanks everyone for your insights and I agree with the precedence order mentioned by @joshdover. I changed the issue type and priority according to this new agreement. Still keeping this in the current sprint for now but with a lower priority compare to other items and will see if we need some other adjustments.

michalpristas commented 1 year ago

keep in mind that in order for this to work properly we need reset fucntionality or Policy log level in fleet in place. Without this once you set log level using settings all log levels coming from Policy will be ignored.

from planning perspective it does not make sense to work on this until fleet part is at least scheduled for development

joshdover commented 1 year ago

Agreed, we will need support for this in Fleet. It's likely a small enhancement, but needs to be planned for.

elasticmachine commented 4 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)