elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Agent policy logging level is not applied to agents upgraded from pre-8.15.0 #5116

Open andrzej-stencel opened 2 months ago

andrzej-stencel commented 2 months ago
  1. Create a Hosted deployment with version 8.14.2
  2. Install an agent (8.14.2), enroll it in Fleet.
  3. Upgrade the deployment to 8.15.0.
  4. Upgrade the agent to 8.15.0.
  5. Go to Fleet - Agent policies, click on the policy for the upgraded agent, click on the Settings tab, and change the logging level to debug.
  6. Run sudo elastic-agent logs -f and check whether debug logs are being logged.

Expected output:

Actual output:

Workaround:

To make the policy log level effective, the user needs to go to the agent's "Log" page in Fleet - Agents - my-agent - Logs and click on the "Reset to policy" link at the bottom of the page.

If my understanding is correct, this is because previous versions of the agent write the agent's default log level (info) in their config even if the log level wasn't explicitly set by the user. This effectively prevents the policy log level from taking effect, as the agent-level setting takes precedence and there's no way to tell whether the log level set for the agent was set by the user or is just the default. @pchila please keep me honest here.
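The precedence described above can be sketched as follows. This is a minimal illustration, not the agent's actual code: `effectiveLogLevel` and its parameters are hypothetical names, and an empty string stands for "no level set".

```go
package main

import "fmt"

// effectiveLogLevel is a hypothetical helper (not the agent's real code).
// agentLevel is the per-agent level persisted locally, "" when unset.
// policyLevel is the level from the Fleet agent policy, "" when unset.
func effectiveLogLevel(agentLevel, policyLevel string) string {
	// The agent-level setting takes precedence over the policy.
	if agentLevel != "" {
		return agentLevel
	}
	if policyLevel != "" {
		return policyLevel
	}
	// Fall back to the default level.
	return "info"
}

func main() {
	// Agent upgraded from 8.14.x: "info" was persisted although nobody chose it,
	// so the policy's "debug" never takes effect.
	fmt.Println(effectiveLogLevel("info", "debug")) // prints "info"

	// After "Reset to policy" clears the per-agent level, the policy wins.
	fmt.Println(effectiveLogLevel("", "debug")) // prints "debug"
}
```

The first call shows the reported bug: a persisted default is indistinguishable from a deliberate per-agent choice, so it shadows the policy.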

elasticmachine commented 2 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

ycombinator commented 2 months ago

If my understanding is correct, this is because previous versions of the agent write the agent's default log level (info) in their config even if the log level wasn't explicitly set by the user. This effectively prevents the policy log level from taking effect, as the agent-level setting takes precedence and there's no way to tell whether the log level set for the agent was set by the user or is just the default.

So, just to confirm, if we start with enrolling a fresh 8.15.0 Agent in Fleet (as opposed to upgrading to it from an older version of Agent), this problem does not exist?

cmacknz commented 2 months ago

We either need to fix this (@pchila may have thought about this already), or document this behavior clearly. My concern with just documenting it is that this has to happen every time an agent or group of agents is upgraded, which for large security deployments will be a significant pain (e.g. machines that were offline for weeks due to employee time off coming back with the wrong log level).

pchila commented 2 months ago

We either need to fix this (@pchila may have thought about this already), or document this behavior clearly. My concern with just documenting it is that this has to happen every time an agent or group of agents is upgraded, which for large security deployments will be a significant pain (e.g. machines that were offline for weeks due to employee time off coming back with the wrong log level).

There is no way to properly fix this, as the agent has no settings migration mechanism that is triggered when upgrading (and possibly rolling back), and the default value set by older versions of the agent is indistinguishable from a value set by a user.

cmacknz commented 2 months ago

and the default value set by older version of agent is indistinguishable from a value set by a user

If a user has set the log level on the entire agent policy, it should override anything that was configured before. Why doesn't this solve the problem? Why can't we just override it with the value from the policy?

pchila commented 2 months ago

That's the opposite of what was asked in the original issue, which is that if there's a log level set for a given agent, it takes priority over the policy setting...

cmacknz commented 2 months ago

OK I see now, we also need to support overriding the log level of a single agent via the SETTINGS action, and we have no way to know if that was done. Agent didn't record the source of the log level change, so we can't tell if this is where it came from.

However, Fleet should know whether the log level is currently overridden outside the policy via the value in the per-agent log level box, and could tell us this so we know that we can safely override what was stored before when we get the policy log level. The SETTINGS action exists to allow changing the log level when processing the policy fails. This doesn't mean we can't also include a per-agent override level in the policy itself, so the agent can explicitly compute the resulting log level whenever we get a policy change action, independent of the SETTINGS action.

There are also users who have worked around this missing feature by installing the agents at a specific log level and never changing it. In this case, the log level should only change when they opt in to setting the log level in the policy or via use of the SETTINGS action for an individual agent.
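One way to read this proposal is sketched below, assuming the policy delivered to the agent carries both the policy-wide level and the per-agent override that Fleet knows about. The `policyLogging` type, its fields, and `resolveOnPolicyChange` are hypothetical names for illustration, not an existing Fleet or agent schema.

```go
package main

import "fmt"

// policyLogging is a hypothetical shape for the logging part of a policy.
// An empty string means "not set".
type policyLogging struct {
	PolicyLevel   string // level set on the whole agent policy
	AgentOverride string // per-agent override known to Fleet
}

// resolveOnPolicyChange computes the level to apply when a policy change
// arrives. storedLevel is whatever the agent persisted earlier, whose origin
// (old default, install-time choice, SETTINGS action) is unknown.
func resolveOnPolicyChange(p policyLogging, storedLevel string) string {
	switch {
	case p.AgentOverride != "":
		// Fleet explicitly says this agent is overridden.
		return p.AgentOverride
	case p.PolicyLevel != "":
		// The user opted in to a policy-wide level, so the stored value
		// can safely be replaced.
		return p.PolicyLevel
	default:
		// Nothing was opted in to: keep what the agent already had.
		return storedLevel
	}
}

func main() {
	fmt.Println(resolveOnPolicyChange(policyLogging{PolicyLevel: "debug"}, "info"))
	fmt.Println(resolveOnPolicyChange(policyLogging{PolicyLevel: "debug", AgentOverride: "error"}, "info"))
	fmt.Println(resolveOnPolicyChange(policyLogging{}, "warning"))
}
```

The last branch covers users who installed agents at a specific level and never changed it: their level only moves once they set one in the policy or on the agent.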

pchila commented 2 months ago

If Fleet clears the agent-specific log level, everything will work as expected (as part of the implementation, SettingsAction now allows clearing the log level...)

cmacknz commented 2 months ago

Right, the concern is a user needing to manually do this for every agent that existed before 8.15.0, which, for every existing user, is all of them.

ycombinator commented 2 months ago

@pchila @cmacknz Summarizing the discussion so far, it seems to me that:

  1. We need to fix this problem ASAP, as opposed to documenting the workaround, as the impacted user base is basically all existing users.
  2. There is a way to fix this but it's not trivial — it will require building out a settings migration mechanism.

Is this assessment accurate? Asking so we can prioritize this issue correctly, i.e. most likely in the current sprint itself.

cmacknz commented 2 months ago

We need to fix this problem ASAP, as opposed to documenting the workaround, as the impacted user base is basically all existing users.

ASAP is maybe a bit stronger than is necessary. The feature will not intuitively work as well as it could, but I don't think this breaks anything. It definitely doesn't break anything if the code ships and nobody uses it to set the log level in a policy, so we could just not mention it in the release notes until we figure out the needed polish here.

jlind23 commented 2 months ago

ASAP is maybe a bit stronger than is necessary. The feature will not intuitively work as well as it could, but I don't think this breaks anything. It definitely doesn't break anything if the code ships and nobody uses it to set the log level in a policy, so we could just not mention it in the release notes until we figure out the needed polish here.

If users upgrade their Agent to 8.15.X and only change the log level afterwards, then it would work as expected. This problem only exists if:

Am I correct?

pchila commented 2 months ago

Am I correct?

Not exactly: the issue would manifest on any elastic-agent that is upgraded to 8.15.0, regardless of whether the user tried to use the feature before the upgrade. (Edit after re-reading @jlind23 's comment as I read too fast)

The main issue here is that any agent < 8.15 would write at least info (the default value), or any other log level set via Fleet on the specific agent, as its own log level in fleet.enc. The new feature will not apply the log level specified in the policy if a log level is present in fleet.enc, hence the issue.

We could solve this in a couple of different ways:

  1. Have some config migration mechanism so that when the agent upgrades to 8.15 it can clear the saved value if it's info
  2. Have some Fleet UI flow that would allow the user to select multiple agents (maybe directly by policy) and reset the agent-specific log level in a single action (some clever filter and select from the list of managed agent and an extra action should be enough)
  3. Have some popup asking if the agent-specific log levels should be reset when the user sets the log level for the first time in a policy (it's a slight variation of 2., slightly more user-friendly)

The only option that does not require user intervention is 1. Options 2. and 3. are meant to spare the user the tedium of going through every single agent, opening the details, selecting the Log tab and clicking reset to policy log level. Moreover, 2. and 3. won't work on agents < 8.15 (before this feature, the agent would not allow an empty log level to be set for itself via Fleet), so trying to send an empty log level to an old agent would result in an action error :disappointed: .

As of now, the only workaround is for users to clear the specific log level one agent at a time, or to re-enroll all their agents using a new policy after upgrade, neither of which is really desirable or friendly.
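Option 1 could look roughly like the sketch below: a one-time step on upgrade that clears the stored level only when it equals the old default. Nothing like this exists in the agent per the discussion above; `migrateStoredLevel` and `legacyDefaultLevel` are hypothetical names, and reading or writing fleet.enc is left out entirely.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyDefaultLevel is the level that agents before 8.15 are described as
// writing even when the user never set one.
const legacyDefaultLevel = "info"

// migrateStoredLevel is a hypothetical upgrade-time migration. It returns the
// level to keep and whether the stored value was cleared. An empty level
// means "no agent-specific level", which lets the policy level apply.
func migrateStoredLevel(stored string) (level string, cleared bool) {
	if strings.EqualFold(stored, legacyDefaultLevel) {
		// Assumed to be the implicit default, so drop it.
		return "", true
	}
	// Anything else (debug, warning, ...) must have been set explicitly.
	return stored, false
}

func main() {
	for _, stored := range []string{"info", "debug", ""} {
		level, cleared := migrateStoredLevel(stored)
		fmt.Printf("stored=%q -> level=%q cleared=%v\n", stored, level, cleared)
	}
}
```

The heuristic has a known blind spot: an agent whose level was deliberately set to info via Fleet is treated the same as one that only has the implicit default, so it would start following the policy level after the upgrade.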

cmacknz commented 2 months ago

Have some config migration mechanism so that when the agent upgrades to 8.15 it can clear the saved value if it's info

Why only if it is info? What happens if it is debug and the policy level is error?

We can't rely purely on the log level in the policy because a previous SETTINGS action may have overridden it, correct? Is the SETTINGS action the only case we need to deal with?

pchila commented 2 months ago

Why only if it is info? What happens if it is debug and the policy level is error?

Because INFO is the default log level any agent <= 8.14 would write in fleet.enc. If there's something like a debug or warn log level saved in fleet.enc, somebody set it explicitly.

We can't rely purely on the log level in the policy because a previous SETTINGS action may have overridden it, correct? Is the SETTINGS action the only case we need to deal with?

Correct. And yes, before 8.15, the SETTINGS action is the only way a managed agent could receive a log level.