Open andrzej-stencel opened 2 months ago
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
If my understanding is correct, this is because the previous versions of the agent write the agent's default log level (info) in their config even if the log level wasn't explicitly set by the user. This effectively prevents the policy log level from taking effect, as the agent-level setting has precedence and there's no way to tell whether the log level set for the agent was set by the user or is just the default.
So, just to confirm, if we start with enrolling a fresh 8.15.0 Agent in Fleet (as opposed to upgrading to it from an older version of Agent), this problem does not exist?
We either need to fix this (@pchila may have thought about this already), or document this behavior clearly. My concern with just documenting it is that the workaround has to be repeated every time an agent or group of agents is upgraded, which for large security deployments will be a significant pain (e.g. machines that were offline for weeks due to employee time off coming back with the wrong log level).
There is no way to properly fix this, as the agent has no settings migration mechanism to be triggered when upgrading (and possibly rolling back), and the default value set by an older version of the agent is indistinguishable from a value set by a user.
and the default value set by an older version of the agent is indistinguishable from a value set by a user
If a user has set the log level on the entire agent policy, it should override anything that was configured before. Why doesn't this solve the problem, why can't we just override it with the value from the policy?
That's the opposite of what was asked in the original issue, which was that if there's a log level set for a given agent, it takes priority over the policy setting...
OK I see now, we also need to support overriding the log level of a single agent via the SETTINGS action, and we have no way to know if that was done. Agent didn't record the source of the log level change, so we can't tell if this is where it came from.
However, Fleet should know whether the log level is currently overridden outside the policy via the value in the per agent log level box, and could tell us this so we know that we can safely override what was stored before when we get the policy log level. The SETTINGS action exists to allow changing the log level when processing the policy fails, this doesn't mean we can't also include a per agent override level in the policy itself so agent can explicitly compute the resulting log level whenever we get a policy change action, independent of the SETTINGS action.
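To make the proposal above concrete, here is a minimal sketch of how the agent could compute the resulting level on every policy change if the policy itself carried the per agent override. The field names log_level and agent_override_level are hypothetical, not existing policy fields, and this is not the agent's actual implementation.

```python
from typing import Optional


def level_from_policy(policy: dict, persisted_level: Optional[str] = None) -> str:
    """Compute the log level from the policy alone (illustrative sketch).

    `agent_override_level` and `log_level` are hypothetical keys. The
    previously persisted level is accepted only to show that it is
    deliberately ignored: Fleet is the source of truth for overrides.
    """
    override = policy.get("agent_override_level")
    if override:
        return override
    return policy.get("log_level") or "info"


# A stale persisted "info" no longer masks the policy level.
print(level_from_policy({"log_level": "debug"}, persisted_level="info"))
# An explicit per agent override still wins over the policy level.
print(level_from_policy({"log_level": "debug", "agent_override_level": "error"}))
```

With this shape the SETTINGS action can stay as the escape hatch for when policy processing fails, while normal policy changes never depend on what an older agent wrote to disk.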
There are also users who have worked around this missing feature by installing the agents at a specific log level and never changing it. In this case, the log level should only change when they opt in to setting the log level in the policy or via use of the SETTINGS action for an individual agent.
If Fleet clears the agent-specific log level, everything will work as expected (as part of the implementation, the SETTINGS action now allows clearing the log level...)
Right, the concern is a user needing to manually do this for every agent that existed before 8.15.0, which, for every existing user, is all of them.
@pchila @cmacknz Summarizing the discussion so far, it seems to me that:
Is this assessment accurate? Asking so we can prioritize this issue correctly, i.e. most likely in the current sprint itself.
We need to fix this problem ASAP, as opposed to documenting the workaround, as the impacted user base is basically all existing users.
ASAP is maybe a bit stronger than is necessary. The feature will not intuitively work as well as it could, but I don't think this breaks anything. It definitely doesn't break anything if the code ships and nobody uses it to set the log level in a policy, so we could just not mention it in the release notes until we figure out the needed polish here.
If users upgrade their Agent to 8.15.X and change the log level only afterwards, then it would work as expected. This problem only exists if:
Am I correct?
Not exactly: the issue would manifest on any elastic-agent that is upgraded to 8.15.0, regardless of whether the user tried to use the feature before the upgrade. (Edit after re-reading @jlind23 's comment as I read too fast)
The main issue here is that any agent < 8.15 would write at least info (default value) or any other log level set via Fleet on the specific agent as log level for itself in fleet.enc. The new feature will not apply the log level specified in the policy if a log level is present in fleet.enc, hence the issue.
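A minimal sketch of the precedence described above may help. It assumes the order is per agent level (as persisted in fleet.enc), then policy level, then the default; the function and parameter names are made up for illustration and are not the agent's real code.

```python
from typing import Optional


def effective_log_level(
    agent_level: Optional[str],
    policy_level: Optional[str],
    default: str = "info",
) -> str:
    """Assumed precedence: per agent level, then policy level, then default."""
    if agent_level:
        return agent_level
    if policy_level:
        return policy_level
    return default


# Agent upgraded from < 8.15: "info" was persisted although nobody set it,
# so the policy level "debug" is never applied.
print(effective_log_level("info", "debug"))  # prints "info"

# Freshly enrolled 8.15 agent: nothing persisted, the policy level applies.
print(effective_log_level(None, "debug"))  # prints "debug"
```

The first call is the upgrade case from this issue: the persisted default is treated exactly like an explicit per agent setting.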
We could solve this in a couple of different ways:
The only option that does not require a user intervention is 1.
Options 2. and 3. are meant to spare the user the tedium of going through every single agent, opening its details, selecting the Log tab, and clicking reset to policy log level.
Moreover, 2. and 3. won't work on agents < 8.15 (before this feature, the agent would not allow an empty log level to be set for itself via Fleet), so trying to send an empty log level to an old agent would result in an action error :disappointed: .
As of now, the only workaround is for users to clear the specific log level one agent at a time or re-enroll all their agents using a new policy after upgrade, neither of which is really desirable or friendly.
Have some config migration mechanism so that when the agent upgrades to 8.15 it can clear the saved value if it's info
Why only if it is info? What happens if it is debug and the policy level is error?
We can't rely purely on the log level in the policy because a previous SETTINGS action may have overridden it, correct? Is the SETTINGS action the only case we need to deal with?
Why only if it is info? What happens if it is debug and the policy level is error?
Because INFO is the default log level any agent <= 8.14 would write in fleet.enc. If there's something like debug or warn log level saved in fleet.enc, somebody set it explicitly.
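The "clear it only if it's info" idea could look roughly like the sketch below. It is purely illustrative: the agent has no such migration hook today, and migrate_persisted_level and LEGACY_DEFAULT are hypothetical names.

```python
from typing import Optional

# Default level that agents <= 8.14 persisted even when the user never set one.
LEGACY_DEFAULT = "info"


def migrate_persisted_level(persisted: Optional[str]) -> Optional[str]:
    """Hypothetical one-time step run when upgrading to 8.15.

    Returns None (no per agent level) when the stored value equals the
    legacy default, so the policy level can take effect afterwards.
    Any other value must have been set explicitly and is kept.
    """
    if persisted is not None and persisted.lower() == LEGACY_DEFAULT:
        return None
    return persisted


print(migrate_persisted_level("info"))   # prints None
print(migrate_persisted_level("debug"))  # prints debug
```

The trade-off is the one already raised in this thread: a user who explicitly chose info for a single agent cannot be told apart from the default, so that choice would be cleared too.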
We can't rely purely on the log level in the policy because a previous SETTINGS action may have overridden it, correct? Is the SETTINGS action the only case we need to deal with?
Correct. And yes, before 8.15, SETTINGS action is the only way a managed agent could receive a log level
debug.
sudo elastic-agent logs -f
and check if debug logs are being logged.
Expected output:
Actual output:
Workaround:
To make the policy log level effective, the user needs to go to the agent's "Log" page in Fleet - Agents - my-agent - Logs and click on the "Reset to policy" link at the bottom of the page.
If my understanding is correct, this is because the previous versions of the agent write the agent's default log level (info) in their config even if the log level wasn't explicitly set by the user. This effectively prevents the policy log level from taking effect, as the agent-level setting has precedence and there's no way to tell whether the log level set for the agent was set by the user or is just the default. @pchila please keep me honest here.