Agent's Coordinator regenerates its component model whenever it receives a change to its policy or variables. Whether this generation succeeds depends on both the variables and the policy -- some policy updates may succeed with one set of variables but not another. An example where this becomes a serious problem is the following (abbreviated) input config:
This always expands to a bad policy because the EQL syntax has an error: if the user wants to check their input flag, they need to use ${env_input_enabled} == "true" instead.
Now suppose this policy is sent to an Agent that doesn't yet know its value for kubernetes.pod.ip (or whatever other context variable the config depends on). Agent silently skips any inputs with missing variables, and it stops checking the rest of the policy as soon as it finds one, so the condition field isn't validated. This policy change will generate a valid component model that omits this input, and it will be reported to Fleet as successful.
If the Kubernetes metadata is then refreshed, producing new variables, Agent will try again to generate its component model, and will fail when it reaches condition. It will then enter an unhealthy state no matter what the values of the previously missing variables are.
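The sequence above can be reproduced with a toy model. This is only a sketch, not Agent's actual code: `render_input`, the field names, and the regex-based condition check are hypothetical stand-ins (real EQL parsing is much richer), and the malformed condition string is invented for illustration.

```python
import re

VAR = re.compile(r"\$\{([^}]+)\}")
# Stand-in grammar (assumption): a condition is `<term> ==|!= <term>`.
TERM = r'(?:\$\{[^}]+\}|"[^"]*")'
CONDITION = re.compile(rf"\s*{TERM}\s*(?:==|!=)\s*{TERM}\s*")


class InvalidCondition(Exception):
    """Raised when a condition does not match the stand-in grammar."""


def render_input(fields, variables):
    """Process fields in order, mimicking the behavior described above."""
    rendered = {}
    for key, value in fields.items():
        # A missing variable ends processing of this input immediately,
        # so later fields (such as the condition) are never examined.
        if any(name not in variables for name in VAR.findall(value)):
            return None
        if key == "condition" and not CONDITION.fullmatch(value):
            raise InvalidCondition(value)
        rendered[key] = VAR.sub(lambda m: str(variables[m.group(1)]), value)
    return rendered


# Hypothetical input: one field needs a context variable, then a bad condition.
fields = {
    "host": "${kubernetes.pod.ip}:8080",
    "condition": '${env_input_enabled} is "true"',
}

# Pod IP not yet known: the input is skipped and generation "succeeds".
print(render_input(fields, {"env_input_enabled": "true"}))

# Variables refreshed: the same policy now fails, whatever the IP is.
try:
    render_input(fields, {"env_input_enabled": "true", "kubernetes.pod.ip": "10.0.0.7"})
except InvalidCondition as err:
    print("generation failed:", err)
```

The policy never changed between the two calls; only the variables did, which is why the failure surfaces long after the policy was acknowledged.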
The core problem here is that our AST processing that generates the component model depends on the current values of the variables -- this error could be detected and reported when we first receive the policy change, but we only verify the parts of the policy that are in active use. Instead, we should validate/preprocess the whole policy regardless of what the variables are, leaving the variable substitution for last, so we know that we can still produce a well-formed component model for any variables we are given. (This doesn't guarantee that the resulting components will always be healthy, but it guarantees that we at least have an unambiguous configuration to give them.)
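A minimal sketch of that two-phase approach, assuming a simplified policy shape: `validate_policy` and `render_policy` are hypothetical names, the regex is a stand-in for real EQL parsing, and textual substitution stands in for however Agent actually binds variables.

```python
import re

VAR = re.compile(r"\$\{([^}]+)\}")
# Stand-in grammar (assumption): a condition is `<term> ==|!= <term>`.
TERM = r'(?:\$\{[^}]+\}|"[^"]*")'
CONDITION = re.compile(rf"\s*{TERM}\s*(?:==|!=)\s*{TERM}\s*")


class PolicyError(Exception):
    """Raised when a policy is malformed regardless of any variables."""


def validate_policy(inputs):
    """Phase 1: check every condition with ${...} references left in place."""
    for index, fields in enumerate(inputs):
        condition = fields.get("condition")
        if condition is not None and not CONDITION.fullmatch(condition):
            raise PolicyError(f"input {index}: bad condition {condition!r}")


def render_policy(inputs, variables):
    """Phase 2: substitute variables, skipping inputs with unknown ones."""
    rendered = []
    for fields in inputs:
        names = {n for value in fields.values() for n in VAR.findall(value)}
        if not all(name in variables for name in names):
            continue  # safe to skip: the input was already validated
        rendered.append({
            key: VAR.sub(lambda m: str(variables[m.group(1)]), value)
            for key, value in fields.items()
        })
    return rendered


good = [{"host": "${kubernetes.pod.ip}:8080",
         "condition": '${env_input_enabled} == "true"'}]
bad = [{"host": "${kubernetes.pod.ip}:8080",
        "condition": '${env_input_enabled} is "true"'}]

validate_policy(good)  # passes without needing any variables
try:
    validate_policy(bad)  # rejected when the policy arrives
except PolicyError as err:
    print("policy rejected:", err)
```

Because phase 1 never consults the variables, a policy that passes it cannot later fail for syntactic reasons when the variables change; phase 2 can then skip or include inputs freely.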
Note: this issue had different symptoms prior to 8.8. In older versions, invalid EQL syntax wasn't reported as an error, but instead silently evaluated to false (changed in this PR). In that case, this policy wouldn't report an explicit error, but would instead silently skip the configured input no matter what variables were set.
Related issues: