elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Properly support disabling providers and changing init time settings in containers #4145

Closed cmacknz closed 9 months ago

cmacknz commented 9 months ago

Recently there have been several situations where users have needed to disable the leader election provider. Providers in the agent cannot be configured in Fleet today, and even if they could be, changing them would require a restart of the agent, which also isn't supported through Fleet. The providers are initialized when the composable controller is created at startup:

https://github.com/elastic/elastic-agent/blob/9d25f79b159744801daefca36ae227674e28d920/internal/pkg/composable/controller.go#L52-L71

It is possible to disable providers by editing the elastic-agent.yml file read by the agent container when it starts, which on Kubernetes is most easily accomplished by mounting the file from a ConfigMap. Essentially the process is:

  1. Define the elastic-agent.yml as a ConfigMap.
  2. Mount the ConfigMap as a volume in the container.
  3. Ensure the agent is configured to read the configuration file mounted from that volume.
  4. Mounting the ConfigMap as a volume creates a bind mount inside the container. As a result, the agent's attempt to overwrite the initial configuration on enrollment fails (the bind mount is effectively a different file system, so the file cannot be moved) and the flag disabling the providers is preserved.
  5. The enrollment API key and agent ID are stored in the state path, which usually lives on a hostPath volume, so subsequent restarts of the pod do not attempt to re-enroll and skip trying to replace the contents of the ConfigMap. This workaround does not work if the state path is not preserved between restarts of the container.

A simplified example follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-yml
  namespace: kube-system
  labels:
    app: elastic-agent
data:
  agent.yml: |-
    providers.kubernetes_leaderelection.enabled: false
    fleet.enabled: true
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  # ...
spec:
  # ...
  template:
    # ...
    spec:
      containers:
        - name: elastic-agent
          image: docker.elastic.co/beats/elastic-agent:8.12.0
          args: ["-c", "/etc/elastic-agent/agent.yml", "-e"]
          volumeMounts:
            - name: agent-yml
              mountPath: /etc/elastic-agent/agent.yml
              readOnly: true
              subPath: agent.yml
      volumes:
        - name: agent-yml
          configMap:
            defaultMode: 416  # 0640 in octal
            name: agent-yml

The problem we hit is that we unconditionally replace the local agent configuration (in this case a bind-mounted ConfigMap) with our default Fleet configuration, which is just an empty configuration that sets fleet.enabled: true.

We unconditionally try to rotate the file during enrollment, which happens every time the agent container starts when the state path isn't persisted outside of the container file system. This happens in: https://github.com/elastic/elastic-agent/blob/9d25f79b159744801daefca36ae227674e28d920/internal/pkg/agent/cmd/enroll_cmd.go#L173-L178

The code path that does this then reaches the SafeFileRotate call, which is what fails for the bind-mounted ConfigMap: https://github.com/elastic/elastic-agent/blob/9d25f79b159744801daefca36ae227674e28d920/internal/pkg/agent/storage/replace_store.go#L59-L75

What we could do is change this to only attempt to replace the file if it doesn't already contain fleet.enabled: true (or any other key that isn't commented out in our default fleet configuration).

This would allow overriding the initial contents of elastic-agent.yml in a container in general, regardless of whether those settings are available in Fleet.
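The proposed check could be sketched roughly as follows. This is a hypothetical helper (`shouldReplaceConfig` is not a function in the agent codebase), and it only does a line-level scan for the top-level `fleet.enabled: true` key rather than full YAML parsing:

```go
package main

import (
	"fmt"
	"strings"
)

// shouldReplaceConfig reports whether enrollment should overwrite the
// local elastic-agent.yml with the default Fleet configuration.
// Hypothetical sketch of the proposed check: skip the replacement when
// the file already contains an uncommented fleet.enabled: true.
func shouldReplaceConfig(contents string) bool {
	for _, line := range strings.Split(contents, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "#") {
			continue // commented-out keys in the default config don't count
		}
		if trimmed == "fleet.enabled: true" {
			return false // user-provided config already enables Fleet; keep it
		}
	}
	return true
}

func main() {
	userCfg := "providers.kubernetes_leaderelection.enabled: false\nfleet.enabled: true\n"
	defaultCfg := "# fleet.enabled: true\n"
	fmt.Println(shouldReplaceConfig(userCfg))    // false: keep the mounted file
	fmt.Println(shouldReplaceConfig(defaultCfg)) // true: only a commented default
}
```

A real implementation would presumably parse the YAML so that the nested form (`fleet:` / `  enabled: true`) is also recognized; the dotted-key scan above is just the simplest illustration of the idea.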

In the case of providers, even if we did allow disabling leaderelection in the UI it would still be enabled until the agent receives the first policy change from Fleet, so disabling it in the initial configuration like this is likely the preferred route.

elasticmachine commented 9 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

strawgate commented 9 months ago

Would it be possible for us to do this via an environment var instead of mounting a config?

cmacknz commented 9 months ago

The most specific change we could make would be to introduce a variable listing the providers to disable, since they are all enabled by default. Something like ELASTIC_AGENT_DISABLE_PROVIDERS that takes a comma-separated list of providers to disable.

That would only fix this for providers and wouldn't cover any other cases where we'd want to modify the configuration (the initial logging level for one example). We could do both what is described in this issue and add an environment variable as a convenience.

Something like https://github.com/elastic/elastic-agent/issues/3609 to only run providers that exist in the policy sounds nice, but integrations that use leader election would then have to support disabling it in the inputs they use.

olegsu commented 9 months ago

The ability to disable providers is important for our team. We also have another use case: enabling agent.monitoring.traces to ship the agent's self-monitoring data to custom APM instances. Thank you @cmacknz cc @eyalkraft

eyalkraft commented 9 months ago

What we could do is change this to only attempt to replace the file if it doesn't already contain fleet.enabled: true (or any other key that isn't commented out in our default fleet configuration).

This would allow overriding the initial contents of elastic-agent.yml in a container in general, regardless of whether those settings are available in Fleet.

@cmacknz While this will work for now, I'm not sure it's a sufficient solution in our case.

The enabled-by-default nature of providers could be problematic for agentless. Currently we don't want or need any provider enabled, and we don't want any future provider to be implicitly enabled.

Trying to maintain a comprehensive, up-to-date list of the providers in order to disable them like we do here is less than ideal (funny enough, @olegsu and I just noticed we had missed the env provider that you folks recommended we disable (issue)).

What are your thoughts on a configuration option to change this default behavior of the providers? Something like providers_default_disable, which would be false when not specified (keeping the current enabled-by-default behavior) and, when true, would activate only providers that are explicitly configured.
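If adopted, the configuration might look like the fragment below (providers_default_disable is the name proposed in this comment, not an existing setting; the provider names are illustrative):

```yaml
# Hypothetical sketch of the proposed option.
providers_default_disable: true   # flip the default: providers start disabled
providers:
  kubernetes:
    enabled: true                 # only explicitly configured providers run
```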

cmacknz commented 9 months ago

What are your thoughts on a configuration option to change this default behavior of the providers? Something like providers_default_disable, which would be false when not specified (keeping the current enabled-by-default behavior) and, when true, would activate only providers that are explicitly configured.

Something like this makes sense to address the maintenance concern you are raising. The solution in https://github.com/elastic/elastic-agent/issues/3609 is overall nicer in that it gets rid of enabling and disabling entirely, but it would be significantly more work.

cmacknz commented 9 months ago

Currently we don't want or need any provider enabled, and we don't want any future provider to be implicitly enabled.

Taking this into account, https://github.com/elastic/elastic-agent/issues/3609 wouldn't work because it wouldn't stop an integration from accidentally enabling a provider by referencing data it populates in the policy.

So adding a flag to unconditionally disable providers makes sense to me as the best path forward.

blakerouse commented 9 months ago

Something like #3609 to only run providers that exist in the policy sounds nice, but integrations that use leader election would then have to support disabling it in the inputs they use.

@cmacknz Can you explain what you meant here more? If we solved #3609 would it break leader election?

cmacknz commented 9 months ago

I made a bit of an assumption that in an agentless deployment we want things like the host provider disabled permanently, with no way to turn them on, because they would leak information about the machine hosting the agent. This possibly has security implications, and even if it didn't, it would leak implementation details back to the user.