elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Agent partially applies new fleet-server TLS configuration #5888

Open AndersonQ opened 2 days ago

AndersonQ commented 2 days ago

The Elastic Agent can receive proxy and TLS configuration for fleet-server through its CLI (during install/enroll) and through the policy. The configuration from the policy takes precedence over the CLI; however, an empty configuration from the policy does not change the current configuration.

If the Elastic Agent is installed with a proxy using mTLS and the policy has another proxy configured with simple/one-way TLS, the agent will apply the new proxy and CA, but keep the old certificate, certificate key and key passphrase/key passphrase path. This leaves the agent in an inconsistent state. Such a state might even lead to a failure to start if the old certificate-related configurations are file paths and the files are removed: on start-up, the agent will try to load those files and fail, preventing the agent from starting.

The culprit is https://github.com/elastic/elastic-agent/blob/0580e532c35171f7acd8738a9e9fd61d0c189eb1/internal/pkg/agent/application/actions/handlers/handler_action_policy_change.go#L201-L264

which handles the proxy, TLS Certificate and the TLS CA configurations as different entities, when they should be treated as one.

In other words, if any of them changes (the proxy, the TLS certificate or the CAs), they should all be replaced. If there is any change in policyConfig.Transport, apply the whole new Transport. Right now, as the proxy, TLS certificate and CAs are handled separately, if the CLI defines a proxy with mTLS and the policy has another proxy with one-way TLS, the resulting config will be a mix of both instead of only the new proxy with one-way TLS. The expected result is to have the new proxy URL+headers and the CAs, but the certificate and certificate key must be cleared.
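The proposed behaviour could be sketched roughly as follows. This is only an illustration: the `Transport` struct and the `applyWhole` function are hypothetical simplifications, not the actual types or code in handler_action_policy_change.go.

```go
package main

import "fmt"

// Transport is a simplified, hypothetical stand-in for the agent's
// fleet-server transport settings. It is not the real elastic-agent type.
type Transport struct {
	ProxyURL      string
	CAs           []string
	Certificate   string
	Key           string
	KeyPassphrase string
}

// IsZero reports whether the policy carried no transport settings at all.
func (t Transport) IsZero() bool {
	return t.ProxyURL == "" && len(t.CAs) == 0 &&
		t.Certificate == "" && t.Key == "" && t.KeyPassphrase == ""
}

// applyWhole treats proxy, CAs and client certificate as one entity:
// an empty policy keeps the current settings, anything else replaces
// them entirely, so stale certificate fields cannot survive.
func applyWhole(current, policy Transport) Transport {
	if policy.IsZero() {
		return current
	}
	return policy
}

func main() {
	// Installed via CLI with an mTLS proxy.
	cli := Transport{
		ProxyURL:    "https://old-proxy:8080",
		CAs:         []string{"/etc/old-ca.pem"},
		Certificate: "/etc/client.crt",
		Key:         "/etc/client.key",
	}
	// The policy switches to a one-way TLS proxy.
	policy := Transport{
		ProxyURL: "https://new-proxy:8080",
		CAs:      []string{"/etc/new-ca.pem"},
	}
	got := applyWhole(cli, policy)
	fmt.Printf("proxy=%s cas=%v cert=%q key=%q\n", got.ProxyURL, got.CAs, got.Certificate, got.Key)
}
```

With this approach the client certificate and key from the CLI are dropped as soon as the policy brings any transport change, which matches the expected result described above.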

elasticmachine commented 2 days ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

AndersonQ commented 2 days ago

@pierrehilbert, @nimarezainia, @cmacknz

we need to fully define and document what use-cases we'll cover with the proxy and TLS configuration. There are possible conflicting scenarios when the agent does not receive a proxy, certificate+key+passphrase or CAs.

just like that: https://xkcd.com/1172/

AndersonQ commented 2 days ago

@leehinman I guess you're the one who can help define which use cases are real and which are not.

nimarezainia commented 1 day ago

@pierrehilbert, @nimarezainia, @cmacknz

we need to fully define and document what use-cases we'll cover with the proxy and TLS configuration. There are possible conflicting scenarios when the agent does not receive a proxy, certificate+key+passphrase or CAs.

just like that: https://xkcd.com/1172/

@AndersonQ the use cases that we need to support should be defined already. I wrote a doc some time ago that documents the permutations of how we connect via a proxy; I can forward it if needed. I would say that we need to fix the code to ensure those use cases are functioning properly.

is this about the proxy having a different tls config?

AndersonQ commented 1 day ago

is this about the proxy having a different tls config?

Yes, about a new proxy removing any of the TLS config, either the client certificates or the CA.

Considering the control plane connection, agent -> fleet-server: TLS is configured through the CLI and a proxy might be added. The CLI configuration has precedence over empty/null/absent configuration from the policy.

Depending on the type of the proxy, the agent might still have to perform a TLS handshake with fleet-server (pass-through proxy). Right now the agent considers 3 entities for the control plane connection (consider all TLS config is set as a path to a file): the proxy, the client certificate (certificate, certificate key and key passphrase) and the CAs.

Each of them is evaluated in isolation and follows the precedence: policy > cli > empty/null/absent

which means that, once set, any of them can only be changed, never erased.
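A minimal sketch of this per-entity precedence, to make the "changed but never erased" consequence concrete. The `Settings` struct, `pick` and `mergePerEntity` are hypothetical names used only for illustration and do not mirror the agent's real code.

```go
package main

import "fmt"

// Settings is a hypothetical, flattened view of the three entities:
// the proxy, the CAs and the client certificate (certificate + key).
type Settings struct {
	Proxy string
	CA    string
	Cert  string
	Key   string
}

// pick implements policy > cli > empty for a single value.
func pick(policy, cli string) string {
	if policy != "" {
		return policy
	}
	return cli
}

// mergePerEntity evaluates each entity in isolation. An empty value in
// the policy never erases what the CLI set.
func mergePerEntity(cli, policy Settings) Settings {
	out := Settings{
		Proxy: pick(policy.Proxy, cli.Proxy),
		CA:    pick(policy.CA, cli.CA),
	}
	// The certificate and its key are treated together as one entity.
	if policy.Cert != "" || policy.Key != "" {
		out.Cert, out.Key = policy.Cert, policy.Key
	} else {
		out.Cert, out.Key = cli.Cert, cli.Key
	}
	return out
}

func main() {
	cli := Settings{Proxy: "https://old-proxy:8080", CA: "/etc/old-ca.pem", Cert: "/etc/client.crt", Key: "/etc/client.key"}
	policy := Settings{Proxy: "https://new-proxy:8080", CA: "/etc/new-ca.pem"}
	// The result mixes the new proxy and CA with the old certificate and key.
	fmt.Printf("%+v\n", mergePerEntity(cli, policy))
}
```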

This allows the agent to be individually configured, having different agents with different TLS and proxy configuration in the same policy.

However, right now, once an mTLS proxy is configured, it's not possible to remove the mutual TLS configuration. In a scenario where the user adds a new proxy to the policy with one-way TLS, the agent will keep the client certificate configured. This isn't an immediate problem: if the proxy does not require the agent to send its certificate, the agent will not send it. Therefore, the one-way TLS will work as expected.

The side effect is that if the client certificate settings (certificate, certificate key and key passphrase path) are file paths and the files are deleted, the agent might not start up anymore, as it'll look for those files on start-up.
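To illustrate this failure mode, here is a small sketch using Go's standard `crypto/tls` package. The `loadClientCert` helper and the paths are hypothetical; the agent's real start-up code is more involved, but loading a key pair from paths that no longer exist fails in the same way.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// loadClientCert mimics a start-up step that loads a client certificate
// from file paths. Empty paths mean no client certificate is configured.
func loadClientCert(certPath, keyPath string) (*tls.Certificate, error) {
	if certPath == "" && keyPath == "" {
		return nil, nil
	}
	cert, err := tls.LoadX509KeyPair(certPath, keyPath)
	if err != nil {
		return nil, fmt.Errorf("loading client certificate: %w", err)
	}
	return &cert, nil
}

func main() {
	// Stale paths kept from install time. The files are assumed to have
	// been deleted after the policy moved to one-way TLS.
	_, err := loadClientCert("/nonexistent/client.crt", "/nonexistent/client.key")
	fmt.Println("start-up error:", err)
}
```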

I think it's more about us clearly defining and documenting what can and cannot be done. We have the precedence working as expected: policy > cli > empty/null/absent. From the UI, the proxy and its TLS configuration seem to be one thing, but on the agent side they are not. Also, the agent does not differentiate between TLS for the proxy and TLS directly to fleet-server.

The least intrusive option is to keep it as is. However, I believe we need to document its implications clearly and eventually, if needed, work to support removing proxy and TLS configurations through the policy.

cmacknz commented 1 day ago

From the UI the proxy and its TLS configuration seems to be one thing, but on the agent side it is not. Also the agent does not differentiate between TLS for the proxy and directly to fleet-server.

From a UI perspective, if you are adding a proxy and that proxy requires mTLS, then having the UI present all the TLS config together makes sense. It also is necessary to have a single atomic policy change for all of these parameters at once.

The UI below looks right in the context of adding a new proxy. This is assuming that the certificate and certificate key boxes are the client mTLS certificate and key. If they are, we should make that explicit.

[Image: screenshot of the UI for adding a new proxy]

I think generally if someone is using mTLS it is likely to be a universal requirement of their deployment and not a quirk of a specific proxy, so it is highly unlikely mTLS is going to be suddenly removed. That said, not being able to remove mTLS if it was enabled at install time is a problem that will limit our ability to debug issues.

I suspect the core problem here is that clearing the proxy client cert and key has no effect if agent was configured locally with a client cert and key.

It should be possible for Fleet to tell agent to ignore the install time cert and key. This is a more general problem for parameters we allow to be set at install time that is not specific to this particular issue.

Unless I am missing something, we should open a separate issue to allow giving Fleet complete control over the agent configuration regardless of what got set on the command line at install or enroll time.

AndersonQ commented 4 hours ago

From a UI perspective, if you are adding a proxy and that proxy requires mTLS, then having the UI present all the TLS config together makes sense. It also is necessary to have a single atomic policy change for all of these parameters at once.

It's already atomic in the sense that the agent either applies all changes or none if the validation fails. However, it'll apply the precedence "entity by entity".

I suspect the core problem here is that clearing the proxy client cert and key has no effect if agent was configured locally with a client cert and key.

or through the policy. Not being able to remove a previously set config is the expected and current behaviour.

It should be possible for Fleet to tell agent to ignore the install time cert and key. This is a more general problem for parameters we allow to be set at install time that is not specific to this particular issue. Unless I am missing something, we should open a separate issue to allow giving Fleet complete control over the agent configuration regardless of what got set on the command line at install or enroll time.

That is the dream.

What I believe we need to decide is if, while we do not have this full control from fleet, the agent will treat proxy, CA, client cert as a single entity or keep the current behaviour.

It's my understanding that it makes more sense to make them all a single entity, which, given the proxy address has changed, would allow removing the CAs and the client certificate/key.

AndersonQ commented 4 hours ago

One thing worth pointing out: the agent won't apply a configuration that causes loss of connectivity. Therefore, regardless of the inconsistent state it might end up in, it'll still be able to reach fleet-server.
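That safeguard can be pictured as a validate-then-commit step. The sketch below is an assumption about the general shape only: `Config`, `applyIfReachable` and the `check` callback are hypothetical names, and the real connectivity check in the agent is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical, minimal stand-in for the fleet-server
// connection settings.
type Config struct {
	ProxyURL string
}

// errUnreachable is a sample error a connectivity check might return.
var errUnreachable = errors.New("fleet-server unreachable")

// applyIfReachable commits the candidate configuration only when the
// connectivity check succeeds. Otherwise the current one is kept.
func applyIfReachable(current, candidate Config, check func(Config) error) (Config, error) {
	if err := check(candidate); err != nil {
		return current, fmt.Errorf("new configuration rejected: %w", err)
	}
	return candidate, nil
}

func main() {
	current := Config{ProxyURL: "https://old-proxy:8080"}
	failing := func(Config) error { return errUnreachable }

	got, err := applyIfReachable(current, Config{ProxyURL: "https://bad-proxy:8080"}, failing)
	fmt.Println(got.ProxyURL, err)
}
```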