Open AndersonQ opened 3 weeks ago
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@pierrehilbert, @nimarezainia, @cmacknz
we need to fully define and document what use-cases we'll cover with the proxy and TLS configuration. There are possible conflicting scenarios when the agent does not receive a proxy, certificate+key+passphrase or CAs.
just like that: https://xkcd.com/1172/
@leehinman I guess you're the one who can help to define what are real use cases or not
@pierrehilbert, @nimarezainia, @cmacknz
we need to fully define and document what use-cases we'll cover with the proxy and TLS configuration. There are possible conflicting scenarios when the agent does not receive a proxy, certificate+key+passphrase or CAs.
just like that: https://xkcd.com/1172/
@AndersonQ the use cases that we need to support should be defined already. I had a doc sometime ago and permutations of how we connect via a proxy are documented, I can forward if needed. I would say that we need to fix the code to ensure those use cases are functioning properly.
is this about the proxy having a different tls config?
is this about the proxy having a different tls config?
yes, about a new proxy removing any of the tls config. Either the clien certificates or the CA.
Considering the control plane connection: agent -> fleet-server, TLS is configured through the CLI and a proxy might be added The cli configuration has precedence over empty/null/absent configuration from the policy.
Depending on type of the proxy the agent might still have to perform a TLS handshake with fleet-server (pass-through proxy) Right now the agent considers 3 entities for the control plane connections (consider all TLS config is set as a path to a file):
each of them can are evaluated in isolation and follow the precedence: policy > cli > empty/null/absent
which means, once set, any of them can only be changed, never erased
This allows the agent to be individually configured, having different agents with different TLS and proxy configuration in the same policy.
However, right now, once a mTLS proxy is configured, it's not possible to remove the mutual TLS configuration. In a scenario where the user adds a new proxy to the policy with one-way TLS, the agent will keep the client certificate configured. Which isn't an immediate problem as if the proxy does not require the agent to send its certificate, the agent will not send it. Therefore, the on-way TLS will work as expected.
The side effect is that if the client certificate (certificate, certificate key and key passphrase path) are a path and are deleted, the agent might not start up anymore as it'll look for the certificate (certificate, certificate key and key passphrase path).
I think it's more about we well defining and document what can and cannot be done.
We have the precedence working as expected: policy > cli > empty/null/absent
From the UI the proxy and its TLS configuration seems to be one thing, but on the agent side it is not. Also the agent does not differentiate between TLS for the proxy and directly to fleet-server.
The less intrusive is to keep it as is. However I believe we need to document clearly the implications of it and eventually, if needed, work to support removing proxy and TLS configurations thought the policy
From the UI the proxy and its TLS configuration seems to be one thing, but on the agent side it is not. Also the agent does not differentiate between TLS for the proxy and directly to fleet-server.
From a UI perspective, if you are adding a proxy and that proxy requires mTLS, then having the UI present all the TLS config together makes sense. It also is necessary to have a single atomic policy change for all of these parameters at once.
The UI below looks right in the context of adding a new proxy. This is assuming that the certificate and certificate key boxes are the client mTLS certificate and key. If they are, we should make that explicit.
I think generally if someone is using mTLS it is likely to be a universal requirement of their deployment and not a quirk of a specific proxy, so it is highly unlikely mTLS is going to be suddenly removed. That said, not being able to remove mTLS if it was enabled at install time is a problem that will limit our ability to debug issues.
I suspect the core problem here is that clearing the proxy client cert and key has no effect if agent was configured locally with a client cert and key.
It should be possible for Fleet to tell agent to ignore the install time cert and key. This is a more general problem for parameters we allow to be set at install time that is not specific to this particular issue.
Unless I am missing something, we should open a separate issue to allow giving Fleet complete control over the agent configuration regardless of what got set on the command line at install or enroll time.
From a UI perspective, if you are adding a proxy and that proxy requires mTLS, then having the UI present all the TLS config together makes sense. It also is necessary to have a single atomic policy change for all of these parameters at once.
It's already atomic in the sense the agent either applies all changes or none if the validation fails. However it'll apply the precedence "entity by entity".
I suspect the core problem here is that clearing the proxy client cert and key has no effect if agent was configured locally with a client cert and key.
or through the policy. Not being able to remove a previously set config the is the expected and current behaviour.
It should be possible for Fleet to tell agent to ignore the install time cert and key. This is a more general problem for parameters we allow to be set at install time that is not specific to this particular issue. Unless I am missing something, we should open a separate issue to allow giving Fleet complete control over the agent configuration regardless of what got set on the command line at install or enroll time.
That is the dream.
What I believe we need to decide is if, while we do not have this full control from fleet, the agent will treat proxy, CA, client cert as a single entity or keep the current behaviour.
It's my understanding it makes more send to make them all a single entity, which, given the proxy address has changed, would allow to remove CAs and the client certificate/key.
One thing is worth pointing out, the agent won't apply a configuration that causes loss of connectivity. Therefore, regardless the inconsistent state it might end up on, it'll still be able to reach fleet-server
It's already atomic in the sense the agent either applies all changes or none if the validation fails. However it'll apply the precedence "entity by entity".
When I said "atomic" I meant from the UI perspective, the mTLS configuration and the proxy URL should come in the same policy change action or agent won't apply it. The Proxy configuration UI is currently correct in grouping them all together.
What I believe we need to decide is if, while we do not have this full control from fleet, the agent will treat proxy, CA, client cert as a single entity or keep the current behaviour.
I don't fully understand this. The agent "entity" is an outgoing HTTP connection which may pass through a proxy and/or may use mTLS and/or may require additional certificate authorities to validate the certificate presented by the server it connects to. mTLS and the use of a proxy URL have no other relation, other than the fact that people may configure both a proxy and mTLS. The agent configuration needs no concept of a proxy entity right now.
The UI side of this models proxies as a single entity in a single page because people editing a proxy will want to consider the TLS configuration of the proxy. "Proxy" as an entity is a UI only concept here, and facilitates sending agent all of the required changes to support a proxy in one policy change action.
Version: 8.16, main
@AndersonQ does this mean we have to fix it asap in order to make it land in 8.16.1? Is this bad enough that we should block 8.16? Shall we already create a known issue?
@jlind23 No, we're fine:
I don't fully understand this. The agent "entity" is an outgoing HTTP connection which may pass through a proxy and/or may use mTLS and/or may require additional certificate authorities to validate the certificate presented by the server it connects to. mTLS and the use of a proxy URL have no other relation, other than the fact that people may configure both a proxy and mTLS. The agent configuration needs no concept of a proxy entity right now.
@cmacknz, by "single entity" I meant the agent would check if the proxy changed, if so, it'd apply the TLS configuration from the policy, even if it's empty.
As you said, it's what one would expect to happen given the current UI
Been thinking about this for a bit from a PKI standpoint.
(Assumption) Because Fleet is a central management for configuration of agents, we would expect the configuration received from Fleet to completely override any local configuration. This would mean "erasing" an option if it wasn't present in the fleet supplied policy, but present in the local policy (not what happens today).
Edge cases:
client tls private keys and certificates used in mTLS (proxy, enroll, etc). These SHOULD NOT be manged by fleet. This is because the client private key and certificate should be unique per host and a policy applies to a collection of hosts. Using the same private key on multiple hosts breaks the assumption that the key is "private". We need a local UI to update the private key & certificate to handle expiration/rotation.
certificate authorities used in mTLS SHOULD be managed by Fleet. Because these are used to verify the server identity, these could and should be managed centrally. The agent should persist the certificate authorities in it's local policy. This is because it may need them to connect to the fleet server on restart.
Proxy configuration. This COULD be managed centrally, but it would mean that all the agents in the policy use the same proxy configuration. This might not be a good assumption. The agent should persist the proxy configuration in it's local policy. This is because it may need the proxy to connect to fleet server on restart. If central management of proxy configuration is not a good assumption, then we need a local UI to update the configuration.
Well said @leehinman, we perhaps need an explicit way to annotate which parts of configuration can or must be per agent, because there is no way to centrally manage per agent values today. The mTLS private key and certificates are clearest example we've had so far. CC @nimarezainia The workflow we'd need to enable is rotating the mTLS private key of a single agent.
by "single entity" I meant the agent would check if the proxy changed, if so, it'd apply the TLS configuration from the policy, even if it's empty.
If I understand this right, this would be making an assumption in the agent that the UI will only ever update proxy and mTLS parameters together? I don't think agent should assume anything about how the UI will group changes together when there is not an explicitly defined API contract we can validate against (we don't have that here yet).
If I understand this right, this would be making an assumption in the agent that the UI will only ever update proxy and mTLS parameters together?
That would be the end result.
What we have today is trying to respect the separation Lee stated. "Proxy", "client cert and key" and "ca" are configured separately, but once set, they cannot be removed.
@cmacknz is the next step here do build out the necessary UI enhancement for managing per-agent configuration (e.g. mTLS private key and certificate)? Or is there an immediate path forward where we can resolve this issue entirely on the Agent side without needing an UI work as a pre-requisite?
@ycombinator I'm not Craig, but my take on it is the following:
We cannot do it fully on the agent side, not the desired final state:
client tls private keys and certificates used in mTLS (proxy, enroll, etc). These SHOULD NOT be manged by fleet. This is because the client private key and certificate should be unique per host and a policy applies to a collection of hosts. Using the same private key on multiple hosts breaks the assumption that the key is "private". We need a local UI to update the private key & certificate to handle expiration/rotation.
certificate authorities used in mTLS SHOULD be managed by Fleet. Because these are used to verify the server identity, these could and should be managed centrally. The agent should persist the certificate authorities in it's local policy. This is because it may need them to connect to the fleet server on restart.
Proxy configuration. This COULD be managed centrally, but it would mean that all the agents in the policy use the same proxy configuration. This might not be a good assumption. The agent should persist the proxy configuration in it's local policy. This is because it may need the proxy to connect to fleet server on restart. If central management of proxy configuration is not a good assumption, then we need a local UI to update the configuration.
What we can do on the agent side it to couple any proxy change with the CA and client tls private keys and certificates. So in the edge case where a CA and client cert were set though the CLI (during enroll) and a new proxy is set with absent CA and client cert, the agent would also remove the CA and client cert.
This is what can be done without UI changes. This would create a more consistent state, but goes against the behaviour that an nil or absent proxy/CA/cert would not remove an existing proxy/CA/cert.
I believe tying proxy+CA/cert together is the more consistent approach. That means, if the proxy changes, the agent would apply what ever CA/cert is in the policy, even if it means removing what was set by the CLI
There need to be two follow up issues from this discussion:
Neither of those block us releasing the mTLS feature today. We would release a UI that allows using a single client certificate for each agent, which while not ideal, is what we do and recommend for Logstash and satisfies a "must use mTLS for compliance purposes" compliance check box. https://www.elastic.co/guide/en/fleet/current/secure-logstash-connections.html
Without the two issues above introducing net new behavior, we cannot delete parameters set in the local agent info, only override them as Anderson says.
What I feel is important in the state of the system today is that the mTLS certificate, mTLS certificate key, proxy URL, and list of custom CAs can all be set and cleared (to potentially fall back to agent local values) from Fleet individually and independently. This ensures that any future UI changes we make will always work correctly because there is no assumption in the agent that these parameters all change together, even if they often will.
It's not clear to me that we have the last part for allowing all TLS and proxy parameters to change independently of each other.
It's not clear to me that we have the last part for allowing all TLS and proxy parameters to change independently of each other.
right now they are all bundled together under the Proxy UI.
Also the proxy UI only allows to set paths to certificate, key and CAs, it does not allow to set the certificates as the outputs allows. Some time ago I opened the bug for it https://github.com/elastic/kibana/issues/179269
Another mismatch in the UI is that it does not allow to set a passphrase for the certificate key.
The Elastic Agent can receive porxy and TLS configuration for fleet-server thorough its CLI, during install/enroll, and the policy. The configurations from the policy take precedence over the CLI, however and empty configuration from the policy does not changes the current configuration.
If the Elastic Agent is installed with a proxy using mTLS and the policy has another proxy configured with simple/one way TLS, the agent will apply the new proxy and CA, but keep the old certificate, certificate key and key passphrase/key passphrase path. Letting the agent on an inconsistent state. Such state might even lead to a failure to start if the old certificate-related configurations are files and the files are removed. If it happens, on start up, the agent will try to load those files and fail, preventing the agent from starting.
The culprit is https://github.com/elastic/elastic-agent/blob/0580e532c35171f7acd8738a9e9fd61d0c189eb1/internal/pkg/agent/application/actions/handlers/handler_action_policy_change.go#L201-L264
which handles the proxy, TLS Certificate and the TLS CA configurations as different entities, when they should be treated as one.
In other words, if either change, the proxy or the TLS certificate or the CAs, they all should be replaced. If there is any change in policyConfig.Transport, apply the whole new Transport. Right now as proxy, TLS certificate and CA are handled separately, if the CLI defines a proxy with mTLS and the policy has another proxy with one way TLS, the result config will be a mix of both instead of only the new proxy with one way TLS. The expected result is to have the new proxy URL+headers and the CAs, but the certificate and certificate key must be cleared.
For confirmed bugs, please report:
inspect
command