elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[Elastic Agent] Provide a better mechanism to update the Certificates used by Agent #4557

Open nimarezainia opened 2 months ago

nimarezainia commented 2 months ago

Describe the enhancement:

During the life-cycle of a deployment the certificates used by the agent to establish TLS connections will inevitably expire and new ones need to be used. This issue is to discuss the best approach in providing support for this, and describing how a user would go about recycling the certificates they have on all their agents.

This may involve:

elasticmachine commented 2 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz commented 2 months ago

A few thoughts on how we could accomplish this:

In the Fleet UI you can only change the Fleet Server hosts, I believe; I don't think we expose the TLS settings. So there's no way to push TLS updates to the agent remotely at all; you have to do it from the command line. We need to fix this.


As long as the certificates that changed aren't in the system certificate store, I think we can reload on the fly by just breaking the connection to Fleet and recreating it, per https://github.com/golang/go/issues/35887

Also per https://github.com/golang/go/issues/35887 if the change the user needs to make is to the system certificate store then we need to restart, so being able to remotely restart agents would help here. I think this would also cover case 2 and reloading without restarting just becomes an optimization.
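To make the reload-without-restart idea concrete, here is a minimal Go sketch, not the agent's actual code. It assumes the new CA bundle arrives as PEM bytes (for example from a policy update) and that the client reaches Fleet Server through `net/http`. The names `newClientWithCAs`, `swapClient`, and `reloadDemo` are hypothetical, and the demo uses a local `httptest` server in place of Fleet Server.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"encoding/pem"
	"errors"
	"net/http"
	"net/http/httptest"
)

// newClientWithCAs builds a fresh HTTP client that trusts only the CAs in
// caPEM. It never consults the system certificate store.
func newClientWithCAs(caPEM []byte) (*http.Client, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no valid certificates found in PEM input")
	}
	tr := &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12},
	}
	return &http.Client{Transport: tr}, nil
}

// swapClient returns a client for the new CA bundle and closes the old
// client's idle connections, so later requests must handshake again.
// On error the old client is returned unchanged.
func swapClient(old *http.Client, caPEM []byte) (*http.Client, error) {
	next, err := newClientWithCAs(caPEM)
	if err != nil {
		return old, err
	}
	if old != nil {
		old.CloseIdleConnections()
	}
	return next, nil
}

// reloadDemo starts a local TLS test server, swaps in a client that trusts
// the server's certificate, and returns the HTTP status of one request.
func reloadDemo() (int, error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	caPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: srv.Certificate().Raw})
	client, err := swapClient(nil, caPEM)
	if err != nil {
		return 0, err
	}
	defer client.CloseIdleConnections()

	resp, err := client.Get(srv.URL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}
```

Because the pool is rebuilt from the PEM bytes on every swap, this path does not depend on whatever copy of the system store the process may have cached. It does not help when the new CA lives only in the system store, which is the restart case described above.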

strawgate commented 2 months ago

Just to confirm, regarding "the certificates used by the agent to establish TLS connections": is this request related to client verification (mutual auth) -- i.e. a private key that exists on the client that Fleet creates and occasionally needs to be recreated? Or is it related to server verification -- updating the CA certificate that Agent uses to verify that it's talking to a trusted Fleet server?

I assume it's just updating the CA that Agent uses to verify Fleet, but just making sure.

cmacknz commented 2 months ago

It is updating the CA agent uses to verify it is talking to a trusted Fleet server.

See https://github.com/elastic/ingest-docs/issues/167

strawgate commented 2 months ago

Ok yeah, sounds like we need to allow Fleet to update allowed CAs via Policy with a big warning message that says you can quite easily blow up your entire environment doing so?

cmacknz commented 2 months ago

> you can quite easily blow up your entire environment doing so?

The approach we have taken with similar things, like updating proxy settings and the soon-to-be-supported mTLS settings, is to have the agent test that it can still reach Fleet server before committing and making the configuration change permanent. Hopefully we can do something similar with the CA, assuming both the old and the new CA are usable at the time the switch is made.
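A rough Go sketch of that test-before-commit pattern, under stated assumptions: it is not the agent's implementation, the probe is a plain GET against a URL, and `clientFor`, `commitIfReachable`, and `commitDemo` are hypothetical names. A local `httptest` TLS server stands in for Fleet Server.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
	"net/http/httptest"
	"time"
)

// clientFor returns an HTTP client that trusts only the given root pool.
func clientFor(roots *x509.CertPool) *http.Client {
	return &http.Client{
		Timeout:   5 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{RootCAs: roots}},
	}
}

// commitIfReachable probes url with the candidate client. The candidate is
// returned only if the request succeeds; otherwise the current client is kept.
func commitIfReachable(current, candidate *http.Client, url string) (*http.Client, bool) {
	resp, err := candidate.Get(url)
	if err != nil {
		candidate.CloseIdleConnections()
		return current, false
	}
	resp.Body.Close()
	current.CloseIdleConnections()
	return candidate, true
}

// commitDemo tries a candidate that trusts nothing and then one that trusts
// the test server, reporting whether each was handled as expected.
func commitDemo() (rejectedBad bool, acceptedGood bool) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	good := x509.NewCertPool()
	good.AddCert(srv.Certificate())
	current := clientFor(good)

	// An empty, non-nil pool trusts nothing, so the handshake fails and the
	// current client is kept.
	kept, ok := commitIfReachable(current, clientFor(x509.NewCertPool()), srv.URL)
	rejectedBad = !ok && kept == current

	// A candidate that trusts the server certificate passes the probe.
	next, ok := commitIfReachable(current, clientFor(good), srv.URL)
	acceptedGood = ok && next != current
	next.CloseIdleConnections()
	return rejectedBad, acceptedGood
}
```

The failed probe makes the test server log a TLS handshake error to stderr, which is expected. The probe only proves the candidate CA works at that moment, so the caveat about both CAs being usable during the switch still applies.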

nimarezainia commented 4 weeks ago

https://github.com/elastic/ingest-dev/issues/3443

nimarezainia commented 4 weeks ago

@cmacknz @strawgate question: if the CA is changed in this manner, does it affect the already established connections? Isn't this CA used only during the handshake? There's a mention of agent restarts which says otherwise.

strawgate commented 4 weeks ago

> there's a mention of agent restarts which says otherwise

It sounds like the Agent loads certificates from the system store on startup.

This causes an edge case where the Administrator has recently loaded a new CA into the system store and the Agent loads a policy where the user is trying to use that certificate from the store.

The agent won't see the new certificate in the store unless it is restarted between the time the administrator adds the certificate to the store and the time it loads the policy that points at the certificate.

If, upon updating the certificate used in the Fleet policy, the Agent breaks its connection to Fleet and starts a new one (per @cmacknz's comment above), the Agent would attempt to start a new connection but would find that the CA cert it's supposed to use is not in its in-memory cache of the system store.

@cmacknz does this match your understanding?

cmacknz commented 4 weeks ago

> It sounds like the Agent loads certificates from the system store on startup.

Unless custom CAs are configured, in which case we only load those. See the discussion in https://github.com/elastic/ingest-dev/issues/3424 about changing this, or making it configurable.
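A sketch of that selection logic in Go, written from the behavior described here rather than copied from the agent source. `rootsFor` and its `includeSystem` switch are hypothetical; the switch only illustrates the "make it configurable" option under discussion.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// rootsFor returns the pool to place in tls.Config.RootCAs.
// A nil pool tells crypto/tls to fall back to the host's system roots.
func rootsFor(customPEM [][]byte, includeSystem bool) (*x509.CertPool, error) {
	if len(customPEM) == 0 {
		return nil, nil // no custom CAs: use the system store
	}
	pool := x509.NewCertPool()
	if includeSystem {
		// Hypothetical option: start from a copy of the system roots
		// and add the custom CAs on top.
		sys, err := x509.SystemCertPool()
		if err != nil {
			return nil, fmt.Errorf("loading system roots: %w", err)
		}
		pool = sys
	}
	for i, p := range customPEM {
		if !pool.AppendCertsFromPEM(p) {
			return nil, fmt.Errorf("CA bundle %d contains no valid certificate", i)
		}
	}
	return pool, nil
}
```

With `includeSystem` set to false, configuring any custom CA means the system store is ignored entirely, which matches the current behavior described above.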

> If, upon updating the certificate used in the fleet policy, the Agent breaks its connection to Fleet and starts a new one (per @cmacknz's comment above) the Agent would attempt to start a new connection but would find the CA cert it's supposed to use is not in its in memory cache of the system store.

> @cmacknz does this match your understanding?

It matches what I expect to happen based on reading I've done, but this all depends on implementation details of the Go TLS implementation. The exact behavior will be easiest to confirm via testing.
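One way to start that testing is a small Go experiment, sketched here under assumptions: a local `httptest` TLS server replaces Fleet Server, and `getReused` and `handshakeDemo` are hypothetical helper names. It uses `net/http/httptrace` to show whether a request rode on an existing connection, in which case no new handshake (and so no CA check) took place.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// getReused performs a GET and reports whether the request was sent over a
// connection that already existed, i.e. without a new TLS handshake.
func getReused(c *http.Client, url string) (bool, error) {
	var reused bool
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) { reused = info.Reused },
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := c.Do(req)
	if err != nil {
		return false, err
	}
	resp.Body.Close()
	return reused, nil
}

// handshakeDemo sends three requests and records connection reuse for each.
// Idle connections are closed before the third request.
func handshakeDemo() ([3]bool, error) {
	var reused [3]bool
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// No response body, so the connection can go back to the idle pool at once.
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	client := srv.Client() // trusts only the test server's certificate
	for i := range reused {
		if i == 2 {
			client.CloseIdleConnections() // drop the kept-alive connection
		}
		r, err := getReused(client, srv.URL)
		if err != nil {
			return reused, err
		}
		reused[i] = r
	}
	return reused, nil
}
```

The second request should report reuse and the third should not, which is consistent with the CA being consulted only when a new connection is established. It says nothing about the system-store caching question, which would need a test that modifies the store while the process is running.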

nimarezainia commented 3 weeks ago

In the context of https://github.com/elastic/ingest-dev/issues/3443 (where we want to provide the UI for the user to easily swap these CA certs), having a UI is somewhat superfluous unless changing the CA causes a reset of the connection at the agent that receives the new CA.

Further, we have this same option on the Elasticsearch output of *beats, see: https://www.elastic.co/guide/en/beats/filebeat/current/securing-communication-elasticsearch.html :

output.elasticsearch:
  hosts: ["https://myEShost:9200"]
  ssl.certificate_authorities:
    - /etc/pki/my_root_ca.pem
    - /etc/pki/my_other_ca.pem
  ssl.certificate: "/etc/pki/client.pem"
  ssl.key: "/etc/pki/key.pem"

Here the user can configure a different CA. There's no mention of resetting or restarting Filebeat (perhaps we also need to address this doc section). The same YAML can be applied in the "Advanced yaml" section of the Elasticsearch output, I believe. So we definitely need to confirm whether a restart is required.

My preference is to reload without needing a restart.

fyi @AndersonQ as you have been looking into this area lately.

nimarezainia commented 3 weeks ago

For now I think we should have an issue for testing the theories raised here, to determine what level of work is required to fix this. We can pursue the UI efforts in parallel.

nimarezainia commented 3 weeks ago

@belimawr did mention this on a thread: https://github.com/elastic/beats/blob/3102b496b9e9f0eae8c7eb685b1217734d40190b/filebeat/filebeat.reference.yml#L1710-L1716 - that beats will restart if the certificates change

cmacknz commented 3 weeks ago

Agent restarts Beats (not all inputs, just Beats) automatically when any output parameter changes because of a historical bug in output hot reloading that hasn't been investigated/resolved.

Ideally we wouldn't do this.

belimawr commented 3 weeks ago

> @belimawr did mention this on a thread: https://github.com/elastic/beats/blob/3102b496b9e9f0eae8c7eb685b1217734d40190b/filebeat/filebeat.reference.yml#L1710-L1716 - that beats will restart if the certificates change

This option is disabled by default.

As Craig said, any change to output parameters will restart a Beat. It's an implementation detail, but the Beat restarts itself; the Elastic-Agent just sends the new config and the Beat decides what to do.

This behaviour is configurable (https://github.com/elastic/beats/blob/3c9f4d952bfd20b1898cfeb59916a2239b667988/x-pack/agentbeat/agentbeat.spec.yml#L74-L75), but as it is required for the Beat to function correctly, it is always enabled.