[Feature Request] Have Elastic Agent send a final message to its fleet server when making changes

aarju commented 2 years ago

Describe the enhancement: When the elastic-agent enroll or the elastic-agent uninstall commands are run, the binary should send a final message to the current fleet server before enrolling with the new fleet server. The fleet server can use this message to change the status of the agent in fleet and to notify admins if an agent was unexpectedly uninstalled. Currently, when the agent makes changes the Fleet server is unaware of them. This results in lots of identical agents that are no longer active, and the admins do not know which one is still the active agent.

Describe a specific use case for the enhancement or feature:

One of the governance requirements for multiple compliance frameworks such as FedRAMP or PCI is that we have to have alerting in place for when an endpoint security agent stops running. This feature would help bring Elastic Agent into compliance without the need for a separate auditbeat process to monitor for the agent removal.

This would also help keep fleet servers clean in a devops environment where agents are managed via code.

Link to RFC

https://docs.google.com/document/d/1gYbsGfvjc7NhbURwYNqEl25ouar81nZ_8bkpsi0Dc6Y/edit

### Tasks
- [x] fleet-server `/api/fleet/agents/:id/audit/unenroll` api
- [ ] elastic-agent uses fleet-server's audit/unenroll api when uninstalling
- [ ] feature integration tests
- [ ] fleet-ui to query for `Detached Endpoint` and show a non-offline state OR Endpoint uses `/api/fleet/agents/:id/audit/unenroll` api to update fleet
- [ ] fleet-ui FORCE_UNENROLL agent annotations
- [ ] scale test `/api/fleet/agents/:id/audit/unenroll` API endpoint

### Future Work
- [ ] fleet-ui improve audit logs
- [ ] fleet-ui FORCE_UNENROLL action RBAC
- [ ] Endpoint provides `verify no-tamper-protection` command
- [ ] elastic-agent check Endpoint's `verify no-tamper-protection` command before executing `enroll -f` and `upgrade` invocations and `POLICY_REASSIGN` actions

nimarezainia commented 2 years ago

@aarju could you please elaborate on the issue with an example workflow that created the problem for you (and indeed whether this is a bug that needs to be addressed)? I don't see why, say, on an enroll we should be sending a "final" message to the current fleet server.

There seem to be a lot of duplicate agents in the screenshot attached. Could you explain the workflow that got you there? Perhaps that is the issue we need to address.

We are in the process of enhancing the status reporting on the agent - including when the integrations have an issue and are not operating in the healthy status. Once this is available the users will be able to build alerts when the status on the agent changes. That should address the actual concern raised here regarding notifications on status change.

aarju commented 2 years ago

@nimarezainia when the enroll command is run on an agent that is already connected to a fleet server, it should send a final message to the current fleet server letting it know that a new agent id will be assigned and the current agent id is being removed. If you run the exact same elastic-agent enroll command multiple times it will get a unique agent id each time and cause multiple stale agents to show up as offline in fleet. The Fleet server has no way of knowing that those agents were re-enrolled with a new ID. From the fleet point of view they are just offline and may come back some day.

The workflow that resulted in these duplicate agents came from testing the use of Jamf to install agent, and testing how the Jamf script handles things like agents that are already installed but talking to a different fleet server, agents that are installed but unhealthy, and agents that are running an older version than the currently deployed version. Since elastic-agent deploys our Endpoint Security for workstations, we decided that it was more important to guarantee that the agent is running and that we can deal with the duplicate agents. Because we are aggressive about making sure a healthy agent is running, we are seeing the occasional duplicate agent due to a reinstall or re-enroll.

From a security point of view the scenario that is most concerning to me is the insider threat or hacker with admin rights using the elastic-agent enroll command to bypass all of our XDR protections and logging. The hacker could enroll the agent in their own fleet server with all protections disabled, do anything they want, and we would never know that it had happened. The Jamf script to deploy agent runs every 24h so it would be a considerable blind spot.

aarju commented 2 years ago

What would be nice to have is:

- When elastic-agent enroll is run, it checks to see if there is already a fleet server configured, and if so it sends a final message showing that it is being enrolled again
- If the new fleet server is identical to the old server, then it should keep its current agent id to keep from creating a duplicate agent in fleet
- If the new fleet server is different from the current server, then the URL of the new server should be included in the event sent to the old server
- The fleet server should have logic to handle these events and tag agents that have been enrolled into a new server. We can also create security alerts on them
- When the elastic-agent uninstall command is run, it should send an unenroll event to the fleet server before starting the uninstall process. The event should contain information about the user and process that ran the command.
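The enroll-time behaviour described in this list could be sketched roughly as follows. This is only an illustration of the decision being asked for, not agent code: `plan_enroll`, `EnrollPlan`, the event fields, and the trailing-slash URL comparison are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnrollPlan:
    keep_agent_id: bool          # reuse the current ID instead of getting a new one
    final_event: Optional[dict]  # event for the old Fleet Server, if there is one


def plan_enroll(current_url: Optional[str], new_url: str, agent_id: str) -> EnrollPlan:
    """Decide what should happen before re-enrolling (hypothetical logic)."""
    if current_url is None:
        # First enrollment: there is no old server to notify.
        return EnrollPlan(keep_agent_id=False, final_event=None)

    event = {"type": "re-enroll", "agent_id": agent_id}
    if current_url.rstrip("/") == new_url.rstrip("/"):
        # Same server: keep the ID so no duplicate shows up in Fleet.
        return EnrollPlan(keep_agent_id=True, final_event=event)

    # Different server: tell the old server where the agent is going.
    event["new_fleet_url"] = new_url
    return EnrollPlan(keep_agent_id=False, final_event=event)
```

The uninstall case would send a similar event, with the user and process details added, before the uninstall starts.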

nimarezainia commented 2 years ago

@aarju it is correct that upon a new enrollment - we consider that as a new agent instance and treat it as such. Your concern is not so much the "offline" agents that are hanging around but this comment:

"from a security point of view the scenario that is most concerning to me is the insider threat or hacker with admin rights using the elastic-agent enroll command to bypass all of our XDR protections and logging. The hacker could enroll the agent in their own fleet server with all protections disabled, do anything they want, and we would never know that it had happened. The jamf script to deploy agent runs every 24h so it would be a considerable blind spot."

How would the hacker enroll the agent into their own Fleet Server? The bigger issue worth addressing is how the hacker obtains their own fleet server that is connected to ES/Kibana (they need service tokens, the right creds, etc). If there's a loophole here we should address that.

aarju commented 2 years ago

My big concern is that the fleet is not aware of commands and actions that happen on the agent if those actions cause the agent to stop communicating with fleet. There are a lot of talks and blog posts at security conferences where red teams and hackers trade notes on how to disable various XDR/EDR agents in order to cover their tracks. For example, last month in Munich this talk was given about different complex ways to disable EDR after popping a shell on a host. With elastic agent you don't need any of those complex techniques, you just need to enroll it in your own fleet server or ask it nicely to uninstall and the defenders will never know.

> how would the hacker enroll the agent into their own Fleet Server? the bigger issue worth addressing is how does the hacker obtain their own fleet server that is connected to ES/Kibana (they need service tokens, right creds etc). if there's a loop hole here we should address that.

When I say their own fleet server I mean the hacker creates their own completely separate Elastic stack that they control and then have elastic agent connect to their stack. They could even create a free trial account in Elastic Cloud to use for their hacks.

peasead commented 1 year ago

Great idea, @aarju

To add some additional context, we ran this exact scenario during a recent ON Week in Protections.

In a lab environment, we used a phishing email with a macro-enabled Word document to run the elastic-agent.exe enroll command. From this, we were able to enroll the targeted machine in our own Fleet server that we stood up. This allowed us to remove visibility from the targeted machine's infosec team and use the Elastic Stack as an implant management framework.

nimarezainia commented 5 months ago

@aarju do you still see this issue as a concern given we have tamper protection available on the agent? There should be no unauthorized manipulation of the agent.

Currently there's no way to differentiate between agents that were moved from one Fleet to another due to malicious activity and those which have legitimately gone offline (say someone going on holidays for 2 weeks). The crux of this issue is to identify agents that are illegitimately uninstalled or enrolled into a different Fleet. Tamper protection is somewhat protecting us from that event.

cc: @pierrehilbert @cmacknz

WiegerElastic commented 5 months ago

> @aarju do you still see this issue as a concern given we have tamper protection available on the agent? there should be no unauthorized manipulation of the agent.
>
> Currently there's no way to differentiate between agents that where moved from one Fleet to another due to malicious activity and those which have legitimately go offline (say someone going on holidays for 2 weeks). The crux of this issue is to identify agents that are illegitimately uninstalled or enrolled into a different Fleet. Tamper protection is somewhat protecting us from that event.
>
> cc: @pierrehilbert @cmacknz

There are still scenarios in which an Agent might be uninstalled, even with tamper protection enabled. For example, an IT department might want to enable folks to do a (re)install of Agent for whatever reason (through Jamf or Intune). It would still be nice to know when this happens from a Fleet perspective.

I could also imagine that not all our customers can or want to enable tamper protection and would still benefit from these messages.

cmacknz commented 5 months ago

> With elastic agent you don't need any of those complex techniques, you just need to enroll it in your own fleet server or ask it nicely to uninstall and the defenders will never know.

This should now be prevented with tamper protection.

> When the elastic-agent enroll or the elastic-agent uninstall commands are run the binary should send a final message to the current fleet server before enrolling with the new fleet server.

> One of the governance requirements for multiple compliance frameworks such as Fedramp or PCI is that we have to have alerting in place for when an endpoint security agent stops running. This feature would help bring Elastic Agent into compliance without the need for a separate auditbeat process to monitor for the agent removal.

Given the compliance requirement, I assume this can't be best effort. That is, if agent is uninstalled, it must notify Fleet Server. It can't send a message which may or may not be processed by Fleet before the uninstall happens.

In the case of uninstall, we would have to have a mode that prevents uninstall until Fleet has been notified the uninstall has completed. This would prevent uninstall unless the target machine is online.

We have talked within the team about creating a separate watchdog service to monitor agent, like a permanently running instance of our upgrade watcher, but with Fleet connectivity. I think this would be what we'd need to solve this correctly, essentially just making a separate auditbeat instance a mandatory part of the installation. In this type of solution the primary agent process would get uninstalled as requested, but the watchdog wouldn't remove itself until it notified Fleet about what is happening.

For unenroll it is a bit easier, as the agent is still going to be running. It can just keep attempting to notify the old cluster in the background.

The suggestions around de-duplicating agents re-enrolled to the same cluster are a separate and less complex problem to solve. I'd create a separate issue just for that, as the solution there doesn't overlap with the idea of a final message at all. Potentially it can be solved in Fleet if we fingerprint the agent host machine and start keeping a history of what the agent did on that machine, or group agents we think are the same machine together.
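To illustrate the grouping idea only: a minimal sketch, assuming each agent record carried a host fingerprint and a last check-in time. The field names `host_fingerprint` and `last_checkin` and the function `group_by_host` are hypothetical, not existing Fleet attributes.

```python
from collections import defaultdict


def group_by_host(agents):
    """Group agent records by host fingerprint (hypothetical field).

    Within each group the most recently checked-in agent is treated as the
    active one; the rest are candidates for being shown as duplicates.
    """
    groups = defaultdict(list)
    for agent in agents:
        groups[agent["host_fingerprint"]].append(agent)

    result = {}
    for fingerprint, members in groups.items():
        ordered = sorted(members, key=lambda a: a["last_checkin"], reverse=True)
        result[fingerprint] = {
            "active": ordered[0]["id"],
            "stale": [a["id"] for a in ordered[1:]],
        }
    return result
```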

peasead commented 5 months ago

Are tamper protection events logged anywhere?

nimarezainia commented 5 months ago

> The suggestions around de-duplcating agents re-enrolled to the same cluster are a separate and less complex problem to solve. I'd create a separate issue just for that as the solution there doesn't overlap with the idea of a final message at all. Potentially it can be solved in Fleet if we fingerprint the agent host machine and start keeping a history of what the agent did on that machine or grouping agents we think are the same machine together.

Things have changed slightly since this issue was created. Those duplicates will show offline and then become inactive (user can adjust the timer but I believe the default is 7 days). Inactive agents are not shown in the default view. We have an open issue to create another timer to automatically wipe clean the inactive agents if the user wishes so.
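As a rough picture of that offline-then-inactive progression: the sketch below assumes a status derived purely from the time since the last check-in. The 7-day default is the value mentioned above (itself stated as a belief); `offline_after`, the status names, and `agent_status` are illustrative assumptions.

```python
from datetime import datetime, timedelta


def agent_status(last_checkin, now, offline_after, inactive_after=timedelta(days=7)):
    """Classify an agent by how long ago it last checked in (illustrative)."""
    idle = now - last_checkin
    if idle >= inactive_after:
        return "inactive"  # hidden from the default view
    if idle >= offline_after:
        return "offline"
    return "online"
```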

nimarezainia commented 5 months ago

> Are tamper protection events logged anywhere?

@nfritts @roxana-gheorghe would you know ^^ ?

nimarezainia commented 5 months ago

> We have talked within the team about creating a separate watchdog service to monitor agent, like a permanently running instance of our upgrade watcher, but with Fleet connectivity. I think this would be what we'd need to solve this correctly, essentially just making a separate auditbeat instance a mandatory part of the installation. In this type of solution the primary agent process would get uninstalled as requested, but the watchdog wouldn't remove itself until it notified Fleet about what is happening.

@cmacknz Can the check-in message be utilized for this function? I'm thinking of a general-purpose section in the check-in that would require an acknowledgment from Fleet. Most of the time there won't be anything in there. In this case Agent would wait until the ack is received from Fleet before proceeding.

cmacknz commented 5 months ago

> Can the Check-in message be utilized for this function? a general purpose section in the check-in that would require an acknowledgment from Fleet. Most of the time there won't be anything in there. in this case Agent would wait until ack is received from Fleet before proceeding.

We could maybe reuse the checkin message to do the exchange, but the core problem with sending a final message reliably is the uninstall case. Once uninstalled there is nothing left to do the check in.

I don't like blocking uninstall until you check in with Fleet one last time as a solution because it creates the potential for accidentally unremovable agents. There are ways to deal with this, they are just more complicated.

michel-laterman commented 2 months ago

From what I understand of this issue we want to nudge fleet to unenroll an agent (either with an UNENROLL or FORCE_UNENROLL) in two specific scenarios:

- elastic-agent uninstall command is run: agent should send a best-effort message to fleet-server
  - a best-effort attempt would also occur if install -f is used
  - tamper protection should reject an invalid uninstall (#4506 will stop install -f from orphaning endpoint)
- elastic-agent enroll -f command is run: I'm unclear if we want this as best-effort or not

If we don't want to block on an uninstall then I think the nudge to fleet-server should make fleet-server insert a FORCE_UNENROLL action into the .fleet-actions index to be handled by the fleet-ui. In this case the agent would just log if there was a request error/non-200 status code.

If we want to try to get all data sent from the agent, the nudge can result in an UNENROLL action that the agent receives and tries to execute and ack before the uninstall progresses (or a timeout is reached).

I think it's better to introduce a new endpoint for this (such as DELETE /api/fleet/agents/:id) where the agent's API key is required instead of adding additional attributes to the checkin body.

cmacknz commented 2 months ago

From the description:

> we have to have alerting in place for when an endpoint security agent stops running

It's not about nudging, it's about alerting when an agent is uninstalled or unenrolled, which is required for compliance with certain standards organizations.

Since this is tied to compliance, it can't be best effort; it has to have "at least once" guarantees. The feature can't be "maybe send alert to Fleet", it has to be "always send alert to Fleet". This is where the complexity comes from.
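To make the difference concrete, an "at least once" sender generally has to persist the pending notification somewhere that outlives the uninstalling process, and only drop it after the server has acknowledged it. The sketch below is a generic illustration of that pattern, not a proposal for the agent's implementation; `record_pending`, `flush_pending`, the file-based store, and the `send` callback are all hypothetical.

```python
import json
import os


def record_pending(path, event):
    """Persist the notification before doing anything destructive."""
    with open(path, "w") as f:
        json.dump(event, f)
        f.flush()
        os.fsync(f.fileno())  # make sure it survives a crash or reboot


def flush_pending(path, send):
    """Try to deliver a pending notification.

    `send(event)` returns True only when the server acknowledged the event.
    The record is removed only after an ack, so a failed attempt is retried
    on the next call. Returns True when nothing is left to deliver.
    """
    if not os.path.exists(path):
        return True
    with open(path) as f:
        event = json.load(f)
    if send(event):
        os.remove(path)
        return True
    return False
```

Something still has to be running to call `flush_pending` after the agent is gone, which is where the watchdog discussion above comes in.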

michel-laterman commented 2 months ago

OK, I think then as a start the elastic-agent uninstall and elastic-agent enroll -f commands should both force a running/installed agent to use a DELETE /api/fleet/agents/:id endpoint to tell the fleet-server that an agent is being removed, as a blocking call; we can mark the agent document using the agent.Unenrollment* attributes or introduce a new attribute. This call to fleet-server must take place after the elastic-agent checks whether tamper protection is enabled (and the other checks we do in uninstall), so that it does not notify fleet-server that an agent will be deregistered and then fail to deregister it.

If the agent is running, do we want to send an UNENROLL action to it (and await that action ack) or should we just invalidate api keys/use a FORCE_UNENROLL action?

If we wanted to add an escape hatch for the user to uninstall the agent with a single command (and not manually remove the agent's install dir, de-register services, etc), we can introduce a new flag to the uninstall command (i.e., --skip-server-notification). Would we need the fleet-ui component to periodically check if an agent is offline for a while without any unenroll reasons, or would this clash with legitimate use cases (shutting down a laptop for a vacation, for example)?

I'll start working on an RFC for this

cmacknz commented 2 months ago

> We can introduce a new flag to the uninstall command (i.e., --skip-server-notification)

From a compliance perspective this doesn't work. Users that want this feature will not want it to be possible for it to be trivially disabled. One of the use cases here is that the end user wants a notification that a user with root privileges removed agent from their machine. The best example of this would be an Elastic engineer temporarily removing the InfoSec managed agent from their machine. InfoSec will want a notification that this happened.

It is probably best to think of this as an optional feature of the agent policy, similar to tamper protection. Potentially it should be part of tamper protection.

You will need to handle the case where we want to uninstall, but the network is down. A user should not be able to bypass the audit notification or hang the uninstall of the agent by temporarily turning off wifi on their machine.

cmacknz commented 2 months ago

> When the elastic-agent enroll or the elastic-agent uninstall commands are run the binary should send a final message to the current fleet server before enrolling with the new fleet server. The fleet server can use this command to change the status of the agent in fleet and to notify admins if an agent was unexpectedly uninstalled. Currently when the agent makes changes the Fleet server is unaware of the changes and this will result in lots of identical agents that are no longer active and the admins do not know which one is still the active agent

We've started designing how we could guarantee that you get a notification and it isn't simple to do. We were wondering if we could change the solution to instead remove the need to guarantee a notification.

@aarju If we supported tamper protection of the elastic-agent process, and tamper protected the enroll command, so that only trusted users could perform the enroll or uninstall operations, would the need for a notification still exist? Within Elastic, this means users with root or admin privileges on their machines would no longer be able to manipulate the InfoSec Elastic Agent at all without the uninstall token. Today we only tamper protect the endpoint-security process.

To implement a reliable notification for uninstall or enroll we would likely have to add yet another service whose job it is to perform these notifications, but that service not being tamper protected means a privileged enough user could just stop it if they knew it existed.

aarju commented 2 months ago

@cmacknz I think that adding tamper protection to the agent process would be a good solution. I still think a 'best effort' final log event letting us know that an enroll or uninstall command was run would be nice, but then it wouldn't have to be 'guaranteed' to meet the regulatory requirements. A process that follows the normal logging path prior to running the enroll or uninstall commands would be good enough.

cmacknz commented 2 months ago

Thanks, being able to relax the notification to best effort simplifies the implementation significantly. We should be able to do that along with tamper protecting uninstall+enroll for agent itself.

ycombinator commented 1 month ago

Thanks, @michel-laterman, for driving the tech definition for this issue along with inputs from the Endpoint team and other engineers. And thank you for updating this issue's description with a task list for concrete next steps.

@nimarezainia I'm re-assigning this issue to you for product prioritization, presumably in consultation with the Endpoint team as there's some work to be done on their end as well.

nimarezainia commented 1 month ago

@ycombinator I don't think the prioritization changes in this case. This issue as it stands should be addressed in some fashion. Expanding tamper protection was one idea, it seems to be a lot more risky and a lot more involved.

Aside from tamper protection, I think a best-effort, non-blocking log message to indicate the agent was being uninstalled or unenrolled to a new fleet server would be a good approach here. Let me know if that makes sense.

michel-laterman commented 1 month ago

@intxgo and I had a meeting to discuss concerns about the proposal. He said it may be possible to have an uninstall/orphaned message sent from Endpoint as part of an existing notifications channel, and will investigate and update the issue with findings.

intxgo commented 1 month ago

This issue is quite old. I totally agree with addressing the problem, but I don't necessarily agree with the proposed solution, as a lot has changed since then.

To begin with, it seems the problem does not exist when Agent is unenrolled from Kibana, only when that is done on the endpoint from the command line. Here we have the following situations:

1. Agent without Endpoint
2. Agent with Endpoint, not Tamper Protected
3. Agent with Endpoint, Tamper Protected

Case 1. I agree that Agent should attempt to notify fleet about uninstall or re-enroll to avoid the ghost Offline entry in Kibana. It'd be a nice addition.

Case 2. The broken link from Kibana to Endpoint seems to be the main concern. However, since we have the Tamper Protection capability, we shouldn't overcomplicate things here. For sure it'd be good if Agent continued to report its changes as in Case 1. However, bear in mind that an admin user can simply stop the services, Agent and Endpoint, just delete them, etc. A trivial bypass is possible anyway.

Case 3. In this mode only Endpoint is Tamper Protected, so Agent must always seek approval from Endpoint first before taking any action. The commands

- elastic-agent uninstall
- elastic-agent enroll

should always check with Endpoint first (regardless of whether -f is appended or not) whether it's Tamper Protected, requiring a valid --uninstall-token [token] to proceed with the action (this might be difficult now on the Agent side, which is why I proposed adding the Endpoint command verify no-tamper-protected to easily pre-check the condition). With a valid uninstall token, Agent notifying Fleet about the change, as in Case 1, seems enough to have a consistent view in Kibana. As a matter of fact, Endpoint already makes an effort to notify the stack about being uninstalled https://github.com/elastic/endpoint-dev/pull/13405 . This information can be used on the stack side by Fleet. We assume here that using a valid uninstall token is a sufficient security boundary which doesn't need special alerting on the Kibana side, but naturally Agent can include the fact that a valid uninstall token was used in its notification to fleet.
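The pre-check in Case 3 amounts to a small gate in front of uninstall and enroll. A minimal sketch, assuming Agent can ask Endpoint whether it is Tamper Protected and whether a token is valid; `may_proceed` and the `token_is_valid` callback are hypothetical stand-ins for whatever Endpoint would actually expose.

```python
def may_proceed(tamper_protected, token, token_is_valid):
    """Gate for `uninstall` / `enroll`, applied whether or not -f was given.

    Returns (allowed, reason). The reason could be included in the
    notification Agent sends to Fleet.
    """
    if not tamper_protected:
        return True, "no tamper protection"
    if token is None:
        return False, "uninstall token required"
    if not token_is_valid(token):
        return False, "invalid uninstall token"
    return True, "valid uninstall token"
```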

Nonetheless, Agent is completely vulnerable to admin, so we can easily have a situation where it's gone, removed by a malicious actor, leaving Endpoint orphaned (with no link to Kibana). In this case, the malicious actor should not be able to tamper with Endpoint's config due to the policy signature. Their fake stack won't have the same key to sign the policy and/or actions. In short, Endpoint itself should not be taken over, and we should address any security bug here. Even right now, if Agent gets abruptly re-enrolled to a different stack, Endpoint keeps protecting the host with the last config, as the new stack can't override the policy through Agent.

I agree that the status Offline in Kibana is very wrong when the Endpoint service is not offline. There are a number of customers who want only Elastic Defend, so the Agent in the middle is a completely irrelevant detail to them. Currently Endpoint "checks in" with Fleet via Agent. In case of broken connectivity between Agent <-> Endpoint, the stack should look at other channels to check whether Endpoint is running, to be able to display something like a Detached Endpoint status instead of Offline in the Fleet view, and Detached instead of Inactive in the Security -> Endpoints view. If querying for incoming events from Endpoint is not a feasible solution here, then Endpoint should "check in" with fleet directly when it can't talk to Agent, periodically calling the proposed orphaned API, or something else.

Summary

In my opinion, notification about disconnecting Agent from the current fleet from the command line, by uninstall, enroll, etc, is a nice addition but does not require special *hacks, as Agent is totally vulnerable to admin. Tamper Protection ensures Endpoint can't be uninstalled or trivially taken over by admin, but Agent is a weak link in the chain, so there should be a backup direct channel between Fleet <-> Endpoint to show in Kibana that some Endpoints are still enrolled to the current fleet but are detached.

*hacks - because the Endpoint installer doesn't see whether it's being invoked by Agent through an action originating from the stack or from the command line, and does not support force flags. Adding new parameters around here, etc, would only complicate the already very complex Tamper Protection flow, see https://github.com/elastic/endpoint-dev/pull/14268 . I hope it's clear why I do not recommend complicating it further.

michel-laterman commented 1 month ago

I've added task lists to the issue description.

For the basic implementation of this issue we'll add the new API to fleet-server, and call it from the elastic-agent on uninstall (best effort with limited retries), after component uninstallation is successful. We'll need the fleet-ui to show the new status based off the new attributes, and annotate the agent record if it's targeted by a FORCE_UNENROLL action.
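A minimal sketch of what "best effort with limited retries" could look like on the agent side. The attempt count, the delay, and the names `notify_best_effort` and `send` are assumptions for illustration; the real call would be the new fleet-server API.

```python
import time


def notify_best_effort(send, attempts=3, delay=1.0, sleep=time.sleep):
    """Call `send()` up to `attempts` times and report whether it succeeded.

    `send` returns True on success. Network errors (OSError) are swallowed
    because the host may simply be offline. A False result is only meant to
    be logged, it never blocks the uninstall.
    """
    for attempt in range(1, attempts + 1):
        try:
            if send():
                return True
        except OSError:
            pass
        if attempt < attempts:
            sleep(delay)
    return False
```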

For detecting orphaned Endpoints we can take one of two approaches:

1. Have the fleet-ui look at an alternate channel to see if Endpoint is active when the supervising elastic-agent instance is offline; I'm not sure what the actual channel here would be, or how feasible it would be for fleet-ui to implement.
2. Have Endpoint use the new API directly when it detects that it's orphaned. For this case the semantics of a successful checkin may need to change to clear the attributes that indicate an orphaned status if the agent checks in after it is set.

The other tasks from the RFC, mainly Endpoint providing a verify no-tamper-protection command for the elastic-agent to use before certain operations, will be set as future work.

cc @kpollich

michel-laterman commented 1 month ago

@ycombinator and I had a brief discussion, we're leaning towards option 2 in my comment above.

I have a PR up to add the new endpoint to fleet-server.

In order to support the second approach we will need to change the behaviour of the checkin API to remove the audit_unenrolled_reason, audit_unenrolled_time, and unenrolled_at attributes on a successful checkin.
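In other words, a successful checkin would reset the audit state on the agent document. A sketch of just that step, using the three attribute names above; `apply_checkin` and the plain-dict document are illustrative, not fleet-server code.

```python
AUDIT_FIELDS = ("audit_unenrolled_reason", "audit_unenrolled_time", "unenrolled_at")


def apply_checkin(agent_doc):
    """Return a copy of the agent document with the audit attributes cleared."""
    return {k: v for k, v in agent_doc.items() if k not in AUDIT_FIELDS}
```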

Does anyone have objections to this approach? cc @kpollich @cmacknz @intxgo

intxgo commented 1 month ago

Orphaned Endpoint is not a normal state so I'm leaning towards explicitly using the new API /api/fleet/agents/:id/audit/orphaned to update fleet as it seems easier to test overall. Endpoint can call it periodically (TBD 30 seconds? I'm thinking about matching Agent check-in interval), until the situation is resolved.

michel-laterman commented 1 month ago

The /api/fleet/agents/:id/audit/unenroll endpoint is currently designed to reduce load on ES; as it stands today it will return a 409 if it has already been called previously. What purpose would calling every 30s do for the orphaned case?
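The "409 if already called" behaviour can be pictured like this. It is a simplified model rather than the fleet-server handler: the 200 success code and the in-memory dict standing in for the agent document are assumptions.

```python
def audit_unenroll(agent_doc, reason, timestamp):
    """Record the audit reason once and reject repeats.

    Repeat calls return 409 without writing anything, which is what keeps
    the load on ES down.
    """
    if agent_doc.get("audit_unenrolled_reason") is not None:
        return 409
    agent_doc["audit_unenrolled_reason"] = reason
    agent_doc["audit_unenrolled_time"] = timestamp
    return 200
```

Under this model a second call every 30s would change nothing until a checkin clears the attributes.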

intxgo commented 1 month ago

> The /api/fleet/agents/:id/audit/unenroll endpoint is currently designed to reduce load on ES; as it stands today it will return a 409 if it has already been called previously. What purpose would calling every 30s do for the orphaned case?

To show on the UI the seconds since the last Detached Endpoint status update, as an indication that the machine is online. Without it we will only know from the Fleet UI status that the installation is broken on that host. I see, this API is hooked differently than Agent check-in. The state is going to be kept until the Agent connects successfully to clear it. I don't have any strong opinion on this. The orphaned state is not a normal state, so if it requires too many changes to hook it in place of Agent check-in, I'm fine with Endpoint making only one successful call. The user can always go to the Security section to query for incoming events from Endpoint to determine if the machine is online or offline.

elastic / elastic-agent

[Feature Request] Have Elastic Agent send a final message to its fleet server when making changes #484
