elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[Fleet] Solution to avoid duplicate agents #1277

Open naj-h opened 2 years ago

naj-h commented 2 years ago

TL;DR: We need a way to properly unenroll an agent from command line on the client side, to be able to enroll it again without creating duplicates.

The use case: an agent is having some issue and shows as updating in the Fleet UI. Naturally, one tries to enroll the same agent again from the client's command line.

For this:

The goal here is to be able to re-enroll the agent without it creating duplicates.

So the only option left is to go to the Fleet UI and unenroll the agent. Only once it has been successfully unenrolled can we enroll it again from the agent command line. Having to go through the Fleet UI to do this is cumbersome.

I think the solution to this problem could take multiple forms. One of them would be to add an option to the enroll/install commands that lets the user decide what to do when an existing agent is found: create a duplicate, unenroll the existing one and enroll again, or skip enrollment and continue.
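To make the three proposed behaviours concrete, here is a minimal sketch of the decision logic. It is not an existing Elastic Agent option: the `OnExisting` enum, the `plan_enrollment` function, and the step names are all hypothetical and only illustrate the proposal above.

```python
from enum import Enum
from typing import List


class OnExisting(Enum):
    """Hypothetical choices for what enroll does when the agent is already known."""
    DUPLICATE = "duplicate"  # enroll anyway and leave the old record in place
    REPLACE = "replace"      # unenroll the old record, then enroll again
    SKIP = "skip"            # leave the old record alone and do not enroll


def plan_enrollment(already_enrolled: bool, on_existing: OnExisting) -> List[str]:
    """Return the ordered steps a hypothetical enroll command would take."""
    if not already_enrolled:
        return ["enroll"]
    if on_existing is OnExisting.REPLACE:
        return ["unenroll-existing", "enroll"]
    if on_existing is OnExisting.SKIP:
        return []
    # DUPLICATE: the behaviour this issue describes as the current one.
    return ["enroll"]
```

With such an option, `plan_enrollment(True, OnExisting.REPLACE)` would unenroll first and then enroll, which is the flow that currently requires a trip through the Fleet UI.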

nash-sprd commented 2 years ago

This is a much-needed feature. We have ephemeral (autoscaled) hosts, and it does not make sense to get yet another agent every time a host is added. The agent must also allow automated removal from the fleet when scaling down.

zez3 commented 2 years ago

@joshdover ?

zez3 commented 2 years ago

@jamiehynds is this already on a roadmap?

joshdover commented 2 years ago

The Unenrollment timeout setting in the agent policy is our current solution to this, though it's not ideal. An automated unenroll during shutdown / SIGTERM could be added. Of course, if the agent receives a SIGKILL we may not get an opportunity to send the unenrollment, so it won't be perfect, but it is probably good enough most of the time.
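As a rough illustration of the shutdown idea, the sketch below registers a SIGTERM handler that makes one best-effort unenroll call before exiting. The `unenroll` callback is hypothetical and stands in for whatever request the agent would send to Fleet. Nothing comparable can be done for SIGKILL, because the process cannot catch that signal.

```python
import signal


def make_unenroll_handler(unenroll):
    """Build a signal handler that tries to unenroll once, then exits."""
    def handler(signum, frame):
        try:
            unenroll()  # hypothetical callback that asks Fleet to drop this agent
        except Exception:
            pass  # the process is shutting down anyway; do not block the exit
        raise SystemExit(0)
    return handler


def register(unenroll):
    """Install the handler for SIGTERM. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, make_unenroll_handler(unenroll))
```

The handler swallows errors on purpose: a failed unenroll should fall back to the existing timeout-based cleanup rather than hang the shutdown.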

cmacknz commented 2 years ago

In addition to having a way to avoid duplicated agents when reinstalling (if desired), we should also make sure that the underlying integration state can be preserved between installs when we use re-installing as a corrective action (e.g. failed upgrades). We have a way to mount the integration state like Filebeat registries separately using the STATE_PATH environment variables for containerized agents, but nothing for agents on a single host.

MakoWish commented 1 year ago

Any traction on this?

cmacknz commented 1 year ago

We are actively looking at a way to allow the same agent to re-enroll without creating a duplicate agent, but to solve a related but slightly different problem. I think the solution in https://github.com/elastic/fleet-server/issues/2254#issuecomment-1538833737 would probably address this as well.

MakoWish commented 1 year ago

Thank you, Craig, but the "write a UUID to disk" solution discussed in that issue would not resolve our situation. We have ~5,500 devices on our network, and they are often pulled back for reimaging for one of a few reasons.

One of those cases is when a team member leaves the company, the computer they were using is pulled back, reimaged, and staged for deployment to a next hire. Since our devices are named according to their asset tags, the computer name remains the same. Because of this, reimaging the device, and subsequently reinstalling Elastic Agent, a duplicate will be created in Fleet. Since the device was wiped, a UUID being written to disk would also be wiped.

cmacknz commented 1 year ago

Thanks. In that case you would need the UUID to be some permanent attribute of the device, like the computer name. That use case can probably be solved by allowing alternate sources for the UUID we would generate. The rest of the logic in that solution should still apply.
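One way to derive an identifier from a permanent attribute is a name-based (version 5) UUID, which always yields the same value for the same input. This is only a sketch of the idea and not how Fleet generates agent IDs. The namespace and the `stable_agent_id` function are assumptions made for illustration.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as every install uses the same one.
AGENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "agents.example.invalid")


def stable_agent_id(permanent_attribute: str) -> str:
    """Map a permanent device attribute (e.g. the computer name) to a repeatable UUID."""
    normalized = permanent_attribute.strip().lower()
    return str(uuid.uuid5(AGENT_NAMESPACE, normalized))
```

Because the result depends only on the input string, a reimaged machine that keeps its asset-tag name would compute the same ID again, even though nothing survived on disk.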

MakoWish commented 1 year ago

Perhaps using the system's UUID? This would only change with hardware changes, so it should remain the same even after reimaging.

Windows: WMIC CSProduct GET UUID
Linux: dmidecode -s system-uuid

MakoWish commented 1 year ago

Also for Mac:

ioreg -d2 -c IOPlatformExpertDevice | awk -F\" '/IOPlatformUUID/{print $(NF-1)}'
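The three commands above could be wrapped in a single helper, sketched below. The helper names (`COMMANDS`, `extract_uuid`, `system_uuid`) are made up for this example. It assumes the commands are on the PATH and, for `dmidecode`, that the process has enough privileges to read the DMI tables.

```python
import platform
import re
import subprocess
from typing import Optional

# Commands quoted in this thread, keyed by the value of platform.system().
# The second element is a marker a line must contain before it is searched.
COMMANDS = {
    "Windows": (["wmic", "csproduct", "get", "uuid"], ""),
    "Linux": (["dmidecode", "-s", "system-uuid"], ""),
    "Darwin": (["ioreg", "-d2", "-c", "IOPlatformExpertDevice"], "IOPlatformUUID"),
}

UUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def extract_uuid(output: str, marker: str = "") -> Optional[str]:
    """Return the first UUID found on a line containing `marker`, lowercased."""
    for line in output.splitlines():
        if marker in line:
            match = UUID_RE.search(line)
            if match:
                return match.group(0).lower()
    return None


def system_uuid() -> Optional[str]:
    """Run the platform's command and parse the hardware UUID from its output."""
    entry = COMMANDS.get(platform.system())
    if entry is None:
        return None
    command, marker = entry
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return extract_uuid(result.stdout, marker)
```

Parsing is kept separate from running the command so the extraction can be checked against captured output without touching real hardware.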

MakoWish commented 1 year ago

Any more progress on this one? This is one of the things preventing us from rolling out Elastic Agent to the masses. Manually finding and deleting duplicates any time a computer is reimaged (regular occurrence here) is not something I want to sign up for.

cmacknz commented 1 year ago

The work in https://github.com/elastic/elastic-agent/issues/2820 is progressing but not completed yet. The Fleet server side has already been done and validated in our internal scale testing.

This is moving, but not very fast since it isn't one of our highest priorities.

nimarezainia commented 1 year ago

> One of those cases is when a team member leaves the company, the computer they were using is pulled back, reimaged, and staged for deployment to a next hire. Since our devices are named according to their asset tags, the computer name remains the same. Because of this, reimaging the device, and subsequently reinstalling Elastic Agent, a duplicate will be created in Fleet. Since the device was wiped, a UUID being written to disk would also be wiped.

@MakoWish what version of the software are you running? In the case above (in the latest releases) you should not see duplicate agents in a "healthy" state. It's true that there will be multiple agents with the same host name in Fleet if the recycling happens as you describe it. However, these agents are different from the system's point of view and have different agent IDs. Perhaps you could verify this for us.

Once the old machine is decommissioned, Fleet Server will not receive any check-ins from it. That agent will eventually show up as offline and then go inactive (after a certain timeout). That instance of the agent will be filtered out of the main Fleet view. It is still there if someone needs to do some analysis, but it shouldn't be cluttering your view.

You can set the inactivity timeout as described here: https://www.elastic.co/guide/en/fleet/current/set-inactivity-timeout.html if you need to expedite transition from Offline --> Inactive

The state diagram HERE also describes the above transitions.
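The transitions described above can be pictured as a function of how long an agent has been silent. This is a simplified sketch and not Fleet's implementation. The `agent_status` function and both thresholds are illustrative parameters, not documented defaults.

```python
from datetime import datetime, timedelta


def agent_status(
    last_checkin: datetime,
    now: datetime,
    offline_after: timedelta,
    inactivity_timeout: timedelta,
) -> str:
    """Classify an agent by the time elapsed since its last check-in."""
    silent_for = now - last_checkin
    if silent_for >= inactivity_timeout:
        return "inactive"  # filtered out of the main view
    if silent_for >= offline_after:
        return "offline"   # still listed, but no recent check-in
    return "online"
```

Lowering `inactivity_timeout` shortens the window in which a decommissioned machine's old record still appears next to its replacement.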