[Fleet] Investigate side-effect of a space deletion

nchaulet commented 2 months ago

Description

As Fleet moves to become space-aware, some entities (agents, agent policies, uninstall tokens) will become space-aware. We should investigate the effect of a space deletion, and the potential ways to recover from it if there is an issue.

@elastic/kibana-security is there any hook available to react to a space deletion? Maybe to clean things up, or to prevent the deletion if there are enrolled Fleet agents?

elasticmachine commented 2 months ago

Pinging @elastic/fleet (Team:Fleet)

legrego commented 2 months ago

@elastic/kibana-security is there any hook available to react to a space deletion? Maybe to clean things up, or to prevent the deletion if there are enrolled Fleet agents?

We do not expose a hook today, but we can explore adding one (or something like it) if you can provide a set of detailed requirements.

For reference, the logic to delete a space is defined here: https://github.com/elastic/kibana/blob/c89ee65c7034ba26006e2d426156a6de11b3505f/x-pack/plugins/spaces/server/spaces_client/spaces_client.ts#L187-L196

This delegates to the deleteByNamespace function of the Saved Objects repository, which deletes saved objects belonging to the space, or "unshares" objects from the space if an object exists in multiple spaces: https://github.com/elastic/kibana/blob/c89ee65c7034ba26006e2d426156a6de11b3505f/packages/core/saved-objects/core-saved-objects-api-server-internal/src/lib/apis/delete_by_namespace.ts#L25-L83
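The delete-or-unshare behavior described above can be illustrated with a minimal in-memory sketch. This is not the Kibana implementation: the `SavedObjectDoc` shape and the `deleteByNamespaceSketch` function are hypothetical names used only to show the rule that an object living in a single space is deleted, while an object shared with other spaces is only removed from the deleted one.

```typescript
// Simplified, hypothetical model of a saved object; not the real Kibana type.
interface SavedObjectDoc {
  id: string;
  type: string;
  namespaces: string[]; // spaces the object currently belongs to
}

// Returns the objects that survive the deletion of `spaceId`.
function deleteByNamespaceSketch(
  docs: SavedObjectDoc[],
  spaceId: string
): SavedObjectDoc[] {
  const remaining: SavedObjectDoc[] = [];
  for (const doc of docs) {
    if (!doc.namespaces.includes(spaceId)) {
      // Object is not in the deleted space: left untouched.
      remaining.push(doc);
      continue;
    }
    const otherSpaces = doc.namespaces.filter((ns) => ns !== spaceId);
    if (otherSpaces.length > 0) {
      // Object is shared: "unshare" it from the deleted space only.
      remaining.push({ ...doc, namespaces: otherSpaces });
    }
    // Otherwise the object existed only in the deleted space and is dropped.
  }
  return remaining;
}
```

Anything that is not a saved object (for example, documents in Fleet's own indices) never passes through this path, which is why such documents can be left behind.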

nchaulet commented 2 months ago

Thanks @legrego. The issue for us is that we are introducing space awareness to non-saved-object documents, and those documents will become orphans if the space is deleted.

@nimarezainia what would be the ideal behavior here? A way to block space deletion when we have active agents in that space? Some migration to the default space?

nimarezainia commented 2 months ago

@legrego would you know what happens to other Kibana assets in a space when that space is deleted? Is there a warning of any sort?

@nchaulet I don't know if we should make a decision on the user's behalf in this regard (as in moving everything to the default space). Ideally we can detect that an agent policy is associated with the space being deleted and block the space deletion until all agent policies are moved out of the space or deleted. I think the admin who has the right access to delete the space then could make a decision on what should happen to the agent policies. Presumably this persona has a higher level of access.

legrego commented 2 months ago

would you know what happens to other Kibana assets in a space when that space is deleted? Is there a warning of any sort?

All saved objects within the space are deleted, or removed from the space. Any other assets are left untouched. We show a warning when deleting a space that all saved objects will be removed.

cmacknz commented 2 months ago

Ideally we can detect that an agent policy is associated with the space being deleted and block the space deletion until all agent policies are moved out of the space or deleted. I think the admin who has the right access to delete the space then could make a decision on what should happen to the agent policies.

+1 this seems like the best way to deal with this, but reading the prior discussion I don't think there is a way to implement this today.

The core problem is that there are Elastic Agents that continue to exist outside of a deleted space and become unmanageable or, in the case of Defend, potentially impossible to uninstall if the uninstall token was deleted along with the space (CC @ferullo).

nimarezainia commented 2 months ago

@legrego Looks like ideally we would need a hook in that space deletion path. Perhaps a way for other users (such as Fleet) to register their dependency on Spaces. Also the deletion to be halted if any of the registered functions indicate it shouldn't be deleted. What would you need from us on this to move forward? I'd imagine this affects almost everyone who has a Space dependency.
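A rough sketch of what such a registration mechanism might look like follows. Everything here is hypothetical: the Spaces plugin exposes no such hook today, and the names `SpaceDeletionChecks`, `register`, and `evaluate` are invented for illustration. The checks are synchronous only to keep the sketch short; a real check would most likely be asynchronous because it would need to query Elasticsearch.

```typescript
// Hypothetical API: none of these names exist in the Kibana Spaces plugin.
interface DeletionVerdict {
  allow: boolean;
  reason?: string; // explanation shown to the user when allow is false
}

type SpaceDeletionCheck = (spaceId: string) => DeletionVerdict;

class SpaceDeletionChecks {
  private readonly checks: Array<{ owner: string; check: SpaceDeletionCheck }> = [];

  // A plugin (for example Fleet) registers its dependency on Spaces.
  register(owner: string, check: SpaceDeletionCheck): void {
    this.checks.push({ owner, check });
  }

  // Runs every registered check and returns the objections, if any.
  // The caller decides whether an objection blocks deletion or only warns.
  evaluate(spaceId: string): string[] {
    const objections: string[] = [];
    for (const { owner, check } of this.checks) {
      const verdict = check(spaceId);
      if (!verdict.allow) {
        objections.push(`${owner}: ${verdict.reason ?? "no reason given"}`);
      }
    }
    return objections;
  }
}
```

Returning a list of objections rather than throwing leaves the policy question open: the same registry could back either a hard block or a confirmation warning.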

@kpollich @nchaulet this is probably a blocker for our project. What do you think?

legrego commented 2 months ago

Looks like ideally we would need a hook in that space deletion path. Perhaps a way for other users (such as Fleet) to register their dependency on Spaces.

Is this solely in support of the "Also the deletion..." clause below, or is there other functionality that you need this registration to support?

Also the deletion to be halted if any of the registered functions indicate it shouldn't be deleted.

Preventing space deletion is an aggressive measure and isn't something I can agree to without broader consideration (cc @rayafratkina @mwtyang @azasypkin @lukeelmers). I see benefit to warning users if Fleet indicates that other assets are impacted/degraded by the operation, but I'm not yet sold on preventing deletion.

we are introducing space awareness to non-saved-object documents, and those documents will become orphans if the space is deleted.

Is there a list of these non-SO assets that we can see to help guide our decision making? It would be helpful to understand:
1) How these assets are created
2) Who/what creates these assets
3) What privileges are required to CRUD these assets
4) Where these assets reside (e.g. if something is stored in a Fleet system index, Kibana system index, or is an implementation detail of ES, etc.)

nchaulet commented 2 months ago

Is there a list of these non-SO assets that we can see to help guide our decision making? It would be helpful to understand:

Sure, I can provide this:

.fleet-enrollment-tokens: the enrollment token for an agent policy; created by a user from Kibana; requires Fleet:Agents:All privileges to access
.fleet-policies: created by a user from Kibana; not readable from Kibana
.fleet-agents: the record for an agent policy; created by fleet-server; readable from the UI with Fleet:Agents:Read privileges
.fleet-actions, .fleet-actions-results: created by a user from Kibana and by fleet-server; readable from the UI with Fleet:Agents:Read

cmacknz commented 2 months ago

Fleet is a remote management UI. The biggest non-shared objects I am concerned about are Elastic Agents, which live completely outside of the stack.

Deleting a space and deleting the internal state without going through the intended UX for un-managing or uninstalling an agent will not work well and users are unlikely to understand the consequences of it.

We would not intentionally build a button into Fleet's UI that mass-deletes Fleet's internal state with no warning or protection for the user. We are worried that, with space awareness, we have unintentionally created exactly that via space deletion, and we want to eliminate it.

nchaulet commented 1 month ago

@legrego is blocking space deletion when a user has enrolled agents in that space something we can envisage? It would really solve our use case and keep users from ending up in an unsolvable situation.

@cmacknz @kpollich Thinking out loud here: otherwise we could probably come up with a hacky solution. As the problematic saved object here is the uninstall token, we could make that SO space-agnostic and do the filtering by space manually (not using the built-in saved object space mechanism, but using our own fields for that). This way, if a space is deleted and recreated, the user will still have access to their active agents and uninstall tokens (they will lose their policies).

legrego commented 1 month ago

is blocking space deletion when a user has enrolled agents in that space something we can envisage? It would really solve our use case and keep users from ending up in an unsolvable situation.

Sorry for the delay. Based on what you've shared, this doesn't feel outside the realm of possibility. Let me discuss with the folks I pinged above, and we'll get back to you.

legrego commented 1 month ago

I discussed with @lukeelmers, @bitzandeb, and @rayafratkina today. We propose taking a progressive approach, rather than immediately moving forward with blocking space deletion.

Could we instead start by showing a warning that lists the enrolled agents impacted by this operation and explains the consequences of deleting a space with enrolled agents? If we wanted to get fancy, we could also allow sufficiently authorized administrators to perform the unenrollment from this warning step.

If we learn that this warning is not sufficient for our users, then we could discuss other measures, such as blocking deletion.

nimarezainia commented 1 month ago

Thanks @legrego. We can't really list all the agents in this manner (can be in the 10s of thousands) but could certainly give a summary snapshot (like total of X agents in Y many policies).
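The summary snapshot could be computed along these lines. This is a sketch under assumptions: `AgentRecord`, its `policyId` and `namespace` fields, and both function names are hypothetical, and at the scale mentioned (tens of thousands of agents) the counts would more plausibly come from an Elasticsearch aggregation than from records loaded into memory.

```typescript
// Hypothetical, simplified agent record used only for this illustration.
interface AgentRecord {
  id: string;
  policyId: string;
  namespace: string; // the space the agent is assumed to belong to
}

interface SpaceAgentSummary {
  agentCount: number;
  policyCount: number;
}

// Counts the agents in a space and the distinct policies they are assigned to.
function summarizeAgentsInSpace(
  agents: AgentRecord[],
  spaceId: string
): SpaceAgentSummary {
  const policyIds = new Set<string>();
  let agentCount = 0;
  for (const agent of agents) {
    if (agent.namespace !== spaceId) continue;
    agentCount += 1;
    policyIds.add(agent.policyId);
  }
  return { agentCount, policyCount: policyIds.size };
}

// Builds the short text a deletion warning could display.
function formatDeletionWarning(summary: SpaceAgentSummary): string {
  return `This space contains ${summary.agentCount} enrolled agents in ${summary.policyCount} agent policies.`;
}
```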

The user would have a choice to Proceed or Cancel with the warning given, correct? To me this is pretty much blocking the deletion (albeit by the user and not us). I think we can pursue this as a first option. We should strongly urge the user to either delete or move agents to another policy before doing this. The concern however is that an unsuspecting user may just click "continue" (as we are all accustomed to do) and cause a lot of pain.

@cmacknz @kpollich @nchaulet WDYT?

nchaulet commented 1 month ago

The concern however is that an unsuspecting user may just click "continue" (as we are all accustomed to do) and cause a lot of pain.

I think if we go this way, it may be interesting to make the unenrollment token non-space-aware, so we have a recovery scenario for SDHs.

cmacknz commented 1 month ago

What consequences does keeping the unenrollment tokens global have? Doing that would eliminate the worst case scenario of users being unable to uninstall agents that can't be managed because the space was deleted.

@nchaulet if someone deletes a space, are the agent API keys still valid? If the agents keep checking in perhaps we can have some way to reassign them into a space that still exists, even if this is via an API call as a recovery mechanism for support in case this happens accidentally.

nchaulet commented 1 month ago

@nchaulet if someone deletes a space, are the agent API keys still valid? If the agents keep checking in perhaps we can have some way to reassign them into a space that still exists, even if this is via an API call as a recovery mechanism for support in case this happens accidentally.

Yes, the API keys will still be valid, and the agents will be visible in the UI if the user creates the space again. The policy will not be visible again, as it is stored in a saved object and will be deleted during the space deletion.
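Because the agents keep their valid API keys, a recovery tool could in principle detect agents whose space no longer exists and move them to one that does. The sketch below only illustrates that idea on in-memory data; `FleetAgentDoc`, its `namespaces` field, and both function names are assumptions, not an existing Fleet API.

```typescript
// Hypothetical, simplified agent document; field names are assumptions.
interface FleetAgentDoc {
  id: string;
  namespaces: string[]; // spaces the agent is assumed to be assigned to
}

// An agent is treated as orphaned when none of its spaces still exist.
// An agent with an empty namespaces list is also reported by this sketch.
function findOrphanedAgents(
  agents: FleetAgentDoc[],
  existingSpaceIds: string[]
): FleetAgentDoc[] {
  return agents.filter(
    (agent) => !agent.namespaces.some((ns) => existingSpaceIds.includes(ns))
  );
}

// Returns copies of the orphaned agents assigned to a surviving space.
function reassignAgents(
  orphans: FleetAgentDoc[],
  targetSpaceId: string
): FleetAgentDoc[] {
  return orphans.map((agent) => ({ ...agent, namespaces: [targetSpaceId] }));
}
```

Note that reassigning the agent record alone would not bring back the agent policy, which is deleted with the space.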

nchaulet commented 1 month ago

What consequences does keeping the unenrollment tokens global have? Doing that would eliminate the worst case scenario of users being unable to uninstall agents that can't be managed because the space was deleted.

We would not be able to use the built-in saved object mechanism to filter per space and would have to build our own (that should be doable, as that saved object is used in only a few places). The saved object would not be deleted during space deletion, so we would have a recovery scenario: recreate the space to access the uninstall token.
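A minimal sketch of that application-level filtering is shown below, assuming the token document carries its own space field maintained by Fleet. The `UninstallTokenDoc` shape, the `fleetNamespaces` field, and `tokensVisibleInSpace` are hypothetical names, not the actual Fleet implementation.

```typescript
// Hypothetical token document stored as a space-agnostic saved object.
interface UninstallTokenDoc {
  id: string;
  policyId: string;
  // Assumed custom field owned by Fleet, not the saved object framework,
  // so a space deletion does not remove or rewrite it.
  fleetNamespaces: string[];
}

// Filtering done by the application instead of the saved object framework.
function tokensVisibleInSpace(
  tokens: UninstallTokenDoc[],
  currentSpaceId: string
): UninstallTokenDoc[] {
  return tokens.filter((token) => token.fleetNamespaces.includes(currentSpaceId));
}
```

Since the documents survive the deletion, recreating a space with the same id would make the matching tokens visible again through the same filter.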

nchaulet commented 3 weeks ago

@kpollich @cmacknz I am in the process of moving to multiple saved objects, and I would like to move the discussion forward on making unenrollment tokens global (and doing the namespace filtering outside of the saved object framework).

Having global unenrollment tokens will enable a recovery scenario: if a user recreates a deleted space, they will see the previously created unenrollment tokens and enrolled agents, and have a way to unenroll them.

kpollich commented 3 weeks ago

I'm in agreement that we should make unenrollment tokens global as a recovery mechanism, then apply filtering at the application level. Recreating the deleted space is a good thing to keep in mind, but honestly if the unenrollment tokens are fetchable via dev tools after the space has been deleted that will probably be good enough as a recovery mechanism.

nimarezainia commented 3 weeks ago

but honestly if the unenrollment tokens are fetchable via dev tools after the space has been deleted that will probably be good enough as a recovery mechanism.

Especially if we give ample warning of the consequences before the user deletes the space. We can certainly document the recovery aspects of this.

elastic / kibana

[Fleet] Investigate side-effect of a space deletion #184864
