[Fleet] Add feature to synchronize integrations on remote ES clusters

juliaElastic commented 2 months ago

Part of https://github.com/elastic/kibana/issues/187323

[ ] Add a remote output config to toggle synchronize integrations (see mockups below)
[ ] Add a remote output config to store remote cluster's kibana API as a secret field
[ ] Add a toggle and text field to Add/Edit output flyout to turn on synchronizing integrations and enter a remote kibana API key
[ ] Add UI instructions on how to create an API key in the remote Kibana, shown as a code block.
- Include in the instructions that the minimum privilege required to create an API key is manage_own_api_key
- Include in the instructions that the encoded value should be copied and pasted from the response
[ ] Add logic to package install/update/uninstall API handlers to take remote outputs where synchronization is turned on, and install/update/uninstall the integration on the remote cluster using /api/fleet/epm/packages/<pkgName>/<pkgVersion> API

API request to show in the instructions:

    POST /_security/api_key
    {
      "name": "integration_sync",
      "role_descriptors": {
        "integration_writer": {
          "cluster": [],
          "indices": [],
          "applications": [{
            "application": "kibana-.kibana",
            "privileges": ["feature_fleet.all", "feature_fleetv2.agent_policies_all"],
            "resources": ["*"]
          }]
        }
      }
    }
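
For illustration only, here is a minimal TypeScript sketch of the same request body plus the step of picking the `encoded` field out of the Elasticsearch response, which is the value the instructions ask the user to copy. The function names `buildIntegrationSyncApiKeyRequest` and `apiKeyToPaste` are hypothetical and not part of Fleet.

```typescript
// Shape of the request body shown above for POST /_security/api_key.
interface ApiKeyRequest {
  name: string;
  role_descriptors: Record<
    string,
    {
      cluster: string[];
      indices: unknown[];
      applications: Array<{
        application: string;
        privileges: string[];
        resources: string[];
      }>;
    }
  >;
}

// Subset of the fields Elasticsearch returns when an API key is created.
interface ApiKeyResponse {
  id: string;
  name: string;
  api_key: string;
  encoded: string;
}

// Hypothetical helper: builds the exact body from the issue description.
function buildIntegrationSyncApiKeyRequest(): ApiKeyRequest {
  return {
    name: 'integration_sync',
    role_descriptors: {
      integration_writer: {
        cluster: [],
        indices: [],
        applications: [
          {
            application: 'kibana-.kibana',
            privileges: ['feature_fleet.all', 'feature_fleetv2.agent_policies_all'],
            resources: ['*'],
          },
        ],
      },
    },
  };
}

// Hypothetical helper: the value the user should paste into the output form.
function apiKeyToPaste(response: ApiKeyResponse): string {
  return response.encoded;
}
```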

elasticmachine commented 2 months ago

Pinging @elastic/fleet (Team:Fleet)

jillguyonnet commented 3 weeks ago

Hi @juliaElastic @nimarezainia 👋 Can you give me some feedback on the UI implementation proposition below (text is WIP)? I thought it might make sense to put the new option after the service token as it is specific to remote ES outputs.

I also have a question: can you confirm whether using secret storage vs. plain text storage should be independent between the service token and API key inputs? i.e. could the user want to use secret storage for the secret token and plain text storage for the API key, or do we enforce that these should be aligned?

jillguyonnet commented 3 weeks ago

Another UI question: what do you think of offering the user the possibility to generate an API key with a click? We could also put the API request in a collapsible section for completeness. Something similar to the standalone agent onboarding flyout:

jillguyonnet commented 3 weeks ago

Third question 🙂 about this requirement:

Add logic to package install/update/uninstall API handlers to take remote outputs where synchronization is turned on, and install/update/uninstall the integration on the remote cluster using /api/fleet/epm/packages/<pkgName>/<pkgVersion> API

I think I can see two paths that potentially need to respect the sync, please correct me if I'm mistaken:

1. When the user updates/deletes a remote output with sync enabled -> if there are agent policies using the output, check which integrations are assigned to these and sync their assets on the remote cluster
2. When the user installs/uninstalls an integration assigned to an agent policy using a remote output with sync enabled -> sync the integration assets on the remote cluster

I think the requirement above describes the second path, but it also sounds from the discussion that the first one would make sense. WDYT?

Edit: actually, I think there is this path as well:

3. When the user creates/updates/deletes an agent policy that uses a remote ES output with sync enabled -> sync the integration assets on the remote cluster

cmacknz commented 3 weeks ago

I also have a question: can you confirm whether using secret storage vs. plain text storage should be independent between the service token and API key inputs? i.e. could the user want to use secret storage for the secret token and plain text storage for the API key, or do we enforce that these should be aligned?

We should always use secret storage and never plain text for either of these IMO. Plain text for secrets and API keys should be a viable choice in as few places as possible.

nimarezainia commented 3 weeks ago

Another UI question: what do you think of offering the user the possibility to generate an API key with a click? We could also put the API request in a collapsible section for completeness. Something similar to the standalone agent onboarding flyout:

if this can be done programmatically that would be great and much more simplified. This is being done on the remote Kibana correct? It can be done because you have the service token already?

nimarezainia commented 3 weeks ago

Third question 🙂 about this requirement:

Add logic to package install/update/uninstall API handlers to take remote outputs where synchronization is turned on, and install/update/uninstall the integration on the remote cluster using /api/fleet/epm/packages/<pkgName>/<pkgVersion> API

I think I can see two paths that potentially need to respect the sync, please correct me if I'm mistaken:
1. When the user updates/deletes a remote output with sync enabled
   -> if there are agent policies using the output, check which integrations are assigned to these and sync their assets on the remote cluster

2. When the user installs/uninstalls an integration assigned to an agent policy using a remote output with sync enabled
   -> sync the integration assets on the remote cluster
I think the requirement above describes the second path, but it also sounds from the discussion that the first one would make sense. WDYT?

Edit: actually, I think there is this path as well: 3. When the user creates/updates/deletes an agent policy that uses a remote ES output with sync enabled -> sync the integration assets on the remote cluster

@jillguyonnet yes I believe that these are all possible paths. I would say for the 3rd one though, I believe you have to remove the integrations from the policy before policy can be deleted.

I think the problem we are facing here is that outputs are independent of the Agent Policy, but Integrations are tied to the agent policy to some extent. I was thinking that if Remote ES is created with synchronization enabled, then all "installed" integrations would sync, regardless of whether they are part of a policy or not. This is also useful where we have reusable integrations. Agent policy deletion shouldn't have an effect on whether these integrations are installed remotely or not.

nimarezainia commented 3 weeks ago

I also have a question: can you confirm whether using secret storage vs. plain text storage should be independent between the service token and API key inputs? i.e. could the user want to use secret storage for the secret token and plain text storage for the API key, or do we enforce that these should be aligned?

We should always use secret storage and never plain text for either of these IMO. Plain text for secrets and API keys should be a viable choice in as few places as possible.

Can we change this now, @jillguyonnet? As in, make it always a secret. The main concern is that we are getting a service token to do stuff remotely, and that shouldn't be present in plain text at all.

Here we are relying on the user's permissions on the remote Kibana, we assume they are authorized which makes this palatable. Having the token in plain text circumvents that because users with less privilege may be looking at it.

jillguyonnet commented 3 weeks ago

Thanks for your feedback @nimarezainia - let me address the various points. cc @juliaElastic

Keeping the sync in the various paths

I was thinking that if Remote ES is created with synchronization enabled, then all "installed" integrations would sync, regardless of whether they are part of a policy or not. This is also useful where we have reusable integrations. Agent policy deletion shouldn't have an effect on whether these integrations are installed remotely or not.

This would actually greatly simplify the logic. So to make sure I understand: if I have integrations A, B and C installed on my main cluster and I create a new remote ES output with sync enabled, then I want to install A, B and C on the remote cluster, irrespective of which agent policies use the remote output (even if it's not used at all). And if I install/update/uninstall integration D on my main cluster, then I also want to install/update/uninstall it on my target cluster. Is that correct? Then, if I'm not missing anything, that would only mean handling the following paths on the main cluster:

- Creating a remote output with sync enabled
- Updating a remote output from sync disabled to sync enabled
- Updating a non-remote output to remote with sync enabled
- ~Deleting the last remote output with sync enabled (?)~
- Installing/updating/uninstalling an integration
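
As a rough sketch of the three output-related paths listed above, the check below returns true only when an output transitions into the "remote ES with sync enabled" state. The `OutputConfig` shape and the `sync_integrations` field name are assumptions made for illustration, not the actual Fleet saved-object schema.

```typescript
// Hypothetical, simplified output shape; `sync_integrations` is an assumed field name.
interface OutputConfig {
  type: 'elasticsearch' | 'remote_elasticsearch' | 'logstash' | 'kafka';
  sync_integrations?: boolean;
}

// True when the output is a remote ES output with integration sync turned on.
const isSyncingRemote = (output: OutputConfig | undefined): boolean =>
  output !== undefined &&
  output.type === 'remote_elasticsearch' &&
  output.sync_integrations === true;

// `previous` is undefined when the output is being created.
// Returns true for: create with sync enabled, sync disabled -> enabled,
// and non-remote -> remote with sync enabled.
function outputChangeTriggersSync(
  previous: OutputConfig | undefined,
  next: OutputConfig
): boolean {
  return isSyncingRemote(next) && !isSyncingRemote(previous);
}
```

Under these assumptions, saving an output that was already syncing does not trigger another sync; whether it should is a design choice the thread leaves open.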

API key generation by click

if this can be done programmatically that would be great and much more simplified. This is being done on the remote Kibana correct? It can be done because you have the service token already?

Yes, this is done on the remote cluster. I thought having a service token would allow that, but I'm actually not sure anymore. I'd need to confirm.

Secret storage

This is a great point, but I think it would be worth discussing and tracking it in a separate issue, as it potentially has a larger scope than just these fields. I've created https://github.com/elastic/kibana/issues/199347 for this purpose.

Other

FYI we'll also need to get the user to input the remote Kibana URL and store it. I tried to see if we could infer it from the ES URL, but it doesn't seem feasible in this situation.

cmacknz commented 2 weeks ago

Add logic to package install/update/uninstall API handlers to take remote outputs where synchronization is turned on, and install/update/uninstall the integration on the remote cluster using /api/fleet/epm/packages/<pkgName>/<pkgVersion> API

Have we documented the technical details of how the sync process is going to work yet? If not, can we?

For example, what happens if a package install completes in the main cluster and some of the remote clusters, but the Kibana instance restarts before all of the remote clusters are updated? How does the integration package eventually get synced into the remote clusters that are missing the integration?

jillguyonnet commented 2 weeks ago

Edit: details and better wording

I've opened https://github.com/elastic/kibana/pull/199978 with my WIP (see below for a screenshot). There are some open questions, so I'll centralise them here:

1. Regarding the logic of when we should sync integrations on remote. Based on the discussion so far, as I understand we want to install missing integrations when:

- Creating a remote ES output with sync enabled
- Updating a remote ES output from sync disabled to sync enabled
- Updating a non-remote ES output to remote ES with sync enabled
- Installing/updating an integration

In the current implementation, for the first 3 paths, the sync happens regardless of whether another remote ES output with integration sync enabled already exists. In other words, every time the user creates a remote ES output with sync enabled or updates an existing output to a remote ES output with sync enabled, then Kibana tries to sync the integrations. This is already working in https://github.com/elastic/kibana/pull/199978.

⚠ I'm encountering an issue with the 4th path: unless I'm missing something, it requires accessing a secret value (the API key) that is already stored as secret, something Kibana is not allowed to do. When creating/updating a remote ES output with sync enabled, the API key value is passed within the request, so it's usable, but from the package install flow it's already been stored as a secret. I'm not sure how to get around that (is this something that could be handled by Fleet Server?). @juliaElastic perhaps you would have some advice on that? I've added the entrypoints in https://github.com/elastic/kibana/pull/199978, but currently it fails because Kibana can't read the secret value.

2. Sync logic: as it is implemented now, the sync process installs (on the target cluster) the integrations that are installed on the source but not on the target cluster. Two questions I have on that are:

- Do we want to exclude some integrations from that logic (e.g. fleet-server)?
- It seems to me that it would be safer not to remove integrations on the target cluster when they are not installed on the source cluster, in case the user wants to keep them. That's why I haven't included deleting an integration in the 4th path above. Any thoughts on that?
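
To make the sync logic described in point 2 concrete, here is a small sketch that computes which integrations would be installed on the target: those present on the source, absent on the target, and not on an exclusion list. The function name, the `InstalledPackage` shape, and the exclusion set are illustrative assumptions; nothing is ever selected for removal, matching the behaviour described above.

```typescript
// Hypothetical minimal view of an installed integration package.
interface InstalledPackage {
  name: string;
  version: string;
}

// Returns the source packages that are missing on the target cluster.
// Packages whose name is in `excluded` are skipped. Packages that exist only
// on the target are left untouched (this sketch never uninstalls anything).
function missingOnTarget(
  source: InstalledPackage[],
  target: InstalledPackage[],
  excluded: ReadonlySet<string> = new Set<string>()
): InstalledPackage[] {
  const targetNames = new Set(target.map((pkg) => pkg.name));
  return source.filter((pkg) => !excluded.has(pkg.name) && !targetNames.has(pkg.name));
}
```

For example, with `nginx`, `system`, and `fleet_server` on the source, only `system` on the target, and `fleet_server` excluded, the result would contain just `nginx`.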

3. A naive question: it is possible to set up multiple hosts on a remote ES output, but only one service token. Thought I'd confirm we only want to allow one Kibana URL/API key combination.

4. Technical details of how the sync is going to work (https://github.com/elastic/kibana/issues/192361#issuecomment-2471637164): my implementation so far is lacking some retry logic (which I'm looking to add). Besides that, I'm not sure there is an aim to achieve eventual consistency in case of failure; however, https://github.com/elastic/kibana/issues/192363 is the next issue and will implement failure reporting in the Fleet UI. One thing we might perhaps want is a way to manually sync, e.g. a button (that is, if there is a way around the issue with reading the secret value I mentioned above)?

kpollich commented 2 weeks ago

@cmacknz - Regarding using secrets here. One key part of this that @jillguyonnet has run into above is that Kibana needs access to the service token in order to make requests to remote Kibanas and install integrations. We're using a "push" model here where the "source" Kibana makes outgoing API requests to install integrations on remote clusters, and we need the service token in order to authenticate those requests. In the current model, I don't think we can store this service token as a secret since we need to read it again during the install/update integration flow Jill mentioned above. Kibana needs to read this token, and Kibana does not have read access to output secrets today, so these two features are presently incompatible.
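
To illustrate why the push model needs a readable credential at install time, here is a sketch of the outgoing request the source Kibana would have to build for each remote cluster. It assumes authentication with the encoded API key from the issue description via an `Authorization: ApiKey ...` header; `buildRemoteInstallRequest` and `RemoteInstallRequest` are hypothetical names, and the sketch only builds a request description without sending anything.

```typescript
// Hypothetical description of an outgoing HTTP request.
interface RemoteInstallRequest {
  method: 'POST';
  url: string;
  headers: Record<string, string>;
}

// Builds the install request for one package on one remote Kibana.
// The credential must be available in clear text here, which is the
// incompatibility with secret storage described in the comment above.
function buildRemoteInstallRequest(
  kibanaUrl: string,
  encodedApiKey: string,
  pkgName: string,
  pkgVersion: string
): RemoteInstallRequest {
  const base = kibanaUrl.replace(/\/+$/, ''); // drop trailing slashes
  return {
    method: 'POST',
    url: `${base}/api/fleet/epm/packages/${encodeURIComponent(pkgName)}/${encodeURIComponent(pkgVersion)}`,
    headers: {
      Authorization: `ApiKey ${encodedApiKey}`,
      'kbn-xsrf': 'true', // required by Kibana for non-GET API requests
      'Content-Type': 'application/json',
    },
  };
}
```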

cmacknz commented 2 weeks ago

Besides that, I'm not sure there is an aim to achieve eventual consistency in case of failure

IMO eventual consistency is 100% a technical requirement; otherwise we haven't built synchronization, we have built "maybe do synchronization".

We're using a "push" model here where the "source" Kibana makes outgoing API requests to install integrations on remote clusters

A reason I am starting to ask about technical details is that I think eventual consistency would be much easier to achieve using a pull model where the remote clusters periodically look for updates from the main cluster. For an example of this working in practice, we get eventual consistency for Fleet policies because every agent periodically checks for a new policy.

I have also wondered if there are ways we can use CCR to help us here. We cannot use CCR for the system indices, but we could create a new standard index that represents a desired state or a set of API calls to make or integrations to install and replicate that. This would be closer to an event sourcing model which is harder to get right.

cmacknz commented 2 weeks ago

I am making an assumption that most users of this functionality are going to be larger enterprise customers or service providers trying to create multi-tenancy, and they are not going to tolerate partial synchronization or synchronization that only sort of works.

kpollich commented 2 weeks ago

I have also wondered if there are ways we can use CCR to help us here. We cannot use CCR for the system indices, but we could create a new standard index that represents a desired state or a set of API calls to make or integrations to install and replicate that. This would be closer to an event sourcing model which is harder to get right.

I'm leaning towards this being a better approach as well, but iirc from @juliaElastic's + @jillguyonnet's technical discovery here there were concerns about using CCR in this way. Looking around for comments to that effect is proving fruitless, though, so I may be misremembering.

To summarize the change in approach: we could replicate some standardized index (e.g. _fleet-synced-integrations) across clusters that contains the desired set of integrations to be installed, then have a background task in Kibana responsible for periodically ensuring consistency with that index. We could still have a workflow where we'd use a service token provided during output creation to set up the cross cluster replication of that index, but we wouldn't have a need to store the service token for future synchronization calls. Since we're requiring the user to set up CCR in a specific way for a "managed" index, it probably stands to reason that we should provide some kind of guardrail here or prevent management access to this index outside of the kibana_system user.
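
A minimal sketch of what one pass of that background task could compute, assuming the replicated index yields a list of desired integrations and the remote Kibana knows its installed versions. The types and the `reconcile` function are hypothetical. The key property is that the function is idempotent: running it again after the actions have been applied yields no further work, which is what gives eventual consistency across restarts and failed attempts.

```typescript
// Hypothetical document shape read from the replicated index.
interface DesiredIntegration {
  name: string;
  version: string;
}

// One unit of work for the remote Kibana to perform.
interface SyncAction {
  type: 'install' | 'update';
  name: string;
  version: string;
}

// Compares the desired state with what is installed (name -> version) and
// returns the actions needed to converge. Versions are compared for equality
// only; this sketch does not decide between upgrade and downgrade.
function reconcile(
  desired: DesiredIntegration[],
  installed: Map<string, string>
): SyncAction[] {
  const actions: SyncAction[] = [];
  for (const { name, version } of desired) {
    const current = installed.get(name);
    if (current === undefined) {
      actions.push({ type: 'install', name, version });
    } else if (current !== version) {
      actions.push({ type: 'update', name, version });
    }
  }
  return actions;
}
```

A periodic task would call `reconcile`, attempt each action, and simply try again on the next tick for anything that failed.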

jillguyonnet commented 2 weeks ago

Thanks both, it seems clear that the push model does not fit the expectations.

While it may only be a narrow first step, there is also an option to implement manual syncing (for example, a section within the Edit output flyout with inputs for a Kibana URL and API key + a button). The user would have to input the URL and API key every time (which don't get stored), which is not a great flow, but it would beat having to manually install integrations on the target cluster.

nimarezainia commented 2 weeks ago

To summarize the change in approach: we could replicate some standardized index (e.g. _fleet-synced-integrations) across clusters that contains the desired set of integrations to be installed, then have a background task in Kibana responsible for periodically ensuring consistency with that index. We could still have a workflow where we'd use a service token provided during output creation to set up the cross cluster replication of that index, but we wouldn't have a need to store the service token for future synchronization calls. Since we're requiring the user to set up CCR in a specific way for a "managed" index, it probably stands to reason that we should provide some kind of guardrail here or prevent management access to this index outside of the kibana_system user.

Team, thank you for this discussion. I think the pull-based approach would scale much better. Are we able to set up CCR for this index without user interaction?

Thinking out loud: CCR is Enterprise licensed, and this feature on our end was going to be Enterprise licensed as well, so what we are proposing is that all the clusters involved will need to be at the Enterprise license level in order to use the synchronization facility.

cmacknz commented 2 weeks ago

A CCR based approach also gets us past any limitations for deployments that use traffic filters https://www.elastic.co/guide/en/cloud/current/ec-enable-ccs.html#ec-ccs-ccr-traffic-filtering. This is a problem in the existing remote elasticsearch output, which is called out as a limitation in https://www.elastic.co/guide/en/fleet/current/remote-elasticsearch-output.html already.

CCR can get us actual consistency between clusters and bypasses network level restrictions from traffic filters, so hopefully there is no known reason we couldn't change the approach to be based on CCR.

This might require us to revisit the technical discovery phase here and do a dedicated prototype and redo the detailed design, but I'd rather take the extra time to build something that will work correctly in all situations than quickly release something we know has limitations that will require significant rework from the beginning.

nimarezainia commented 1 week ago

Discussing this issue with @strawgate, we were wondering if there are other options that allow us to avoid CCR. There are concerns around our solution depending on another feature where a cluster version mismatch could cause issues and, to a lesser extent, licensing differences etc.

Option 1) Could we perhaps embed the integration info (name + version) in the event collected by the agent? The agent has been configured correctly, as the integration is already installed and included in the agent policy on the management cluster.

On the local cluster that has received these events run a watcher once every minute or so to check which integrations have been referenced. Install/update the integrations on the local cluster.

Option 2) In the Agent Policy curate the list of integrations + versions (this in effect is already in the config). Upon receiving the agent policy from the management cluster, the agent writes the integration info into a well-known index on the local cluster that's being used for data. Watcher reads that index regularly to see which integrations to install locally.
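
As a sketch of the periodic check both options rely on, the function below collects the distinct (name, version) pairs referenced by a batch of documents, whether those are events (Option 1) or entries in the well-known index (Option 2). The `integration` field and its layout are purely hypothetical; agents do not necessarily emit such a field today.

```typescript
// Hypothetical reference to an integration embedded in a document.
interface IntegrationRef {
  name: string;
  version: string;
}

// Hypothetical document shape; documents without the field are ignored.
interface ReferencingDoc {
  integration?: IntegrationRef;
}

// Returns each distinct name+version pair once, in first-seen order.
function referencedIntegrations(docs: ReferencingDoc[]): IntegrationRef[] {
  const seen = new Map<string, IntegrationRef>();
  for (const doc of docs) {
    if (doc.integration === undefined) continue;
    const { name, version } = doc.integration;
    const key = `${name}@${version}`;
    if (!seen.has(key)) {
      seen.set(key, { name, version });
    }
  }
  return Array.from(seen.values());
}
```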

kpollich commented 1 week ago

From a planning perspective, this needs to be moved out of our sprint in its current state so we can continue defining it. I'm taking this off of @jillguyonnet's plate for now and removing it from our sprint. @nimarezainia let's discuss this one when we meet today.

cmacknz commented 1 week ago

There are ways to solve this without CCR, but I don't think they'll be able to work if there are traffic filters enabled in the clusters. This feature would have the same limitation the existing remote Elasticsearch output has.

cmacknz commented 1 week ago

Option 2) In the Agent Policy curate the list of integrations + versions (this in effect is already in the config). Upon receiving the agent policy from the management cluster, the agent writes the integration info into a well-known index on the local cluster that's being used for data. Watcher reads that index regularly to see which integrations to install locally.

This is using the agents to replicate the policy. The requirements and implications of this approach would be:

- The agent is given an API key with sufficient permissions to write to this index (or perhaps the Fleet API) by the management cluster.
- The agent would have to avoid starting the underlying components that ship data until it confirmed the target cluster was up to date with the latest policy. This avoids indexing data before the ingest pipelines, templates, and mappings are in place.
- This solution would require an API call per agent to synchronize the policy. This would lead to potentially 1000s of pointless API calls each time a new policy revision is created, because only the first agent API call would have an effect.
- There are edge cases to handle: agents making a synchronization attempt while one is in progress, synchronization failures, agents with different versions, and agents not all immediately having the latest policy revision.
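
The "only the first call has an effect" point can be sketched as an idempotent handler keyed by policy revision. `PolicySyncTracker` is a hypothetical, single-process illustration; it deliberately ignores the harder edge cases named above (concurrent attempts, failures, mixed agent versions).

```typescript
// Hypothetical tracker on the receiving side of agent-driven sync requests.
class PolicySyncTracker {
  private syncedRevision = 0;

  // Runs `doSync` only if `revision` is newer than the last synced revision.
  // Returns true when this call actually performed the sync.
  requestSync(revision: number, doSync: () => void): boolean {
    if (revision <= this.syncedRevision) {
      return false; // already synced by an earlier call, or a stale agent
    }
    doSync();
    this.syncedRevision = revision;
    return true;
  }
}
```

With 1000 agents all reporting revision 5, the sync work runs once and the remaining 999 calls are no-ops, which is the overhead the comment describes.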

There is a fair amount of complexity here, but it does eliminate the traffic filtering restriction and could give us a way to let standalone agents install integrations before indexing. In a way this is bringing back the Beats setup command but as something that happens automatically at startup.

I see that this approach could solve problems for us, but using agents to do this in this way is definitely not the first solution anyone would pick to this problem and we'd mostly be doing this to work around self-imposed infrastructure limitations (traffic filtering between deployments without whitelisting the full range of possible ESS deployment IPs).

kpollich commented 1 week ago

If we build this without CCR we'd be solving a bunch of distributed system problems that are already solved by CCR and have multiple Elasticsearch teams dedicated to solving them. We'd be signing on to maintain the complexity of solving those problems specifically for the ingest team's use case, and we probably won't be able to solve them as well.

I think the implementation here will be a lot more stable if we rely on CCR and "use the platform" as much as possible.
