elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Fleet] Improve recoverability and stability of package installation #169147

Open kpollich opened 11 months ago

kpollich commented 11 months ago

Meta issue tracking the work for recoverability and stability of package installation

Currently, it's hard to recover from a failed package installation, as our only recourse is typically to reinstall the package. There are no granular recovery steps we can take, and we often lack visibility into which particular steps failed. Ideally, we'd build a more state-machine-like implementation for package installation, with specific recovery steps for each state transition along the way.
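To make the "state machine" idea concrete, here is a minimal TypeScript sketch. It is only an illustration: the intermediate state names (`installing_assets`, `installing_templates`), the `InstallRecord` shape, and the `advance`/`fail`/`recover` helpers are all hypothetical and do not reflect Fleet's actual install code. The point is that a failure records which state it came from, so recovery can return to that state instead of starting over.

```typescript
// Hypothetical states; the real install steps are not enumerated in this issue.
type InstallState =
  | "not_installed"
  | "installing_assets"
  | "installing_templates"
  | "installed"
  | "failed_install";

interface InstallRecord {
  state: InstallState;
  // State the installation was in when it failed, if it failed.
  failedFrom?: InstallState;
}

// Allowed forward transition for each non-terminal state.
const NEXT: Partial<Record<InstallState, InstallState>> = {
  not_installed: "installing_assets",
  installing_assets: "installing_templates",
  installing_templates: "installed",
};

// Move to the next state, or throw if the current state has no successor.
function advance(record: InstallRecord): InstallRecord {
  const next = NEXT[record.state];
  if (!next) {
    throw new Error(`No transition out of ${record.state}`);
  }
  return { state: next };
}

// Mark the installation as failed, remembering where it failed.
function fail(record: InstallRecord): InstallRecord {
  return { state: "failed_install", failedFrom: record.state };
}

// Recovery step: go back to the state that failed rather than to the start.
function recover(record: InstallRecord): InstallRecord {
  if (record.state !== "failed_install" || !record.failedFrom) {
    throw new Error("Nothing to recover");
  }
  return { state: record.failedFrom };
}
```

With a record like this persisted alongside the package, a failed install would carry enough information to show which step failed and to retry only from there.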

Ref https://github.com/elastic/kibana/issues/166857
Ref https://github.com/elastic/kibana/issues/166798

elasticmachine commented 11 months ago

Pinging @elastic/fleet (Team:Fleet)

criamico commented 11 months ago

Adding some considerations as discussed with @kpollich

We could start by looking at the specific steps covered by the installation process and documenting them. We have a complex state machine, and we go through those steps (and a lot of side effects) every time an integration is installed, but it isn't documented anywhere and the whole install process is a little opaque.

This brings me to the second point: whenever an integration ends up in a bad state (like failed_install), we have no way to restart from the failed step; we have to force the whole installation again. As highlighted in this comment, we could even implement retries on those steps, but currently we don't have step-level granularity at all. It's just a single endpoint, and what we usually ask users to do is this:

# Force uninstall
DELETE kbn:api/fleet/epm/packages/<integration>/<version>
{
  "force": true
}

# Force reinstall
POST kbn:api/fleet/epm/packages/<integration>/<version>
{
  "force": true
}
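For contrast with the all-or-nothing uninstall/reinstall above, here is a rough TypeScript sketch of what step-level retries and "restart from the failed step" could look like. Everything in it is an assumption for illustration: the `Step` and `RunResult` shapes, the `runSteps` helper, and its `resumeFrom`/`maxAttempts` options are hypothetical, and the steps are synchronous only to keep the example short (real install steps would be asynchronous).

```typescript
// Hypothetical step shape: a named unit of work that throws on failure.
interface Step {
  name: string;
  run: () => void;
}

interface RunResult {
  ok: boolean;
  failedStep?: string; // name of the step that exhausted its attempts
  completed: string[]; // steps that succeeded during this run
}

function runSteps(
  steps: Step[],
  opts: { resumeFrom?: string; maxAttempts?: number } = {}
): RunResult {
  const maxAttempts = opts.maxAttempts ?? 3;
  // Start at the named step when resuming, otherwise at the beginning.
  const startIndex = opts.resumeFrom
    ? steps.findIndex((s) => s.name === opts.resumeFrom)
    : 0;
  if (startIndex < 0) {
    throw new Error(`Unknown step: ${opts.resumeFrom}`);
  }

  const completed: string[] = [];
  for (const step of steps.slice(startIndex)) {
    let succeeded = false;
    // Retry the same step up to maxAttempts times before giving up.
    for (let attempt = 1; attempt <= maxAttempts && !succeeded; attempt++) {
      try {
        step.run();
        succeeded = true;
      } catch {
        // Swallow the error and fall through to the next attempt.
      }
    }
    if (!succeeded) {
      return { ok: false, failedStep: step.name, completed };
    }
    completed.push(step.name);
  }
  return { ok: true, completed };
}
```

A caller could persist `failedStep` from a failed run and later pass it back as `resumeFrom`, so earlier steps and their side effects are not repeated.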

The third consideration is that we could maybe reuse the new input template endpoint to simplify the installation process. The endpoint currently returns only the inputs, but we could reuse part of its logic and add an endpoint under the same namespace that returns the rest of the integration info, simplifying the whole install flow.

criamico commented 8 months ago

Adding some comments per discussion with @nchaulet:

criamico commented 7 months ago

@kpollich @nchaulet I converted this ticket to a "meta" one and wrote some more tickets based on our discussion. Feel free to comment/update as needed.

criamico commented 1 month ago

@kpollich I split the items in phase 1 and added some further details in the descriptions as we discussed recently.

nimarezainia commented 9 hours ago

@kpollich What should we do with the remaining issues here? Should we split them into another meta issue to be dealt with later?