[Fleet] Block Kibana startup for Fleet setup completion

joshdover commented 2 years ago

To provide a smoother upgrade experience for users of Fleet, we would like to block Kibana startup until the necessary ingest assets have been installed into Elasticsearch. The purpose of blocking is to give admins a clear point in time at which it's safe to start upgrading other components that depend on Fleet ingest assets being upgraded, such as Fleet Server and Elastic Agent.

In order to make this change as non-disruptive as possible, we want to ensure:

Fleet setup is as reliable as possible
Fleet setup is as fast as possible

To that end, before we start blocking Kibana startup we need to complete the following tasks:

[x] https://github.com/elastic/kibana/issues/118423
[x] https://github.com/elastic/kibana/issues/125097
[x] https://github.com/elastic/kibana/issues/108456
- This allows us to improve setup performance by not installing any packages on the initial setup. Instead we'll only need to install base Fleet assets (like our shared component template and ingest pipeline), create the default Elasticsearch output, and upgrade packages that UIs depend on, such as APM, Synthetics, and Endpoint.
[ ] https://github.com/elastic/kibana/issues/108993 - @elastic/kibana-core
[x] https://github.com/elastic/kibana/issues/121639

The scope of this issue should make the following changes in a single PR:

[ ] Integrate with new blocking task API introduced to Kibana Core in #108993
- Add a xpack.fleet.setup.max_retries config that defaults to 5
- Attempt to run Fleet setup in this hook up to max_retries, then throw exception and crash Kibana
- Ensure that setup failures include adequate logging in default logging configuration
[ ] https://github.com/elastic/kibana/issues/120237
[ ] Remove calls to /api/fleet/setup API from UI code
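The retry behavior described in the checklist above could be sketched roughly as follows. This is only an illustration: `runSetupWithRetries`, `SetupRetryOptions`, and the `wait` and `log` hooks are hypothetical names, not Kibana Core or Fleet APIs, and the sketch assumes `max_retries` counts total attempts rather than retries after a first attempt.

```typescript
// Hypothetical sketch: none of these names are real Kibana or Fleet APIs.
interface SetupRetryOptions {
  // Would be read from xpack.fleet.setup.max_retries (proposed default: 5).
  maxRetries: number;
  // Optional hook to pause between attempts (for example, a backoff timer).
  wait?: (attempt: number) => Promise<void>;
  // Optional logger so failures are visible in the default logging config.
  log?: (message: string) => void;
}

// Runs `setup` until it succeeds or `maxRetries` attempts have failed.
// Resolves with the number of attempts used; rejects after the last failure,
// which in the proposal would crash Kibana.
async function runSetupWithRetries(
  setup: () => Promise<void>,
  { maxRetries, wait, log = () => {} }: SetupRetryOptions
): Promise<number> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      await setup();
      return attempt;
    } catch (err) {
      lastError = err;
      log(`Fleet setup attempt ${attempt}/${maxRetries} failed: ${String(err)}`);
      // Only pause if another attempt will follow.
      if (wait && attempt < maxRetries) {
        await wait(attempt);
      }
    }
  }
  throw new Error(
    `Fleet setup failed after ${maxRetries} attempts: ${String(lastError)}`
  );
}
```

Logging every failed attempt, not just the final one, is one way to satisfy the "adequate logging" item, since an admin looking at a crashed Kibana would see each underlying error.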

Optionally, we'd like to improve package install and upgrade performance to minimize the impact of blocking Kibana startup. While this is unlikely to be considered a blocker, slow installs may make upgrades more painful or confusing to users. The primary bottleneck in this process is in Elasticsearch, which https://github.com/elastic/elasticsearch/issues/77505 may solve.

elasticmachine commented 2 years ago

Pinging @elastic/fleet (Team:Fleet)

joshdover commented 2 years ago

cc @elastic/kibana-core @kobelb Please let us know if there are any other items you'd like to see before we start blocking Kibana boot.

joshdover commented 2 years ago

@kpollich @nchaulet I've finished fleshing out the scope above, anything you think I'm missing for this change?

kpollich commented 2 years ago

@joshdover LGTM!

juliaElastic commented 2 years ago

@joshdover

Remove calls to /api/fleet/setup API from UI code

Does this mean setup will only run once during kibana startup? I'm asking because ensuring preconfiguration is called during fleet setup, so if we only call it once, it makes this discussion irrelevant: https://github.com/elastic/kibana/issues/124004

juliaElastic commented 2 years ago

How much risk is it to schedule this task at the same iteration as the prerequisite in kibana-core?

Also, what is the impact on downstream dependencies like fleet-server, e2e-testing? Do they have to change anything? E.g. they are calling Fleet setup API to check that Fleet is ready

What happens for users who start with Kibana installation without Fleet, and enable Fleet later? Is Fleet setup going to run then? We should test this in cloud/on-prem.

joshdover commented 2 years ago

How much risk is it to schedule this task at the same iteration as the prerequisite in kibana-core?

Kibana Core has this scheduled in their current sprint which ends in ~1.5 weeks and a rough plan has been agreed upon. I suspect they won't be a blocker for much longer.

Also, what is the impact on downstream dependencies like fleet-server, e2e-testing? Do they have to change anything? E.g. they are calling Fleet setup API to check that Fleet is ready

This is designed to not break the setup API. Though calling that API won't be necessary any longer (just verifying Kibana is healthy is good enough now), we won't be removing the setup API at this point (but we should consider deprecating it in a later release).
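For downstream consumers, "verifying Kibana is healthy" could look something like the sketch below. It assumes the caller fetches Kibana's status API (`GET /api/status`) and that the JSON response carries an overall level at `status.overall.level`, as in the 8.x response format. `isKibanaReady`, `waitForKibana`, and the injected `fetchStatus` and `pause` functions are hypothetical names for illustration, and treating only `available` as ready is an assumption.

```typescript
// Hypothetical sketch of a downstream readiness check; the payload shape is
// an assumption based on the 8.x status API response.
interface StatusPayload {
  status?: { overall?: { level?: string } };
}

// Treat Kibana as ready only when the overall level is "available".
function isKibanaReady(payload: StatusPayload): boolean {
  return payload.status?.overall?.level === "available";
}

// Polls until Kibana reports ready or `maxPolls` is exhausted.
// `fetchStatus` is injected so the caller decides how to reach /api/status;
// a rejected fetch (e.g. connection refused while Kibana is still blocked on
// setup) counts as "not ready yet".
async function waitForKibana(
  fetchStatus: () => Promise<StatusPayload>,
  maxPolls: number,
  pause: () => Promise<void> = async () => {}
): Promise<boolean> {
  for (let i = 0; i < maxPolls; i++) {
    try {
      if (isKibanaReady(await fetchStatus())) {
        return true;
      }
    } catch (err) {
      // Kibana is not serving HTTP yet; fall through and poll again.
    }
    await pause();
  }
  return false;
}
```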

What happens for users who start with Kibana installation without Fleet, and enable Fleet later? Is Fleet setup going to run then? We should test this in cloud/on-prem.

In 8.0 we started running Fleet setup when Kibana starts up, regardless of whether or not the user is using Fleet. The change in this issue only impacts one aspect, which is to block Kibana's HTTP server from serving traffic until Fleet setup has completed. This is necessary to ensure that users and orchestration layers do not upgrade Elastic Agent instances until first-party, Stack-aligned Fleet packages (e.g. APM, Synthetics) are upgraded. If Elastic Agent or Fleet Server is upgraded before these packages are upgraded, new fields may be ingested with incorrect mappings, breaking the related application UIs.

In summary, this change doesn't change what we're setting up or how we decide to do it, it only improves the robustness of the ingestion layers in the Stack.

Also worth noting that since we removed the default policies, Fleet setup does not install any packages or create agent policies in the default self-managed configuration.

joshdover commented 2 years ago

Does this mean setup will only run once during kibana startup? I'm asking because ensuring preconfiguration is called during fleet setup, so if we only call it once, it makes this discussion irrelevant: https://github.com/elastic/kibana/issues/124004

It will run every time a Kibana node is restarted, so it's not exactly only once, but it won't be triggered multiple times in a single run of Kibana unless the user manually calls the setup API. I think since it will still run on each startup, #124004 is still relevant?

juliaElastic commented 2 years ago

It will run every time a Kibana node is restarted, so it's not exactly only once, but it won't be triggered multiple times in a single run of Kibana unless the user manually calls the setup API. I think since it will still run on each startup, #124004 is still relevant?

Sounds right, so we might have to ask users to restart kibana to refresh/fix the preconfigured policies

joshdover commented 2 years ago

Sounds right, so we might have to ask users to restart kibana to refresh/fix the preconfigured policies

Yes, or use the API manually. I also think we may see this issue go away or resolve itself with the addition of bundled packages used for preconfiguration.

joshdover commented 2 years ago

I've been thinking about this more and I think we should consider delaying this change until 8.3 or later. We've had a lot of changes in Fleet's setup logic in the past several releases and I think we could benefit from having additional time to make improvements in testing and getting feedback from customers, support & Cloud before we move forward with blocking Kibana's startup. Blocking Kibana startup carries a high weight of responsibility since anything that goes wrong in our setup code will make the entire UI unusable.

The main motivation for this change is to improve reliability of Stack upgrades by making it more obvious to sysadmins that they should not upgrade ingest components (specifically, Fleet Server and Elastic Agent) until after Kibana has upgraded the necessary ingest assets, to avoid breaking ingest for Elastic Agent monitoring, Synthetics, and APM. Endpoint is not affected since Agent downloads the version that corresponds to the integration package version. I also believe Synthetics is not affected in practice because Heartbeat has not traditionally had any breaking changes in schema.

This change will not actually enforce that sysadmins do not upgrade Fleet Server or Elastic Agent before Kibana has upgraded the integration packages, so it is not a guaranteed improvement, but a probabilistic one. The window of time between Kibana's UI being available and Fleet setup completing is quite small (we're talking ~20s in the average case, possibly 5 minutes in the worst case) so the window of time a user could hit this scenario is narrow.

It's also worth noting that this change will affect all users of the Stack, whether or not they're using Fleet yet. I think the risk of breaking Kibana for this large population is higher than the risk we're currently taking on by having this small window of time open where it's not obvious to a sysadmin whether or not they can start upgrading the ingest components.

I'd like some feedback from affected parties, namely:

APM Server - @simitt
Synthetics - @andrewvc
Kibana Core - @lukeelmers
Elastic Agent control plane - @ph

ph commented 2 years ago

@joshdover +1 to delay it. The changes in the setup logic have not been easy, and we will improve. Could we prioritize an automated test on Cloud for 8.2? This will help us iterate on 8.3.

lukeelmers commented 2 years ago

@joshdover +1 from me to delay as well, and revisit the need for it later.

In general we'd like to avoid opening up a mechanism like this from core unless it is a last resort. As you mention, this is something which would affect all stack users... so I'd prefer to exhaust all other possible options before taking this step.

simitt commented 2 years ago

APM Server has implemented a check for 8.0 to verify that the installed apm packages are version aligned with the apm-server, and only starts sending data to ES once this requirement is fulfilled. This was necessary to not already run into problems in 8.0/8.1. No need for blocking the startup anymore from apm side (cc @elastic/apm-server ).
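As a rough illustration of the kind of gate described above (not APM Server's actual implementation, which is not shown in this thread), a version-alignment check might look like the sketch below. It assumes "version aligned" means the installed package and the server share the same major.minor version; `majorMinor` and `isPackageAligned` are hypothetical names.

```typescript
// Hypothetical sketch: the alignment rule (same major.minor) is an assumption.

// Extracts [major, minor] from a version string such as "8.1.0".
function majorMinor(version: string): [number, number] | undefined {
  const match = /^(\d+)\.(\d+)(?:\.|$)/.exec(version);
  return match ? [Number(match[1]), Number(match[2])] : undefined;
}

// Returns true only when a package is installed and its major.minor matches
// the server's. A caller would hold back ingestion while this is false.
function isPackageAligned(
  serverVersion: string,
  installedPackageVersion: string | undefined
): boolean {
  if (installedPackageVersion === undefined) {
    return false; // package not installed yet
  }
  const server = majorMinor(serverVersion);
  const pkg = majorMinor(installedPackageVersion);
  return (
    server !== undefined &&
    pkg !== undefined &&
    server[0] === pkg[0] &&
    server[1] === pkg[1]
  );
}
```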

joshdover commented 2 years ago

No need for blocking the startup anymore from apm side

@simitt To be clear, do we not ever need this or is it just not as pressing? My understanding was that it's possible APM could get backed up and start dropping traces if there's a delay in completing Fleet setup for some reason. By blocking, we'd be able to more easily enable orchestration layers like Cloud and ECK to delay upgrading Agents or APM Server until Fleet setup has completed in Kibana.

Alternatively, we could start publishing a degraded or unavailable health status from Kibana's status API. However, today Cloud does not use this endpoint to decide when to start upgrading the other components, so we'd need to make some changes there.

axw commented 2 years ago

Silvia is out sick, so I'll take a stab at answering.

The check that Silvia mentioned in https://github.com/elastic/kibana/issues/120616#issuecomment-1040247540 does mean that ingestion will be blocked and start dropping data if installing/upgrading the APM package doesn't happen in a timely manner. I think we would still like this eventually, but not as urgently.
