Pinging @elastic/kibana-platform (Team:Platform)
Also related to https://github.com/elastic/kibana/issues/49764
One major change we have discussed recently, and something we investigated during the development of the first Saved Object migration, is not running them on startup at all. Let me explain.
We would continue to write migrations; however, those migrations would not be run on startup. They would instead be applied when objects are read or written. The reason we decided against this the first time around was that one of the main things we wanted to achieve was the ability to change a field's mapping type. The difference now is that we have the task manager, which can process the re-writes in the background one-by-one while Kibana is running, and we could surface the status in the UI. For most migrations (I have provided an audit here) there would only be a negligible performance regression while the migrations are not persisted. One thing we would need to enforce is that any object can be read and then immediately written back. So if you're changing a field's mapping type, you would need to use a new property and handle that in the implementation of your search (this should be very rare).
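To make the idea concrete, here's a rough sketch of what migrate-on-read could look like. All of the names (RawDoc, migrateOnRead, the registry shape, the version comparison) are invented for illustration and don't reflect the actual saved objects implementation:

```ts
// Rough sketch only: RawDoc, the migration registry shape and the version
// comparison are simplified, not the actual Kibana implementation.
interface RawDoc {
  id: string;
  type: string;
  migrationVersion: Record<string, string>;
  attributes: Record<string, unknown>;
}

type MigrationFn = (doc: RawDoc) => RawDoc;

// Migrations registered per saved object type, keyed by the version that introduced them.
const migrations: Record<string, Record<string, MigrationFn>> = {
  dashboard: {
    '7.7.0': (doc) => ({
      ...doc,
      attributes: { ...doc.attributes, newField: 'some default' },
    }),
  },
};

// Naive semver-ish comparison, good enough for the sketch.
const newerThan = (a: string, b: string) =>
  a.localeCompare(b, undefined, { numeric: true }) > 0;

// Applied whenever a document is read; the migrated shape is only persisted
// the next time the object is written (or by a background task).
function migrateOnRead(doc: RawDoc): RawDoc {
  const typeMigrations = migrations[doc.type] ?? {};
  return Object.keys(typeMigrations)
    .sort((a, b) => a.localeCompare(b, undefined, { numeric: true }))
    .filter((version) => newerThan(version, doc.migrationVersion[doc.type] ?? '0.0.0'))
    .reduce(
      (acc, version) => ({
        ...typeMigrations[version](acc),
        migrationVersion: { ...acc.migrationVersion, [doc.type]: version },
      }),
      doc
    );
}
```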
One benefit of this is that if a migration fails for a single object, we simply don't migrate that object and only that object is affected. We can then send that information to Pulse to be alerted on.
Another thing migrations set out to do was ensure we have the correct mappings, since users used to frequently tamper with them. This will be mitigated by the mappings living in an Elasticsearch plugin and not being manageable by the user.
I just stumbled across this issue while searching for something else, but noticed that this bit might need wider discussion:
To support rolling upgrades, newer Kibana nodes should be able to read and write Saved Objects in the format of the existing Kibana nodes. To reduce the complexity of reading and writing backwards compatible documents, rolling upgrades will only be possible for incremental minor or major upgrades:
| Current version | Newer version | Rolling upgrade supported |
| --- | --- | --- |
| 7.3.x | 7.4.x | Yes |
| 7.3.x | 7.5.x | No |
| 7.last.x | 8.0.x | Yes |
| 7.last.x | 8.1.x | No |
This policy is different to Elasticsearch's rolling upgrade support policy. It will lead to frustration because many users do rolling upgrades from one major to another quite a long time after release of the new major, for example, 6.8.5 -> 7.5.1. This is supported in Elasticsearch - you can upgrade from the latest minor of one major to the latest minor of the next major. Also many users do not install every minor release, for example they might go 6.8 -> 7.2 -> 7.5 -> 7.7 -> 7.10 -> 8.4. I think that having a different policy for Kibana will make it impossible for most users to take advantage of rolling upgrades with Kibana.
/cc @clintongormley
I agree with https://github.com/elastic/kibana/issues/52202#issuecomment-574120551 - users should be able to go from any older minor to any newer minor in a single step with a rolling upgrade.
@droberts195 and @clintongormley These are still very early design drafts and we will create an RFC for wider feedback, but your early input is definitely appreciated.
To allow for a rolling upgrade, a newer node needs to continue operating in a backwards compatible way until all nodes have been upgraded. This includes APIs as well as the format of documents written to Elasticsearch. This is probably very similar to Elasticsearch rolling upgrades, but my assumption is that Kibana has a much higher API churn rate than Elasticsearch. I also assume that this high rate of change is necessary to support the rate of innovation on Kibana. Maintaining a backwards compatibility layer for an entire major will introduce a lot of complexity. With rolling upgrades being a new concept to Kibana, there's a risk that teams don't yet have the maturity to develop and evolve APIs in this way.
There are a lot of assumptions and unknowns here, but I think there's merit in starting with rolling minor upgrades as a first step and later building towards rolling upgrades from the latest minor to any minor in the next major.
Having said that, @tylersmalley and I discussed the idea of making rolling upgrades optional for each plugin. Some plugins might be supporting mission-critical workloads, whereas the impact of not being able to save a dashboard while waiting for all the nodes in the cluster to be upgraded is much lower.
If rolling upgrades are implemented per plugin we can build up experience before attempting to implement this for all of Kibana's plugins.
The browser-side Saved Objects client doesn't use concurrency control for Saved Object updates: https://github.com/elastic/kibana/blob/feceb0f98eb817f065834f8b6c9c628cee41383a/src/core/public/saved_objects/simple_saved_object.ts#L72-L75 Doing a quick search, it doesn't seem like we use SavedObjectsUpdateOptions.version much on the server-side either.
This means we'll get data loss if two clients open a saved object, then make different changes and save their changes.
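For reference, a sketch of what using the version for optimistic concurrency control could look like; the dashboard type, the attribute, and the import path are only illustrative:

```ts
import { SavedObjectsClientContract } from 'src/core/server';

// Sketch: pass the version read earlier back on update so Elasticsearch can
// reject the write with a 409 conflict if another client changed the object
// in the meantime. Omitting `version` means the last write silently wins.
async function updateDashboardTitle(
  savedObjectsClient: SavedObjectsClientContract,
  id: string,
  title: string
) {
  const existing = await savedObjectsClient.get('dashboard', id);
  return savedObjectsClient.update(
    'dashboard',
    id,
    { title },
    { version: existing.version }
  );
}
```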
Since rolling upgrades might not be a requirement any longer, we should instead focus on making upgrade downtime more predictable and avoid the need for manual intervention in the case of failure.
There are two classes of problems that cause upgrade downtime:
[Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [2683]/[1000] maximum shards open;];
search_phase_execution_exception Trying to create too many scroll contexts. Must be less than or equal to: [500]
[circuit_breaking_exception] [parent] Data too large, data for [] would be [2063683184/1.9gb], which is larger than the limit of [2023548518/1.8gb]
[process_cluster_event_timeout_exception] failed to process cluster event (index-aliases) within 30s
[process_cluster_event_timeout_exception] failed to process cluster event (create-index [.kibana_task_manager_1], cause [api]) within 30s
[search_phase_execution_exception] all shards failed
cluster.routing.allocation.disk.watermark.flood_stage
The root cause of (ii) - (viii) is configuration or performance issues with the ES cluster. Kibana cannot prevent or work around these, but it's important that the Kibana upgrade / migration resolves automatically once these issues go away. This would require an expiring lock so that another node can re-attempt the migration.
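A minimal sketch of what such an expiring lock could look like, assuming a dedicated lock document in Elasticsearch. The index name, document id, field names, and TTL are all made up, and a real implementation would also need _seq_no/_primary_term checks to avoid races when taking over an expired lock:

```ts
import { Client } from '@elastic/elasticsearch';

// Illustrative names only.
const LOCK_INDEX = '.kibana_migration_lock';
const LOCK_ID = 'migration_lock';
const LOCK_TTL_MS = 5 * 60 * 1000;

// Returns true if this node acquired the lock. Creating the lock document
// fails if it already exists; an expired lock is deleted so that a later
// attempt (by any node) can acquire it.
async function acquireMigrationLock(es: Client, nodeId: string): Promise<boolean> {
  try {
    await es.create({
      index: LOCK_INDEX,
      id: LOCK_ID,
      body: { nodeId, expiresAt: Date.now() + LOCK_TTL_MS },
    });
    return true;
  } catch (e) {
    const existing = await es.get({ index: LOCK_INDEX, id: LOCK_ID });
    if ((existing.body._source as { expiresAt: number }).expiresAt < Date.now()) {
      // Lock expired: remove it and let the caller retry acquisition.
      await es.delete({ index: LOCK_INDEX, id: LOCK_ID });
    }
    return false;
  }
}
```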
Since rolling upgrades might not be a requirement any longer, we should instead focus on making upgrade downtime more predictable and avoid the need for manual intervention in the case of failure.
++, summarizing some discussion from last week about rolling upgrades:
Some specific things that did come up during discussion (and reading above I think we have considered these for the most part):
Ability to have multiple Kibana instances up at the same time, but running different versions. These don't have to be available, but they shouldn't be able to write data we then lose.
API clients should be considered, in terms of what they can expect and how they should respond during an upgrade. For example, Elasticsearch returns a specific error code (503) and can include a Retry-After header to cue the client to retry later (see the sketch after this list). The effects of these errors when running Kibana instances behind a load balancer should also be understood.
Behaviours of internal systems like task management and alerting are understood and documented. For example, if there are delays in running tasks, can we warn about this state, and how does the system recover?
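For the API client point above, a sketch of what honouring a 503 with a Retry-After header could look like on the client side; fetchWithRetry is a hypothetical helper, not an existing Kibana utility:

```ts
// Retries a request while the server responds 503, waiting for the duration
// indicated by Retry-After (in seconds) or an exponential backoff otherwise.
async function fetchWithRetry(
  url: string,
  init: RequestInit = {},
  maxRetries = 5
): Promise<Response> {
  let res = await fetch(url, init);
  for (let attempt = 0; res.status === 503 && attempt < maxRetries; attempt++) {
    const retryAfterSeconds = Number(res.headers.get('retry-after')) || 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, retryAfterSeconds * 1000));
    res = await fetch(url, init);
  }
  return res;
}
```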
cc @clintongormley @skearns64
Closing in favour of https://github.com/elastic/kibana/pull/66056
Superseded by #66056
1. Motivation
Kibana version upgrades should have a minimal operational impact. To achieve this, users should be able to rely on:
The biggest hurdle to achieving the above is Kibana's Saved Object migrations. Migrations aren't resilient to errors and require manual intervention any time one of the following classes of errors arises:
It is not possible to discover these failures before initiating downtime. Transformation function bugs (7) and invalid data (8) often force users to roll back to a previous version of Kibana or cause hours of downtime. To retry the migration, users are asked to manually delete a .kibana_x index. If done incorrectly, this can lead to data loss, making it a terrifying experience (restoring from a pre-upgrade snapshot is a safer alternative, but not mentioned in the docs or logs). Cloud users don't have access to Kibana logs to be able to identify and remedy the cause of the migration failure. Apart from blindly retrying migrations by restoring a previous snapshot, Cloud users are unable to remedy a failed migration and have to escalate to support, which can further delay resolution.
Taken together, version upgrades often create a major operational impact and discourage users from adopting the latest features.
2. Short term plan
1. Dry run migrations (7.8)
2. Tag objects as “invalid” if their migration fails https://github.com/elastic/kibana/issues/55406
Open questions: How do we deal with an invalid document that has attributes that are incompatible with the mappings for this type? We could add an invalidJSON string mapping and, if persisting fails due to a mapping mismatch, persist the invalid document as a string.
3. Rolling back after a failed migration shouldn't require manually removing the lock
Kibana acquires a different lock per index (i.e. one for .kibana_n and one for .kibana_task_manager). If one index migration succeeds but the other fails, it is no longer possible to roll back to a previous version of Kibana since one of the indices contains newer data. If a migration fails, users should always be able to minimize downtime by rolling back Kibana to a previous version until they're able to resolve the root cause of the migration failure. (Due to Kibana/ES compatibility, this will only be possible during minor upgrades.)
4. Improve Saved Object validation (7.9)
Change the validation function signature from (doc: RawSavedObjectDoc) => void; to (doc: RawSavedObjectDoc) => RawSavedObjectDoc; so validation functions can also return a transformed document.
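A sketch of the difference; the RawSavedObjectDoc shape and the title attribute are simplified here, the point being that the new signature lets a validation function return a normalized or transformed document instead of only throwing:

```ts
interface RawSavedObjectDoc {
  id: string;
  type: string;
  attributes: Record<string, unknown>;
}

// Old style: throw on invalid input, return nothing.
const validateOld = (doc: RawSavedObjectDoc): void => {
  if (typeof doc.attributes.title !== 'string') {
    throw new Error(`[${doc.type}/${doc.id}] "title" must be a string`);
  }
};

// New style: validate and hand back the (possibly normalized) document.
const validateNew = (doc: RawSavedObjectDoc): RawSavedObjectDoc => {
  const title = doc.attributes.title;
  if (typeof title !== 'string') {
    throw new Error(`[${doc.type}/${doc.id}] "title" must be a string`);
  }
  return { ...doc, attributes: { ...doc.attributes, title: title.trim() } };
};
```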
Rolling upgrades (8.x): no longer relevant
Note: Rolling upgrades introduce significant complexity for plugins and a risk of bugs. We assume that as long as the downtime window is predictable, downtime as such is not a problem for our users. Since this allows us to have a dramatically simpler system, we won't aim to implement rolling upgrades unless this assumption is proven wrong.
System Design:
1. Node upgrade strategy
There are two possible strategies for upgrading each of the Kibana nodes in a cluster of N nodes:
Cluster doubling upgrade: requires 2N nodes to upgrade an N-node cluster, throughput capacity 100%.
In-place upgrade: requires N nodes to upgrade an N-node cluster, maximum throughput capacity (N-1)/N x 100% (e.g. roughly 67% for a 3-node cluster).
Discussion: Since Kibana is sometimes deployed on physical infrastructure, we cannot temporarily double the cluster size during a migration like you would be able to on cloud infrastructure. Support for in-place upgrades is, therefore, required.
2. Traffic routing strategy
Any upgrade scenario with a fixed cluster size (see “Node upgrade strategy”) will temporarily reduce the cluster throughput. However, how traffic is routed between outdated and upgraded nodes affects the temporary throughput of the cluster.
Drain existing connections from half of the outdated nodes and upgrade them. Suspend all new connections and drain existing connections from the remaining outdated nodes. Once existing connections are drained, resume new connections by routing them to the upgraded nodes. Upgrade the remaining half of the nodes and then route connections to all nodes. Throughput drops to 50% for Δdrain + Δupgrade_node, then to 0% for Δdrain (slightly less downtime but a lot of complexity).
Discussion: Δdrain fundamentally depends on how long it takes for Elasticsearch to respond but can be as long as the HTTP request timeout of 30 seconds. If zero throughput for 30+ seconds would be considered downtime, the only connection routing strategy that satisfies the constraints would be to do the "rolling upgrades" in (1).
3. Plugin complexity to support rolling upgrades
For a plugin to support rolling upgrades it needs to maintain backwards compatibility in order for outdated and upgraded nodes to both service requests during the upgrade process. Doing this for more than one minor back will add significant complexity and risk of bugs.
Having to maintain backwards compatibility for an entire major reduces the value of migrations. However, even if "up" and "down" transformations need to be written, business logic can always read and write in the latest format. To support "down" transformations, migrations have to be lossless and can only operate on a single document at a time.
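To illustrate the constraint, here's a sketch of a lossless, per-document up/down pair. The registration shape and the dashboard attribute are invented; what matters is that down(up(doc)) round-trips without losing information, so outdated nodes can still read what upgraded nodes write:

```ts
interface Doc {
  attributes: Record<string, unknown>;
}

const dashboardMigration_7_8_0 = {
  // Upgraded nodes read and write the new shape...
  up: (doc: Doc): Doc => {
    const { color, ...rest } = doc.attributes;
    return { ...doc, attributes: { ...rest, theme: { color } } };
  },
  // ...but can translate back so outdated nodes still understand the document.
  down: (doc: Doc): Doc => {
    const { theme, ...rest } = doc.attributes;
    return {
      ...doc,
      attributes: { ...rest, color: (theme as { color?: unknown } | undefined)?.color },
    };
  },
};
```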
To reduce the complexity of reading and writing backwards compatible documents, rolling upgrades could be limited to incremental minor or major upgrades:
Not all plugins require zero-downtime upgrades so this should be an opt-in feature for plugins where the added complexity can be justified. Plugins can opt out of supporting rolling upgrades by checking the progress of their migrations and blocking all API operations until migrations have completed.
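As a sketch of that opt-out path, a plugin could wrap its route handlers so that requests are rejected with a 503 and a Retry-After header until its migrations report completion. MigrationStatus, RouteHandler, and the response shape below are all hypothetical:

```ts
interface MigrationStatus {
  isComplete(pluginId: string): Promise<boolean>;
}

type RouteHandler = (
  request: unknown
) => Promise<{ status: number; headers?: Record<string, string>; body?: unknown }>;

// Wraps a route handler so the plugin's APIs are blocked until its saved
// object migrations have completed.
function blockUntilMigrated(
  pluginId: string,
  status: MigrationStatus,
  handler: RouteHandler
): RouteHandler {
  return async (request) => {
    if (!(await status.isComplete(pluginId))) {
      // Cue clients (and load balancers) to retry once migrations finish.
      return { status: 503, headers: { 'Retry-After': '30' }, body: 'migrations in progress' };
    }
    return handler(request);
  };
}
```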
Algorithm
To facilitate rolling upgrades, Kibana will maintain the following state for all nodes connected to an Elasticsearch cluster:
Kibana won’t block startup until all migrations are complete. Instead, documents will be migrated asynchronously in the background or when they’re read.
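A sketch of what the background path could look like: pick up a batch of outdated documents, run the same per-document transform used at read time, and write them back with optimistic concurrency. The index name, the query, and the transform parameter are all illustrative:

```ts
import { Client } from '@elastic/elasticsearch';

async function migrateBatch(
  es: Client,
  transform: (source: unknown) => unknown,
  batchSize = 100
): Promise<number> {
  const { body } = await es.search({
    index: '.kibana',
    size: batchSize,
    seq_no_primary_term: true,
    body: {
      // Illustrative query: documents not yet on the latest migration version.
      query: { bool: { must_not: { term: { 'migrationVersion.dashboard': '7.8.0' } } } },
    },
  });

  const hits: any[] = body.hits.hits;
  if (hits.length === 0) {
    return 0;
  }

  await es.bulk({
    body: hits.flatMap((hit) => [
      // Documents changed since they were read fail with a version conflict
      // in the bulk response and are simply picked up by a later batch.
      {
        index: {
          _index: hit._index,
          _id: hit._id,
          if_seq_no: hit._seq_no,
          if_primary_term: hit._primary_term,
        },
      },
      transform(hit._source),
    ]),
  });

  // The caller (e.g. a task manager task) reschedules until this returns 0.
  return hits.length;
}
```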
When a new Kibana node comes online it will:
For plugins that don’t opt-in to rolling upgrades:
Open questions: