Pinging @elastic/kibana-platform (Team:Platform)
Also related to https://github.com/elastic/kibana/issues/49764
One major change we have discussed recently, and something we investigated during the development of the first Saved Object migration, is not running them on startup at all. Let me explain.
We would continue to write migrations; however, those migrations would not be run on startup. They would instead be applied when objects are read or written. The reason we decided against this the first time around was that one of the main things we wanted to achieve was the ability to change a field's mapping type. The difference now is that we have the task manager, which can process the re-writes in the background one-by-one while Kibana is running, and we could surface the status in the UI. For most migrations (I have provided an audit here) there would only be a negligible performance regression while the migrations are not persisted. One thing we would need to enforce is that any object can be read and then immediately written back. So if you're changing a field's mapping type, you would need to use a new property and handle that in the implementation of your search (this should be very rare).
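To make the idea concrete, here's a rough sketch of what migrate-on-read could look like. All of the names (RawDoc, migrateOnRead, the registry shape, the version comparison) are invented for illustration and don't reflect the actual saved objects implementation:

```ts
// Rough sketch only: RawDoc, the migration registry shape and the version
// comparison are simplified, not the actual Kibana implementation.
interface RawDoc {
  id: string;
  type: string;
  migrationVersion: Record<string, string>;
  attributes: Record<string, unknown>;
}

type MigrationFn = (doc: RawDoc) => RawDoc;

// Migrations registered per saved object type, keyed by the version that introduced them.
const migrations: Record<string, Record<string, MigrationFn>> = {
  dashboard: {
    '7.7.0': (doc) => ({
      ...doc,
      attributes: { ...doc.attributes, newField: 'some default' },
    }),
  },
};

// Naive semver-ish comparison, good enough for the sketch.
const newerThan = (a: string, b: string) =>
  a.localeCompare(b, undefined, { numeric: true }) > 0;

// Applied whenever a document is read; the migrated shape is only persisted
// the next time the object is written (or by a background task).
function migrateOnRead(doc: RawDoc): RawDoc {
  const typeMigrations = migrations[doc.type] ?? {};
  return Object.keys(typeMigrations)
    .sort((a, b) => a.localeCompare(b, undefined, { numeric: true }))
    .filter((version) => newerThan(version, doc.migrationVersion[doc.type] ?? '0.0.0'))
    .reduce(
      (acc, version) => ({
        ...typeMigrations[version](acc),
        migrationVersion: { ...acc.migrationVersion, [doc.type]: version },
      }),
      doc
    );
}
```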
One benefit of this is that if a migration fails for a single object, we simply don't migrate that object and only that object is affected. We can then send that information to Pulse to be alerted on.
Another thing migrations set out to do was ensure we have the correct mappings, since users used to frequently tamper with them. This will be mitigated by the mappings living in an Elasticsearch plugin and not being manageable by the user.
I just stumbled across this issue while searching for something else, but noticed that this bit might need wider discussion:
To support rolling upgrades, newer Kibana nodes should be able to read and write Saved Objects in the format of the existing Kibana nodes. To reduce the complexity of reading and writing backwards compatible documents, rolling upgrades will only be possible for incremental minor or major upgrades:
| Current version | Newer version | Rolling upgrade supported |
| --- | --- | --- |
| 7.3.x | 7.4.x | Yes |
| 7.3.x | 7.5.x | No |
| 7.last.x | 8.0.x | Yes |
| 7.last.x | 8.1.x | No |
This policy is different to Elasticsearch's rolling upgrade support policy. It will lead to frustration because many users do rolling upgrades from one major to another quite a long time after release of the new major, for example, 6.8.5 -> 7.5.1. This is supported in Elasticsearch - you can upgrade from the latest minor of one major to the latest minor of the next major. Also many users do not install every minor release, for example they might go 6.8 -> 7.2 -> 7.5 -> 7.7 -> 7.10 -> 8.4. I think that having a different policy for Kibana will make it impossible for most users to take advantage of rolling upgrades with Kibana.
/cc @clintongormley
I agree with https://github.com/elastic/kibana/issues/52202#issuecomment-574120551 - users should be able to go from any older minor to any newer minor in a single step with a rolling upgrade.
@droberts195 and @clintongormley These are still very early design drafts and we will create an RFC for wider feedback, but your early input is definitely appreciated.
To allow for a rolling upgrade, a newer node needs to continue operating in a backwards compatible way until all nodes have been upgraded. This includes APIs as well as the format of documents written to Elasticsearch. This is probably very similar to Elasticsearch rolling upgrades, but my assumption is that Kibana has a much higher API churn rate than Elasticsearch. I also assume that this high rate of change is necessary to support the rate of innovation on Kibana. Maintaining a backwards compatibility layer for an entire major will introduce a lot of complexity. With rolling upgrades being a new concept to Kibana, there's a risk that teams don't yet have the maturity to develop and evolve APIs in this way.
There are a lot of assumptions and unknowns here, but I think there's merit in starting with rolling minor upgrades as a first step and later building towards rolling upgrades from the latest minor to any minor in the next major.
Having said that, @tylersmalley and I discussed the idea of making rolling upgrades optional for each plugin. Some plugins might be supporting mission-critical workloads, whereas the impact of not being able to save a dashboard while waiting for all the nodes in the cluster to be upgraded is much lower.
If rolling upgrades are implemented per plugin we can build up experience before attempting to implement this for all of Kibana's plugins.
The browser-side Saved Objects client doesn't use concurrency control for Saved Object updates: https://github.com/elastic/kibana/blob/feceb0f98eb817f065834f8b6c9c628cee41383a/src/core/public/saved_objects/simple_saved_object.ts#L72-L75 Doing a quick search, it doesn't seem like we use SavedObjectsUpdateOptions.version much on the server-side either.
This means we'll get data loss if two clients open a saved object, then make different changes and save their changes.
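For reference, a sketch of what using the version for optimistic concurrency control could look like; the dashboard type, the attribute, and the import path are only illustrative:

```ts
import { SavedObjectsClientContract } from 'src/core/server';

// Sketch: pass the version read earlier back on update so Elasticsearch can
// reject the write with a 409 conflict if another client changed the object
// in the meantime. Omitting `version` means the last write silently wins.
async function updateDashboardTitle(
  savedObjectsClient: SavedObjectsClientContract,
  id: string,
  title: string
) {
  const existing = await savedObjectsClient.get('dashboard', id);
  return savedObjectsClient.update(
    'dashboard',
    id,
    { title },
    { version: existing.version }
  );
}
```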
Since rolling upgrades might not be a requirement any longer, we should instead focus on making upgrade downtime more predictable and avoid the need for manual intervention in the case of failure.
There are two classes of problems that cause upgrade downtime:
[Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [2683]/[1000] maximum shards open;];
search_phase_execution_exception Trying to create too many scroll contexts. Must be less than or equal to: [500]
[circuit_breaking_exception] [parent] Data too large, data for [] would be [2063683184/1.9gb], which is larger than the limit of [2023548518/1.8gb]
[process_cluster_event_timeout_exception] failed to process cluster event (index-aliases) within 30s
[process_cluster_event_timeout_exception] failed to process cluster event (create-index [.kibana_task_manager_1], cause [api]) within 30s
[search_phase_execution_exception] all shards failed
cluster.routing.allocation.disk.watermark.flood_stage
The root cause of (ii) - (viii) is configuration or performance issues with the ES cluster. Kibana cannot prevent or work around these, but it's important that the Kibana upgrade / migration resolves automatically once these issues go away. This would require an expiring lock so that another node can re-attempt the migration.
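A minimal sketch of what such an expiring lock could look like, assuming a dedicated lock document in Elasticsearch. The index name, document id, field names, and TTL are all made up, and a real implementation would also need _seq_no/_primary_term checks to avoid races when taking over an expired lock:

```ts
import { Client } from '@elastic/elasticsearch';

// Illustrative names only.
const LOCK_INDEX = '.kibana_migration_lock';
const LOCK_ID = 'migration_lock';
const LOCK_TTL_MS = 5 * 60 * 1000;

// Returns true if this node acquired the lock. Creating the lock document
// fails if it already exists; an expired lock is deleted so that a later
// attempt (by any node) can acquire it.
async function acquireMigrationLock(es: Client, nodeId: string): Promise<boolean> {
  try {
    await es.create({
      index: LOCK_INDEX,
      id: LOCK_ID,
      body: { nodeId, expiresAt: Date.now() + LOCK_TTL_MS },
    });
    return true;
  } catch (e) {
    const existing = await es.get({ index: LOCK_INDEX, id: LOCK_ID });
    if ((existing.body._source as { expiresAt: number }).expiresAt < Date.now()) {
      // Lock expired: remove it and let the caller retry acquisition.
      await es.delete({ index: LOCK_INDEX, id: LOCK_ID });
    }
    return false;
  }
}
```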
Since rolling upgrades might not be a requirement any longer, we should instead focus on making upgrade downtime more predictable and avoid the need for manual intervention in the case of failure.
++, summarizing some discussion from last week about rolling upgrades:
Some specific things that did come up during discussion (and reading above I think we have considered these for the most part):
Ability to have multiple Kibana instances up at the same time, but running different versions. These don't have to be available, but they shouldn't be able to write data we then lose.
API clients should be considered, in terms of what they can expect and how they should respond during an upgrade. For example, Elasticsearch returns a specific error code (503) and can include a Retry-After header to cue the client to retry later (see the sketch after this list). The effects of these errors when running Kibana instances behind a load balancer should also be understood.
Behaviours of internal systems like task management and alerting are understood and documented. For example, if there are delays in running tasks, can we warn about this state, and how does the system recover?
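For the API client point above, a sketch of what honouring a 503 with a Retry-After header could look like on the client side; fetchWithRetry is a hypothetical helper, not an existing Kibana utility:

```ts
// Retries a request while the server responds 503, waiting for the duration
// indicated by Retry-After (in seconds) or an exponential backoff otherwise.
async function fetchWithRetry(
  url: string,
  init: RequestInit = {},
  maxRetries = 5
): Promise<Response> {
  let res = await fetch(url, init);
  for (let attempt = 0; res.status === 503 && attempt < maxRetries; attempt++) {
    const retryAfterSeconds = Number(res.headers.get('retry-after')) || 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, retryAfterSeconds * 1000));
    res = await fetch(url, init);
  }
  return res;
}
```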
cc @clintongormley @skearns64
Closing in favour of https://github.com/elastic/kibana/pull/66056
Superseded by #66056
1. Motivation
Kibana version upgrades should have a minimal operational impact. To achieve this, users should be able to rely on:
The biggest hurdle to achieving the above is Kibana's Saved Object migrations. Migrations aren't resilient to errors and require manual intervention any time one of the following classes of errors arises:
It is not possible to discover these failures before initiating downtime. Transformation function bugs (7) and invalid data (8) often force users to roll back to a previous version of Kibana or cause hours of downtime. To retry the migration, users are asked to manually delete a .kibana_x index. If done incorrectly, this can lead to data loss, making it a terrifying experience (restoring from a pre-upgrade snapshot is a safer alternative, but not mentioned in the docs or logs). Cloud users don't have access to Kibana logs to be able to identify and remedy the cause of the migration failure. Apart from blindly retrying migrations by restoring a previous snapshot, Cloud users are unable to remedy a failed migration and have to escalate to support, which can further delay resolution.
Taken together, version upgrades often create a major operational impact and discourage users from adopting the latest features.
2. Short term plan
1. Dry run migrations (7.8)
2. Tag objects as “invalid” if their migration fails https://github.com/elastic/kibana/issues/55406
Open questions: How do we deal with an invalid document that has attributes that are incompatible with the mappings for this type? We could add an invalidJSON string mapping and, if persisting fails due to a mapping mismatch, persist the invalid document as a string.
3. Rolling back after a failed migration shouldn't require manually removing the lock
Kibana acquires a different lock per index (i.e. one for .kibana_n and one for .kibana_task_manager). If one index migration succeeds but the other fails, it is no longer possible to roll back to a previous version of Kibana since one of the indices contains newer data. If a migration fails, users should always be able to minimize downtime by rolling back Kibana to a previous version until they're able to resolve the root cause of the migration failure. (Due to Kibana/ES compatibility, this will only be possible during minor upgrades.)
4. Improve Saved Object validation (7.9)
Change the validation function signature from (doc: RawSavedObjectDoc) => void; to (doc: RawSavedObjectDoc) => RawSavedObjectDoc; so validation functions can also return a transformed document.
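A sketch of the difference; the RawSavedObjectDoc shape and the title attribute are simplified here, the point being that the new signature lets a validation function return a normalized or transformed document instead of only throwing:

```ts
interface RawSavedObjectDoc {
  id: string;
  type: string;
  attributes: Record<string, unknown>;
}

// Old style: throw on invalid input, return nothing.
const validateOld = (doc: RawSavedObjectDoc): void => {
  if (typeof doc.attributes.title !== 'string') {
    throw new Error(`[${doc.type}/${doc.id}] "title" must be a string`);
  }
};

// New style: validate and hand back the (possibly normalized) document.
const validateNew = (doc: RawSavedObjectDoc): RawSavedObjectDoc => {
  const title = doc.attributes.title;
  if (typeof title !== 'string') {
    throw new Error(`[${doc.type}/${doc.id}] "title" must be a string`);
  }
  return { ...doc, attributes: { ...doc.attributes, title: title.trim() } };
};
```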
Rolling upgrades (8.x): no longer relevant
Note: Rolling upgrades introduce significant complexity for plugins and a risk of bugs. We assume that as long as the downtime window is predictable, downtime as such is not a problem for our users. Since this allows us to have a dramatically simpler system, we won't aim to implement rolling upgrades unless this assumption is proven wrong.
System Design:
1. Node upgrade strategy
There are two possible strategies for upgrading each of the Kibana nodes in a cluster of N nodes:
Cluster doubling upgrade: requires 2N nodes to upgrade an N-node cluster, throughput capacity 100%.
In-place upgrade: requires N nodes to upgrade an N-node cluster, maximum throughput capacity (N-1)/N x 100% (e.g. roughly 67% for a 3-node cluster).
Discussion: Since Kibana is sometimes deployed on physical infrastructure, we cannot temporarily double the cluster size during a migration like you would be able to on cloud infrastructure. Support for in-place upgrades is, therefore, required.
2. Traffic routing strategy
Any upgrade scenario with a fixed cluster size (see “Node upgrade strategy”) will temporarily reduce the cluster throughput. However, how traffic is routed between outdated and upgraded nodes affects the temporary throughput of the cluster.
Drain existing connections from half of the outdated nodes and upgrade them. Suspend all new connections and drain existing connections from the remaining outdated nodes. Once existing connections are drained, resume new connections by routing them to the upgraded nodes. Upgrade the remaining half of the nodes and then route connections to all nodes. Throughput drops to 50% for Δdrain + Δupgrade_node, then to 0% for Δdrain (slightly less downtime but a lot of complexity).
Discussion: Δdrain fundamentally depends on how long it takes for Elasticsearch to respond but can be as long as the HTTP request timeout of 30 seconds. If zero throughput for 30+ seconds would be considered downtime, the only connection routing strategy that satisfies the constraints would be to do the "rolling upgrades" in (1).
3. Plugin complexity to support rolling upgrades
For a plugin to support rolling upgrades it needs to maintain backwards compatibility in order for outdated and upgraded nodes to both service requests during the upgrade process. Doing this for more than one minor back will add significant complexity and risk of bugs.
Having to maintain backwards compatibility for an entire major reduces the value of migrations. However, even if "up" and "down" transformations need to be written, business logic can always read and write in the latest format. To support "down" transformations, migrations have to be lossless and can only operate on a single document at a time.
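To illustrate the constraint, here's a sketch of a lossless, per-document up/down pair. The registration shape and the dashboard attribute are invented; what matters is that down(up(doc)) round-trips without losing information, so outdated nodes can still read what upgraded nodes write:

```ts
interface Doc {
  attributes: Record<string, unknown>;
}

const dashboardMigration_7_8_0 = {
  // Upgraded nodes read and write the new shape...
  up: (doc: Doc): Doc => {
    const { color, ...rest } = doc.attributes;
    return { ...doc, attributes: { ...rest, theme: { color } } };
  },
  // ...but can translate back so outdated nodes still understand the document.
  down: (doc: Doc): Doc => {
    const { theme, ...rest } = doc.attributes;
    return {
      ...doc,
      attributes: { ...rest, color: (theme as { color?: unknown } | undefined)?.color },
    };
  },
};
```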
To reduce the complexity of reading and writing backwards compatible documents, rolling upgrades could be limited to incremental minor or major upgrades:
Not all plugins require zero-downtime upgrades so this should be an opt-in feature for plugins where the added complexity can be justified. Plugins can opt out of supporting rolling upgrades by checking the progress of their migrations and blocking all API operations until migrations have completed.
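As a sketch of that opt-out path, a plugin could wrap its route handlers so that requests are rejected with a 503 and a Retry-After header until its migrations report completion. MigrationStatus, RouteHandler, and the response shape below are all hypothetical:

```ts
interface MigrationStatus {
  isComplete(pluginId: string): Promise<boolean>;
}

type RouteHandler = (
  request: unknown
) => Promise<{ status: number; headers?: Record<string, string>; body?: unknown }>;

// Wraps a route handler so the plugin's APIs are blocked until its saved
// object migrations have completed.
function blockUntilMigrated(
  pluginId: string,
  status: MigrationStatus,
  handler: RouteHandler
): RouteHandler {
  return async (request) => {
    if (!(await status.isComplete(pluginId))) {
      // Cue clients (and load balancers) to retry once migrations finish.
      return { status: 503, headers: { 'Retry-After': '30' }, body: 'migrations in progress' };
    }
    return handler(request);
  };
}
```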
Algorithm
To facilitate rolling upgrades, Kibana will maintain the following state for all nodes connected to an Elasticsearch cluster:
Kibana won’t block startup until all migrations are complete. Instead, documents will be migrated asynchronously in the background or when they’re read.
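A sketch of what the background path could look like: pick up a batch of outdated documents, run the same per-document transform used at read time, and write them back with optimistic concurrency. The index name, the query, and the transform parameter are all illustrative:

```ts
import { Client } from '@elastic/elasticsearch';

async function migrateBatch(
  es: Client,
  transform: (source: unknown) => unknown,
  batchSize = 100
): Promise<number> {
  const { body } = await es.search({
    index: '.kibana',
    size: batchSize,
    seq_no_primary_term: true,
    body: {
      // Illustrative query: documents not yet on the latest migration version.
      query: { bool: { must_not: { term: { 'migrationVersion.dashboard': '7.8.0' } } } },
    },
  });

  const hits: any[] = body.hits.hits;
  if (hits.length === 0) {
    return 0;
  }

  await es.bulk({
    body: hits.flatMap((hit) => [
      // Documents changed since they were read fail with a version conflict
      // in the bulk response and are simply picked up by a later batch.
      {
        index: {
          _index: hit._index,
          _id: hit._id,
          if_seq_no: hit._seq_no,
          if_primary_term: hit._primary_term,
        },
      },
      transform(hit._source),
    ]),
  });

  // The caller (e.g. a task manager task) reschedules until this returns 0.
  return hits.length;
}
```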
When a new Kibana node comes online it will:
For plugins that don’t opt-in to rolling upgrades:
Open questions: