elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.91k stars 24.73k forks source link

Remove lenience in CCS version compatibility #85456

Open DaveCTurner opened 2 years ago

DaveCTurner commented 2 years ago

Cross-cluster search has stricter version compatibility constraints than other features involving remote clusters. For instance, Elasticsearch 8.1 can have 7.17 remote clusters but CCS is not supported on this pair of versions. In practice CCS works ok between 8.1 and 7.17 but we aren't testing this and therefore expect some future 8.x version to become CCS-incompatible with 7.17. It's not clear how such an incompatibility would manifest, but I would guess it could be a wire-format deserialisation failure which is typically painful to track down and causes the whole transport connection to close (see also https://github.com/elastic/elasticsearch/issues/75334).

I think we should remove this lenience and actively reject CCS requests to incompatible versions as per the documented compatibility matrix to avoid such a future surprise.

elasticmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

javanna commented 2 years ago

To clarify, the reason why there would be compatibility issues is once the newer version has a feature that the older version does not support. That happened in the 7.x series for instance when we introduced the fields parameter in the search request, as well as runtime fields. I have been wondering too if we should be stricter and enforce our policy, that would be perceived as a breaking change I'm afraid from users that are currently relying on this at their own risk. We will discuss it with the team.

DaveCTurner commented 2 years ago

I have been wondering too if we should be stricter and enforce our policy, that would be perceived as a breaking change I'm afraid from users that are currently relying on this at their own risk.

IMO that would be preferable to inadvertently introducing a breaking change in this area in some future version and having to deal with the fallout reactively. Users might be relying on this lenience at their own risk, but I bet many of them don't even know that they're taking such a risk.

jpountz commented 2 years ago

I'll try to share some context that hopefully helps understand how we ended in the current situation.

Elasticsearch would previously accept to search remote clusters between versions {X-1}.latest and version {X+1}.latest. This worked fine, you only had to be careful to use features that were supported by all versions of your remote clusters. For instance if you tried to use a feature that was introduced in version X.0, there are chances that the {X-1}.latest cluster would return failures because it doesn't know how to do it. This is a limitation, but it doesn't really come as a surprise, right?

The problem arises when thinking of cross-cluster search at the Kibana+Elasticsearch level. Kibana features don't have a 1:1 mapping with Elasticsearch features. For instance if Kibana moves from aggregations to the terms enum API for field value suggestions, or starts using _pit in Discover, Kibana users do not feel like they are using a brand new feature, they are still using the good old field value suggestion feature and Discover app that they have known almost forever.

Since we didn't want to have a different compatibility matrix at the Elasticsearch and at the Kibana level, we agreed to reduce the compatibility window of CCS to the previous minor, so that there would be a single compatibility matrix across Kibana and Elasticsearch, and Kibana could use modern Elasticsearch features (any feature that was available in the previous minor) without worrying about remote clusters.

The question of whether compatibility limitations should be enforced or just documented came up. It was noted that searching old remote clusters had been used successfully by some (of our larger) users, so there was strong hesitation to remove this capability completely as opposed to making it unsupported as a start. Cross-cluster search with old versions is not tested explicitly, but there is a significant overlap in terms of code paths with mixed-version clusters, where we still support having nodes on version {X-1}.latest and X.Y in the same cluster. This makes a strong assumption that CCS and mixed-version clusters keep using similar code paths in the future however.

That context being set, I think it's good to revisit this stance. There are definitely downsides to leniency.

DaveCTurner commented 2 years ago

Thanks @jpountz. I still think it's bad to have a feature in this limbo state where it's technically unsupported and not covered by tests, but we still believe it works fine in practice for now with the only information that might tell users about the risks they're taking buried deep in the reference manual. Intra-cluster search is already not completely the same as cross-cluster search (e.g. ccs_minimize_roundtrips doesn't make sense within a single cluster) so I don't think we get sufficient coverage from mixed-cluster tests.

I strongly prefer not to be lenient here, but I see why we might want more nuance than simply rejecting requests as I originally suggested. If we expect users to want to use these unsupported configs we should make it clearer that they're using something unsupported somewhere that stands out more than just here in the manual, e.g. via the deprecation logging mechanism. AIUI we don't emit any such logging or warning headers today. I'd also like us to reinstate the tests of these things so that we make a more conscious decision to introduce any future breaking change, which would mean we can take steps to provide the best possible feedback to users that they have encountered such a break.

jpountz commented 2 years ago

cc @giladgal

javanna commented 2 years ago

An additional data point, while we currently don't have CCS tests across multiple versions at all, we are working on adding them as part of #84481. This will expand our test coverage and cover for situations that make field_caps and search work differently in the cross-cluster scenario. We are though limiting this to the previous minor according to the updated CCS matrix compatibility.

k-saket commented 2 years ago

I think having an idea on how many users/clusters rely on this "unsupported" behaviour (i.e., are issuing CCS to a version older than the previous minor) would help. We are planning to add some telemetry on the usage of CCS, which should help us make a call whether to enforce the policy strictly (if only a few clusters rely on this behaviour and we can get them to upgrade) or start with deprecation notices & proactive outreach instead.

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)