elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Get Index API returns no indices for a brief amount of time until indices are recovered into cluster_state #90215

Closed rudolf closed 1 year ago

rudolf commented 2 years ago

Tested on 8.5.0-SNAPSHOT but I've also observed this behaviour on 8.4.1 clusters on Cloud.

1. Create a lot of indices

curl -s -k -X PUT "http://elastic:changeme@localhost:9200/test[1-500]" -H 'Content-Type: application/json'

2. Create a service token (seems to avoid 401 errors while .security is being restored?)

bin/elasticsearch-service-tokens create elastic/kibana mytoken

3. Start Elasticsearch and start polling the Get Index API:

for i in {1..10000}
do
    curl -k -w "\n" -X GET "http://localhost:9200/.kibana,.kibana_8.4.1_001?ignore_unavailable=true&features=aliases" -H 'Content-Type: application/json' -H 'Authorization: Bearer your_service_token'
done

Results:

curl: (7) Failed to connect to localhost port 9200 after 2 ms: Connection refused

...

curl: (7) Failed to connect to localhost port 9200 after 3 ms: Connection refused

curl: (7) Failed to connect to localhost port 9200 after 3 ms: Connection refused

{}        // <-- One response suggesting .kibana_8.5.0_001 does not exist
{".kibana_8.5.0_001":{"aliases":{".kibana":{"is_hidden":true},".kibana_8.5.0":{"is_hidden":true}},"mappings":{},"settings":{}}}
{".kibana_8.5.0_001":{"aliases":{".kibana":{"is_hidden":true},".kibana_8.5.0":{"is_hidden":true}},"mappings":{},"settings":{}}}
...

When Kibana is started at the same time as Elasticsearch, this behaviour causes Kibana to believe that there aren't any existing indices to migrate from. Kibana therefore attempts to initialize the cluster as a new cluster, which can lead to a failed upgrade migration and, in some circumstances, data loss.

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-security (Team:Security)

DaveCTurner commented 2 years ago

The get-index action itself checks that the cluster is properly recovered before responding:

https://github.com/elastic/elasticsearch/blob/e15d5221e7634c0d989bf66f23ab6891fce8f15c/server/src/main/java/org/elasticsearch/action/support/master/info/TransportClusterInfoAction.java#L50-L54

However, the security layer uses a different cluster state to resolve the indices, and it does this first so it may observe an unrecovered state here:

https://github.com/elastic/elasticsearch/blob/adf8e01286023407e1023cd6fa26ec0228127d1a/x-pack/plugin/security/src/main/java/org/elasticsearch/xpack/security/authz/AuthorizationService.java#L415

albertzaharovits commented 2 years ago

I think David's assessment of the core issue causing the bug is correct.

Just to clarify the problem statement, is this correct, @rudolf? ES with Security enabled temporarily responds that indices do not exist while the cluster state is being recovered.

The bug fix probably needs to implement a similar wait-for-state-recovery verification in the Security filter, but I'm not yet sure whether the check should run in all cases (i.e. for all actions on indices).

rudolf commented 2 years ago

ES with Security enabled temporarily responds that indices do not exist while the cluster state is being recovered.

Correct.

The first thing Kibana does is use the get indices API, but it also has many other interactions with cluster state like adding write blocks, cloning an index and adding/removing aliases through the aliases API. I don't know enough of the architecture to say what the implications might be, but I suspect e.g. cloning an index might also fail if the cluster state hasn't been recovered and the index is deemed to not exist. And different nodes could be in different phases of recovery.

So broadly speaking, as a consumer of Elasticsearch, I would expect it not to process any requests (including search and/or document API requests) before it's ready to give a "correct" answer.

albertzaharovits commented 2 years ago

I got to spend some more time on this today.

The Security request filter modifies the index name list of requests. It does so even if the cluster is not yet recovered (or, more generally, if there is a cluster metadata_read block). The downstream core code will wait until the cluster is recovered and then use the modified indices list. The problem is that the list would be different if Security had waited for the cluster to recover in the first place.

Security modifies the indices lists in requests in order to replace wildcards and, pertinent to the case at hand, to remove missing and unauthorized names when the ignore_unavailable request option is true https://github.com/elastic/elasticsearch/blob/30185053bf2511c5566a73871bf734e984e5dfe8/server/src/main/java/org/elasticsearch/cluster/metadata/IndexAbstractionResolver.java#L138
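The removal described above can be sketched as a toy model. This is not Elasticsearch code: `security_resolve`, the dict-shaped cluster states, and the index names are all invented for illustration, assuming only the behaviour described in this thread (names missing from the currently visible state are dropped when `ignore_unavailable` is set).

```python
# Toy model (not Elasticsearch code) of the behaviour described above: a
# hypothetical "security filter" drops names that are missing from its
# current view of the cluster state when ignore_unavailable is set.

def security_resolve(requested, cluster_state, ignore_unavailable):
    """Mimic the filter: remove names absent from the visible state."""
    if ignore_unavailable:
        return [name for name in requested if name in cluster_state["indices"]]
    return list(requested)

# Before recovery the visible state contains no indices yet.
unrecovered = {"indices": set()}
recovered = {"indices": {".kibana_8.5.0_001"}}

requested = [".kibana_8.5.0_001"]

# Resolving too early silently drops the name, so the downstream get-index
# action returns {} even though the index exists on disk.
print(security_resolve(requested, unrecovered, ignore_unavailable=True))  # []
print(security_resolve(requested, recovered, ignore_unavailable=True))    # ['.kibana_8.5.0_001']
```

With `ignore_unavailable` unset, the name survives unchanged and the downstream action can report "index not found" itself, which is the distinction the rest of the thread turns on.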

I think the best fix to implement is to error the REST request, from Security, if the cluster is not recovered yet (more specifically if there is a metadata_read type of block).

This sounds like a downgrade from when Security is disabled (because in that case we wait for the cluster to recover), but it's not quite so bad in practice. That's because we already reject requests during authentication and role resolving if the .security index is not available (index shards start to be allocated only after the cluster is recovered). So, the cases where requests will start to be rejected are for clients using non-indexed credentials (e.g. service accounts, file-based and reserved users) AND built-in or file-based role mappings and role descriptors (all of which are not indexed) - one practical setup example that I can think of is LDAP/AD with file-based role mapping and roles.

Alternatives that I've considered are:

Any input will be very appreciated @DaveCTurner @ywangd and @jakelandis .

jakelandis commented 2 years ago

Will the work to move to a single strategy for index name resolution (between core / x-pack) resolve this issue? I.e. after the move there won't be a difference, since it will always wait for recovery?

I assume that a metadata_read block means that we shouldn't read the index metadata while that block is present, so raising an error or waiting makes sense. I think ideally we would wait instead of error, since that is more aligned with other actions. Is core waiting due to the master node timeout associated with the master action? If so, then we could replicate that wait in the security filter: if the current cluster state has the block and this is a master action, then retry with a new view of the cluster state until the master timeout expires or the block clears. I am a bit outside my comfort zone here, so don't give my opinion too much merit.

ywangd commented 2 years ago

the problem is that I don't know how long to wait for (Core code knows because it has more context, ie the request type)

Could you please give me a few pointers (e.g. link to code) on how core has more context and why request type does not work for the security area of code? I think it might be fine to error, but I'd like to understand better why wait is not an option (or hard). Thanks!

rudolf commented 2 years ago

pertinent to the case at hand, to remove missing and unauthorized names when the ignore_unavailable request option is true

So would a workaround for Kibana be to not specify ignore_unavailable=true?

albertzaharovits commented 2 years ago

Will the work to move to a single strategy for index name resolution (between core / x-pack) resolve this issue? I.e. after the move there won't be a difference, since it will always wait for recovery?

No, I don't think that's the case. The problem here is not that the resolution implementation is different - that difference will go away once there is a single implementation. The problem is that the index resolution runs at different times during the request processing lifecycle. When Security is enabled it runs on the coordinating node, before the request is handled on that node. When Security is disabled it runs on the master node, while the request is being handled (or just before it, technically), see: https://github.com/elastic/elasticsearch/blob/bfccd20155e2244829892c8b309445d83ec27c81/server/src/main/java/org/elasticsearch/action/support/master/info/TransportClusterInfoAction.java#L52 In this context, the master node has logic to defer executing the action until no more cluster/index-level blocks are set: https://github.com/elastic/elasticsearch/blob/30e6fde1be43bd21e8fe300869c3c808afc5fad4/server/src/main/java/org/elasticsearch/action/support/master/TransportMasterNodeAction.java#L200
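The defer-on-block behaviour referenced above can be sketched roughly as follows. This is a hypothetical toy, not the real code path: the real TransportMasterNodeAction uses a ClusterStateObserver and listener callbacks rather than a polling loop, and `STATE_NOT_RECOVERED`, `wait_for_no_blocks`, and the dict-shaped states are invented names for illustration.

```python
import time

# Hypothetical sketch of "defer until no blocks are set": keep re-reading
# the cluster state while the state-not-recovered block is present, until a
# deadline (the master timeout in the real code) passes.

STATE_NOT_RECOVERED = "state-not-recovered"

def wait_for_no_blocks(get_state, timeout_s, poll_s=0.01):
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if STATE_NOT_RECOVERED not in state["blocks"]:
            return state  # safe to run the action against this state
        if time.monotonic() >= deadline:
            raise TimeoutError("cluster state block did not clear in time")
        time.sleep(poll_s)

# Simulate a state whose block clears after a few observations.
states = iter([{"blocks": {STATE_NOT_RECOVERED}}] * 3 + [{"blocks": set()}])
print(wait_for_no_blocks(lambda: next(states), timeout_s=1.0))  # {'blocks': set()}
```

The point of the sketch is that the wait happens as part of executing the action, after Security has already rewritten the request - which is exactly why replicating it earlier, in the filter, is what the thread debates.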

we could replicate that wait in the security filter. If the current cluster state has the block and is a master action then retry with a new view of cluster state until the master time out expires or the block clears.

Yes, technically we could replicate it, but looking at the code in class TransportMasterNodeAction#AsyncSingleAction, the retry logic looks to me very specific to the master node's inner workings, and I think it would be clumsy to extract it into the Security request filter. I think it might technically work (maybe @DaveCTurner will contradict me), but I'm having a hard time thinking of the interface for the method that we should factor out and reuse in the Security filter.

Could you please give me a few pointers (e.g. link to code) on how core has more context and why request type does not work for the security area of code?

https://github.com/elastic/elasticsearch/blob/30e6fde1be43bd21e8fe300869c3c808afc5fad4/server/src/main/java/org/elasticsearch/action/support/master/TransportMasterNodeAction.java#L175

This is the class that runs on the master node and implements the retry logic for actions executed there. It is my impression that exposing that in the request filter, to be run on the coordinating node, would be clumsy (we have to check the action/request type and find a way to factor out the block-retry logic).

So would a workaround for Kibana be to not specify ignore_unavailable=true?

I think that would be a workaround yes. It is unfortunate that the Security filter removes authorized but missing names from a request when ignore_unavailable=true. I think we can fix this.

To recap, I think we have a couple of options to consider. I can PoC any/all of them if that would help us to make a decision.

jakelandis commented 2 years ago

I looked at this in a bit more depth today...

In the Security filter, reject the request with 503/SERVICE_UNAVAILABLE when there is the metadata_read cluster-level block, like in the case where authn or authz needs something from the .security index which is not available yet.

If we do this then we mostly defeat the requested master timeout for cases where there is a metadata_read block. This would impact any master node action, of which there are many. The coordinating node, and then each action (on each node), could throw an exception based on its view of the cluster before it ever got to attempt the master action, and this could have wide-sweeping implications. We really only care about the master's view of the cluster to service master node requests, so I don't think we want the authorization code to throw here.

In the Security filter, retry the request, in the same case of a metadata_read cluster-level block, which requires exposing master node's retry logic to be executed on coordinating nodes.

I looked to see how hard this is... and assuming my untested POC is mostly correct, it's not too bad - mostly a copy/paste from the master node action to set up a cluster observer. However, this has the downside of mostly moving the waiting logic into the security code base and introducing multiple waits, i.e. if we wait here too, it is possible to wait master_timeout * N, where N is how many times authorization executes, plus the master node's own wait. This also defeats my earlier comment - we really only care about the master's view of the cluster to service master node requests.

Introducing this here also begs the question: what about non-master requests? Should we fail authorization for other non-master requests, such as search requests, if the metadata_read block exists? ...probably not. But if you apply the same reasoning that we can fail master node requests, then why not fail other types of requests if we cannot make reliable authz decisions?

For these reasons, I don't think we should add the wait here either.

Make the Security filter NOT remove authorized names for missing indices when ignore_unavailable=true; this will fix this bug issue specifically, but if instead the request used wildcards rather than explicit names, it would still be possible to erroneously report that no such indices exist when in fact the cluster state is not yet recovered.

I think this would be my preference. It's kind of hard to reason about how the authz code will react for indices with this block across all nodes performing authorization with different views of the state. There might be more manifestations of similar issues, but they should be very rare, short-lived, and very specific. It is probably best to handle them one at a time if they are discovered.

ywangd commented 2 years ago

Thanks @albertzaharovits for providing extra details and @jakelandis for the in-depth analysis.

The problem is the execution order between "transport action" and "resolving concrete indices". When security is disabled, the transport action is invoked first. If the transport action has any special logic for handling cluster state blocks, it has the chance to kick in before concrete indices are resolved. But when security is enabled, the order is reversed. Therefore the transport action's special logic no longer works, because concrete indices are already resolved. Hence this problem is not really about coordinating vs master node. For example, when security is disabled, the retry logic can also execute on the coordinating node for requests with local=true. Or, when security is enabled, I believe the same issue (GetIndex) can happen on single-node clusters as well.
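The order reversal described above can be made concrete with a toy timeline. This is not Elasticsearch code: `timeline`, `resolve`, and the two driver functions are invented for illustration, assuming only the sequencing described in this thread (resolve-then-wait with security enabled, wait-then-resolve with it disabled).

```python
# Toy illustration (not Elasticsearch code) of the two execution orders.
# `timeline` is the sequence of cluster states a node observes over time;
# recovery happens somewhere along it.

def resolve(names, state):
    return [n for n in names if n in state["indices"]]

def security_disabled(names, timeline):
    # The transport action waits for a recovered state *before* resolving.
    state = next(s for s in timeline if s["recovered"])
    return resolve(names, state)

def security_enabled(names, timeline):
    # Security resolves against the current state first; the transport
    # action's wait still happens afterwards, but too late to fix the list.
    names = resolve(names, timeline[0])
    next(s for s in timeline if s["recovered"])  # the wait still occurs
    return names

timeline = [
    {"recovered": False, "indices": set()},
    {"recovered": True, "indices": {".kibana_8.5.0_001"}},
]
print(security_disabled([".kibana_8.5.0_001"], timeline))  # ['.kibana_8.5.0_001']
print(security_enabled([".kibana_8.5.0_001"], timeline))   # []
```

Both paths perform the same wait; only the position of the resolution step relative to that wait differs, which is why this reproduces even on a single-node cluster.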

Based on the above, the ideal solution would be making security's "resolving concrete indices" lazy, so it is only executed when transport actions actually need it. More concretely, this means hooking security into methods like IndexNameExpressionResolver.concreteIndexNames. I think we talked about this approach at the beginning of the authZ performance project and decided it is too much of a change? This particular issue is unlikely to alter that decision for the time being. So we need something else.

Security has the intention to continue working with a broken cluster so that some rescue actions can be performed. Either throwing an error or retrying on cluster blocks would probably defeat this purpose. The request may still fail after security because the cluster is broken, but I don't think security should fail it on its own. Security decisions have always been more node-local.

Introducing this here also begs the question, what about non-master requests, should we fail authorization for other non-master requests, such as search requests if the metadata_read block exists ?

I think no. Core does not check cluster blocks for search. This retry-on-cluster-blocks logic really is master node action specific. If we do that in security, it turns the question into "How should security behave on cluster state blocks?", which is no longer the same as the original issue. The current answer to this question is basically "it does not matter". It may not be the best answer, but anything else will be a behaviour change.

Make the Security filter NOT remove authorized names for missing indices when ignore_unavailable=true; this will fix this bug issue specifically, but if instead the request used wildcards rather than explicit names, it would still be possible to erroneously report that no such indices exist when in fact the cluster state is not yet recovered

Does this re-introduce a difference between security and core? If so, I don't think we should do it.

I'd like to propose leaving the choice to the caller. Let security code issue a warning header when indices are resolved with cluster blocks. So the caller has the explicit knowledge that the request is handled with some limitation and can decide whether to retry or just bail (upon seeing the warning). The caller in this case can be Kibana. But it can also be internal, e.g. some subclass of AbstractNodeClient if we want to automate it on the server side. Though I'd start simple with just external callers.
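Caller-side handling of the proposed warning might look roughly like the sketch below. Everything here is hypothetical: Elasticsearch does not emit such a warning today, and the header text and `should_retry` helper are invented; the only assumption taken from the proposal is that the signal would arrive as a standard HTTP `Warning` header the caller can inspect.

```python
# Hypothetical caller-side handling of the proposed warning. The warning
# text is invented for illustration; no such warning exists today.

BLOCK_WARNING = "indices resolved while cluster state blocks were in place"

def should_retry(response_headers):
    """Retry when the (hypothetical) block warning is present."""
    return BLOCK_WARNING in response_headers.get("Warning", "")

print(should_retry({"Warning": '299 Elasticsearch "' + BLOCK_WARNING + '"'}))  # True
print(should_retry({}))  # False
```

The appeal of this shape is that the server stays permissive while the caller (Kibana, or an internal client) decides whether a possibly-incomplete answer is acceptable.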

It might even be possible to do nothing and ask end-users, e.g. Kibana, to use ignore_unavailable=false as a workaround. Depending on our long-term roadmap for authZ (e.g. making security hook into IndexNameExpressionResolver) and how fast we can get onto it, "do nothing" may actually be an acceptable solution.

albertzaharovits commented 2 years ago

Thank you Jake and Yang for the collaboration on this one.

Given your input I'm now convinced that waiting or erroring in the Security filter (the first two options) for the cluster-level metadata_read block are NOT viable options, at least at the moment.

Initially, I preferred the option to error the request if the metadata_read cluster-level block was set, but Jake's argument clicked with me: ILM and other internal jobs, which reuse the authentication of the job's creator but are nevertheless authorized, would potentially be impacted by this change. It made me realize the proposed change has broader impact (the alternative, to wait for the block to be lifted, has an equivalently broad impact). To Yang's point, authz is assumed to broadly work even if the cluster is recovering, has blocks, etc.

The last option has the advantage that it doesn't have a broad impact. But it also doesn't solve the general problem, because wildcards will still be expanded to an empty list (or a subset?) while the cluster state is being recovered. But this sort of problem is inherent in the current design, which performs authorization against the local cluster state on the coordinating node.

@ywangd At some point we're going to have a single implementation of the expression resolver, which in the current design is going to be called by the Security filter on the coordinating node, parameterized by the user's permissions. It's still an option that we then decide to move the invocation to the data/master nodes, away from the coordinating node, in order to fix this sort of bug. But there are drawbacks to this strategy, some of which are: it runs counter to the performance objective (i.e. do authz once on the coordinating node vs. multiple times on the data/master nodes); the auditing system relies on names being expanded early (i.e. on the coordinating node); and if the Security configuration differs across nodes, that becomes a bigger problem than it is today. So that decision will not be easy.


@rudolf The workaround for Kibana is to set ignore_unavailable=false or leave it to default (which is also false for that API). On the Kibana side, you'll now have to handle an "index not found" exception instead of an empty response. But it is now guaranteed that you'll not get such an exception if the index exists but the cluster state is being recovered.
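On the client side, the workaround above amounts to switching from "empty body means no index" to "404 means no index". A toy sketch, with simplified response shapes - `indices_to_migrate` is a hypothetical helper, not Kibana or Elasticsearch client code:

```python
# Toy sketch of the workaround: with ignore_unavailable=false the client
# treats a 404 (index_not_found_exception) as "nothing to migrate from"
# instead of relying on an empty body. Response shapes are simplified.

def indices_to_migrate(status_code, body):
    if status_code == 404:
        return []            # the index genuinely does not exist
    if status_code == 200:
        return sorted(body)  # top-level keys are the concrete index names
    raise RuntimeError("unexpected status " + str(status_code))

print(indices_to_migrate(404, {"error": {"type": "index_not_found_exception"}}))  # []
print(indices_to_migrate(200, {".kibana_8.5.0_001": {"aliases": {}}}))  # ['.kibana_8.5.0_001']
```

The guarantee described above is what makes this safe: once the 404 path is reached, the index is genuinely absent rather than merely not yet visible in an unrecovered cluster state.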

@jakelandis @ywangd It is a problem that the Security filter rewrites the request to remove names for resources that do not exist when ignore_unavailable=true. It shouldn't do that if the name is authorized, because the core code will handle this case correctly. My takeaway from the above conversation is that fixing this is sufficient to consider the issue dealt with. The fix is not difficult in the current codebase, but given that we want to unify the two expression resolution algorithms, it's going to require some planning. I'll raise an issue for it.