Add the Ability to Disable certain REST APIs via a Cluster Setting

original-brownbear commented 2 years ago

Description

Elasticsearch contains a number of APIs that can produce very large responses when called in clusters containing a huge number of indices/shards. Examples include:

/_cluster/_state
/_mappings
/_shards
/_shard_stores

(particularly when called with the ?pretty option). These large responses can consume resources on the coordinating node(s) that respond to these APIs in ways that are unacceptable in a production cluster. So far when dealing with these issues we had to resort to tracking down the offending caller making the API calls to a cluster or adjusting the authorization setup to disable an API for a caller in order to stabilize a cluster. The first option is very time consuming and might involve making adjustments to a large number of processes calling an API. The second option is complicated and comes with a number of limitations depending on the exact role setup of a deployment.

-> we discussed this in the many-shards sync and decided we'd like to add a cluster setting that allows turning off REST APIs by path so that a cluster can be stabilized right away once the offending API has been identified.

I would suggest the cluster setting:

http.route.disable: ["/_mapping/", "..."]

that takes a list of paths exactly like we already have it in the REST request tracer.

relates #77466

elasticmachine commented 2 years ago

Pinging @elastic/es-distributed (Team:Distributed)

elasticmachine commented 2 years ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

rjernst commented 2 years ago

Wouldn't blocking these APIs cripple the diagnostics tooling?

original-brownbear commented 2 years ago

Wouldn't blocking these APIs cripple the diagnostics tooling?

Yes but that's to some degree the intended purpose here. E.g. The issue with a recent large deployment that motivated this (though this has been a recurring issue) was that some yet to be identified diagnostics script was hitting /_mappings to the point of bringing down all coordinating nodes. If the user needs to run diagnostics they can always (temporarily remove their filter setting) to e.g. the support diagnostics work. If diagnostics break the cluster on the other hand, this is a quick-fix to save production. Another example use case around support diagnostics in particular that I can see is clusters that got so large that /_cluster/state broke for them and would result in a support diags run breaking the cluster. Being able to selectively turn off APIs that never respond but always break the cluster would be quite valuable to at least get some diagnostics by disabling the state endpoint (obviously this could be had by enhancing the diags tool as well but that adds even more complexity IMO).

I should point out that this feature is not meant as a replacement for improving the stability of these APIs, it's merely to give us a better way of helping large deployments in the short-term and to have some insurance against unforeseen edge cases (e.g. #84266 creating an absurd response size on the ILM explain API and that causing trouble once someone finds their way to the Kibana page that calls the API).

rjernst commented 2 years ago

My concern is that such a crude hammer will allow cause users to break their own thumbs. I can already imagine issues where some API isn't responding as expected, and then finding they left one of these set, or tried blocking more than they should.

Does the api have to be general, or could it have a specific list of possible apis that can be blocked?

original-brownbear commented 2 years ago

Does the api have to be general, or could it have a specific list of possible apis that can be blocked?

That was my thought as well. Other's made a IMO very good point though that this leaves the possibility of missing an edge case or a new API ending up not scaling well unexpectedly. So this seems more useful if it's very flexible.

My concern is that such a crude hammer will allow cause users to break their own thumbs.

Maybe, but I think this wouldn't be used by anyone unless they ran into trouble with an API. It's hard to imagine this creating hard to resolve problems (so long as we don't allow blocking the settings update I guess :)) because any error message returned by blocked APIs would be quite obvious and inspecting the cluster settings also makes it trivial to understand the situation we're in with a cluster?

rjernst commented 2 years ago

inspecting the cluster settings also makes it trivial to understand the situation we're in with a cluster?

Unless we can't return the cluster settings. :)

As long as we put protections in for reading and writing settings, I guess this is ok.

DaveCTurner commented 2 years ago

Crude but IMO better than the alternative we have today: we can't block stuff even when desperate, despite blocking stuff being the obvious bandage to apply. If it's actually needed then that's a bug, but at least with this we have some chance of keeping things running until we land a fix.

I agree that it's vital to avoid confusion about what's going on with a cluster with this kind of block in place, and also I'd like us to make it clear that it's a temporary thing rather than a state we expect clusters to be in when working normally. The response needs to be super-clear, and I suggest we also emit a WARN log every few minutes if anything in this area is configured.

Also we need to be 110% certain that it's always possible to release the block. Could it be a Custom in the ephemeral cluster state rather than a (persistent) setting? That way a full cluster restart would drop it. Or we could have an allow-list read from elasticsearch.yml which overrides the cluster-level block-list. Or maybe a way to bypass the block with a randomly-generated header (reported in the logs but not in any API, so that a human has to be in the loop).

elastic / elasticsearch

Add the Ability to Disable certain REST APIs via a Cluster Setting #84876

Description