original-brownbear commented 2 years ago

83832 introduced a rate limit of 100 per coordinating node for certain stats requests. This causes issues once ILM starts executing more than 100 of them at a time which it can easily do during a rollover-ready check.

For one, if more than 100 indices need to be checked for rollover readiness during one periodic execution of ILM those in excess of the 100 will be marked as in ILM error. More importantly though, this effectively limits us to checking only 100 indices per ILM trigger period (10 minutes by default) which will cause issues in larger deployments.

I think the easiest solution here might be to disable this new functionality by default to not introduce new functionality that might cause trouble elsewhere into 8.2 given how close the 8.2 FF is.

elasticmachine commented 2 years ago

Pinging @elastic/es-data-management (Team:Data Management)

joegallo commented 2 years ago

Making a note here of https://github.com/elastic/elasticsearch/pull/82830, too -- IIUC this should have sped up some of the related stats requests significantly.

dakrone commented 2 years ago

Okay, a few (non-exhaustive!) ideas for fixing this for us to discuss:

Disable the limiting for this API entirely.

This is the fast-and-easy fix, though we remove the pushback and we'll have to figure out if/how to re-add it in the future. This might be something we do for 8.2, but I don't think we should consider it a permanent fix.

Add a special flag that makes requests not subject to the 100-max limit.

This would allow the ILM-made requests to pass this flag through and not run into the limit. The downside for this is that if we still increment the number of concurrent requests, we could end up causing other stats requests to fail. We don't necessarily have to increment the request counter though, so it could still be an option.

Handle rejections in the ILM action itself.

We could handle these in the onFailure listener for the rollover check, and then do a retry (maybe with a timeout) inside the WaitForRolloverReadyStep class itself if it got the "too many concurrent stats requests" error. It won't really help too much though, because the first 100 will get in, and then all the others will be sitting there doing retries and we'll still end up not being able to get the request through.

Add the ability to "wait-for-an-empty-slot" for the stats request.

This would let us use the timeout to say, wait 30 seconds on the semaphore for a permit to be available. It essentially limits us to 100 concurrently, but they will all queue up behind one another, and we could end up in the state where the last rollover check waits longer than the timeout and we end up in an error anyway.

Implement check-rollover-ready within ILM differently.

Since we're triggering most of these from a cluster state changed event, we could grab the stats once at the beginning of the new state, and pass it in to all of these events (so that they use the same stats). This would entail probably not calling the rollover stuff itself, but factoring that logic out so that ILM could use it directly (passing in the stats info as necessary).

Make TransportRolloverAction divorced from synchronous stats collection.

For this, we'd make TransportRolloverAction use stats for all indices gathered every N seconds, similar to what we do for gathering disk usage. It would then scale independently from ILM to as many indices as we want without having to get the stats on-demand each time. This would add complexity and change the contract for rollover that we currently have though, so I don't think I am in favor of this.

Well, those are a few options to kick off the discussion, I don't think I have a favorite yet, and there are undoubtedly more ways to go about fixing this. Going to /cc @gmarouli also since she worked on the initial PR for limiting this also, in case she has some ideas.

joegallo commented 2 years ago

We should probably change the label (and maybe changelog entry) for #83832 -- for example, if we disable this by default with 8.2.0, then we wouldn't want to have the changelog go out as if the feature was included as originally written.

joegallo commented 2 years ago

~I tweaked the label back to v8.2.0 because this remains an 8.2.0 blocker (but I'll remove the blocker label and update the version label after https://github.com/elastic/elasticsearch/pull/85504 has been merged).~

I reverted https://github.com/elastic/elasticsearch/pull/83832 from the 8.2 branch via https://github.com/elastic/elasticsearch/pull/85504, and I've updated the version tag here to v8.3.0 rather than v8.2.0. For the moment I think this stays as an 8.3.0 blocker until we further resolve things.

joegallo commented 2 years ago

I reverted #83832 from master (before the 8.3 branch was created) via https://github.com/elastic/elasticsearch/pull/87054, so I'm removing the 'blocker' label from this issue.

joegallo commented 2 years ago

@DaveCTurner and I discussed this issue again today, and I'd like to add additional possible option onto https://github.com/elastic/elasticsearch/issues/85333#issuecomment-1078138710:

Ignore the limit for stats requests that are scoped to a single index. At least in the many shards benchmark cases, it doesn't appear that running tons of these simultaneously ends up being a problem (note: double check this, though!), so we could simply allow them through and not have them count towards the limit.

DaveCTurner commented 2 years ago

Also:

Avoid making thousands of individual stats requests while executing check-rollover-ready steps. We could for instance introduce a bulk dry-run rollover API which makes a single stats call for all the target indices at once.

henningandersen commented 2 years ago

Adding one more to the mix:

Move the determination of when to rollover to primary shard(s). This introduces a slight imprecision when rolling over on total doc count or total index size (I imagine simply dividing by number of shards as the per shard limit). But would be very efficient and could serve as foundation for pushing back on indexing when a rollover does not happen in a timely manner.

elastic / elasticsearch

Pushing back on index stats requests can cause ILM rollover-ready checks to pile up #85333

83832 introduced a rate limit of 100 per coordinating node for certain stats requests. This causes issues once ILM starts executing more than 100 of them at a time which it can easily do during a rollover-ready check.