Open original-brownbear opened 2 years ago
Pinging @elastic/es-data-management (Team:Data Management)
Related to https://github.com/elastic/elasticsearch/issues/51992 (the issue that https://github.com/elastic/elasticsearch/pull/83832 closed), also related to https://github.com/elastic/elasticsearch/issues/77466
Making a note here of https://github.com/elastic/elasticsearch/pull/82830, too -- IIUC this should have sped up some of the related stats requests significantly.
Okay, a few (non-exhaustive!) ideas for fixing this for us to discuss:
This is the fast-and-easy fix, though we remove the pushback and we'll have to figure out if/how to re-add it in the future. This might be something we do for 8.2, but I don't think we should consider it a permanent fix.
This would allow the ILM-made requests to pass this flag through and not run into the limit. The downside for this is that if we still increment the number of concurrent requests, we could end up causing other stats requests to fail. We don't necessarily have to increment the request counter though, so it could still be an option.
We could handle these in the onFailure
listener for the rollover check, and then do a retry (maybe with a timeout) inside the WaitForRolloverReadyStep
class itself if it got the "too many concurrent stats requests" error. It won't really help too much though, because the first 100 will get in, and then all the others will be sitting there doing retries and we'll still end up not being able to get the request through.
This would let us use the timeout to say, wait 30 seconds on the semaphore for a permit to be available. It essentially limits us to 100 concurrently, but they will all queue up behind one another, and we could end up in the state where the last rollover check waits longer than the timeout and we end up in an error anyway.
Since we're triggering most of these from a cluster state changed event, we could grab the stats once at the beginning of the new state, and pass it in to all of these events (so that they use the same stats). This would entail probably not calling the rollover stuff itself, but factoring that logic out so that ILM could use it directly (passing in the stats info as necessary).
For this, we'd make TransportRolloverAction
use stats for all indices gathered every N seconds, similar to what we do for gathering disk usage. It would then scale independently from ILM to as many indices as we want without having to get the stats on-demand each time. This would add complexity and change the contract for rollover that we currently have though, so I don't think I am in favor of this.
Well, those are a few options to kick off the discussion, I don't think I have a favorite yet, and there are undoubtedly more ways to go about fixing this. Going to /cc @gmarouli also since she worked on the initial PR for limiting this also, in case she has some ideas.
We should probably change the label (and maybe changelog entry) for #83832 -- for example, if we disable this by default with 8.2.0, then we wouldn't want to have the changelog go out as if the feature was included as originally written.
~I tweaked the label back to v8.2.0 because this remains an 8.2.0 blocker (but I'll remove the blocker label and update the version label after https://github.com/elastic/elasticsearch/pull/85504 has been merged).~
I reverted https://github.com/elastic/elasticsearch/pull/83832 from the 8.2 branch via https://github.com/elastic/elasticsearch/pull/85504, and I've updated the version tag here to v8.3.0 rather than v8.2.0. For the moment I think this stays as an 8.3.0 blocker until we further resolve things.
I reverted #83832 from master (before the 8.3 branch was created) via https://github.com/elastic/elasticsearch/pull/87054, so I'm removing the 'blocker' label from this issue.
@DaveCTurner and I discussed this issue again today, and I'd like to add additional possible option onto https://github.com/elastic/elasticsearch/issues/85333#issuecomment-1078138710:
Also:
Adding one more to the mix:
83832 introduced a rate limit of 100 per coordinating node for certain stats requests. This causes issues once ILM starts executing more than 100 of them at a time which it can easily do during a rollover-ready check.
For one, if more than 100 indices need to be checked for rollover readiness during one periodic execution of ILM those in excess of the 100 will be marked as in ILM error. More importantly though, this effectively limits us to checking only 100 indices per ILM trigger period (10 minutes by default) which will cause issues in larger deployments.
I think the easiest solution here might be to disable this new functionality by default to not introduce new functionality that might cause trouble elsewhere into 8.2 given how close the 8.2 FF is.