elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Push Back on Excessive Snapshot Repository API Requests #55153

Open original-brownbear opened 4 years ago

original-brownbear commented 4 years ago

Currently, requests for the status of snapshots (TransportGetSnapshotsAction as well as TransportSnapshotsStatusAction) can result in long-running executions on the generic thread pool. This is especially true for TransportSnapshotsStatusAction, which can easily take multiple minutes to run for cloud-backed repositories and large snapshots. If a client sends a number of these requests at once, it can tie up a large number of generic pool threads and create significant heap pressure.

One scenario where this could become troublesome is a client that retries a slow snapshot status request because its client-side timeout expires before the slow API call can finish, adding ever more tasks to the GENERIC pool on the master node. Another scenario we have observed is a user simply sending status requests for multiple snapshots in parallel, causing a number of multi-second tasks to run on the master's generic pool at the same time, destabilizing the master node through heap pressure and potentially causing significant latency on the generic pool.
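To make the second scenario concrete, here is a hypothetical sketch (not taken from this issue) of a client querying the status of several snapshots in parallel with the low-level Java REST client; the host, repository, and snapshot names are made up. Each call is ultimately served by TransportSnapshotsStatusAction on the master:

```java
// Hypothetical illustration: a client firing snapshot status requests for
// several snapshots in parallel, each of which becomes a long-running task
// on the master's generic pool.
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

import java.util.ArrayList;
import java.util.List;

public class ParallelSnapshotStatusRequests {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            List<Thread> threads = new ArrayList<>();
            for (String snapshot : new String[] { "snap-1", "snap-2", "snap-3" }) {
                Thread t = new Thread(() -> {
                    try {
                        // each status call maps to TransportSnapshotsStatusAction on the master
                        Response response = client.performRequest(
                                new Request("GET", "/_snapshot/my_repository/" + snapshot + "/_status"));
                        System.out.println(snapshot + ": " + response.getStatusLine());
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
                t.start();
                threads.add(t);
            }
            for (Thread t : threads) {
                t.join(); // wait for all requests before closing the client
            }
        }
    }
}
```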

Currently, there is no push-back against a flood of snapshot status requests from a client other than the (real-memory) circuit breaker. Given that it's fairly easy to DoS a master node via TransportSnapshotsStatusAction calls, should we add a mechanism to push back on these requests and limit how many of them we service concurrently?

Similar to https://github.com/elastic/elasticsearch/issues/51992 but affecting the generic pool.

elasticmachine commented 4 years ago

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

original-brownbear commented 4 years ago

We discussed this during our team meeting and decided to directly limit the number of concurrent snapshot status API requests we allow, using a counter. We also discussed the alternative of using a separate thread pool for executing these requests. The upsides of the thread-pool approach would have been:

The upside of the count-in-flight-and-reject-outright approach, on the other hand, is its simplicity. Limiting the number of concurrent snapshot status requests outright allows us to fix the potential to DoS a master node without changing other characteristics of the thread-pooling system, and with little implementation effort. The fact that this approach gives us no queuing of requests was not seen as a big issue, since these requests are not essential to the functioning of the cluster and it can be assumed that users can retry them easily. These requests are also expensive for the cluster to answer (in terms of IO and time), and supporting heavy load on these APIs is not a priority, which makes queuing less necessary.

Queuing these requests would also have a downside: if clients were to time out and retry, the queue could leave the API blocked on requests that have already timed out client-side and that newly issued requests would now have to wait for needlessly. This could be solved by more elaborate queuing that rejects requests from the queue after a given timeout, or similar measures, but those would be considerably more complex to implement than simply limiting concurrency on these requests.

We discussed the lack of monitoring as a potential issue and something that would be desirable to have going forward (e.g. by having these long-running status requests show up in the tasks API). Since we do not have that monitoring today either, we decided that this can be revisited at a later date.
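As a rough illustration of the chosen direction, the following is a minimal sketch of counting in-flight requests and rejecting outright once a limit is reached. The class name, limit handling, and exception type are hypothetical and not taken from the Elasticsearch code base:

```java
// Hypothetical sketch: cap the number of in-flight snapshot status requests
// with a simple counter and reject new requests outright once the limit is hit.
import java.util.concurrent.atomic.AtomicInteger;

public class SnapshotStatusThrottler {

    private final int maxConcurrentRequests;
    private final AtomicInteger inFlight = new AtomicInteger();

    public SnapshotStatusThrottler(int maxConcurrentRequests) {
        this.maxConcurrentRequests = maxConcurrentRequests;
    }

    /**
     * Runs the given status request if the concurrency limit allows it,
     * otherwise rejects it immediately so the client can retry later.
     */
    public void execute(Runnable statusRequest) {
        if (inFlight.incrementAndGet() > maxConcurrentRequests) {
            inFlight.decrementAndGet();
            throw new IllegalStateException(
                    "too many concurrent snapshot status requests, try again later");
        }
        try {
            statusRequest.run();
        } finally {
            inFlight.decrementAndGet(); // always release the slot
        }
    }
}
```

Since the real transport actions complete asynchronously, an actual implementation would release the slot in a response/failure listener rather than in a synchronous finally block; the sketch uses a plain Runnable only to keep the concurrency-limiting idea visible.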