clj-commons / aleph

Asynchronous streaming communication for Clojure - web server, web client, and raw TCP/UDP
http://aleph.io
MIT License
2.54k stars 241 forks source link

HTTP client connection pool timeout doesn't free up queue #713

Closed DerGuteMoritz closed 7 months ago

DerGuteMoritz commented 7 months ago

Problem

The pool-timeout option of aleph.http/request controls how long the caller is willing to await a connection to be acquired from the connection pool. When it expires, the response deferred is rejected with a PoolTimeoutException. However, the "acquire" operation will still clog up the pool's queue until it gets its (by then unnecessary) turn.

Reproduction test case (in aleph.http-test):

(deftest test-pool-timeout
  (let [starved-pool (http/connection-pool
                      {:total-connections 0
                       :max-queue-size 1})]
    (try
      (is (thrown? PoolTimeoutException
                   @(http-get "/" {:pool starved-pool :pool-timeout 2})))
      (is (thrown? PoolTimeoutException
                   @(http-get "/"  {:pool starved-pool :pool-timeout 2})))
      (finally
        (.shutdown ^Pool starved-pool)))))

The second assertion fails with:

expected: (thrown?
           PoolTimeoutException
           @(http-get "/" {:pool starved-pool, :pool-timeout 2}))
   error: java.util.concurrent.RejectedExecutionException

Solution

Eagerly removing the acquire operation from the pool's queue would require changing io.aleph.dirigiste.Pool#acquire to support this. Its (internal) queue already supports this via cancelTake, it just isn't reachable from the public API.

Alternatively: Perhaps the current behavior is fine / good enough? It just happened to surprise me but that could also be addressed with a docstring change.

KingMob commented 7 months ago

Roping in @alexander-yakushev in case they have a comment.

It would be nice to cancel the pending acquire if it's taken too long, but it would also be fine w me to xform the rejected ex into a pool timeout ex.

If you exhaust all conns in reality, you're going to get exceptions of some sort.

alexander-yakushev commented 7 months ago

I have a few comments about this indeed.

I found that in practice (in my work, I can be biased), Aleph's approach to having a separate queue for acquiring connections is unnecessary. Caching connections is vital because you can't keep creating and disposing them at a high rate (due to TIME_WAIT), but limiting the total number of them is not as important. I personally set the connection limit to 16k per host and pool queue size to 0, so that a fresh connection is immediately created for the acquirer each time there are no free connections in the pool, up until the limit is hit at which point it's an error.

There are some cases where you would use the pool with the queue as a parallelism controller. I think all those cases could be rewritten to something like Semaphores if Aleph connection pools offered no queueing.

So, this is my take. In my world, the breaking Aleph2 would have connection pools with no acquisition queues. Such pools are much much easier to implement, way more performant at high loads, and give the user one less option and timeout to think about. But I understand that people can have other usecases where this won't fly.

alexander-yakushev commented 7 months ago

Sorry I haven't answered the original question. I think the current behavior is fine. I doubt anybody fine-tunes :pool-timeout and handles it somehow specially, it is an awkward option to give to the user and work with.

DerGuteMoritz commented 7 months ago

Thanks for chiming in @alexander-yakushev, very good insights! I agree that the acquire queue and timeout are quite awkward and needlessly complicate matters. I also agree that limiting parallelism should rather be handled by the caller/application itself. For example, if I'm already using core.async, I'd rather use that instead of having to also deal with RejectedExecutionException and PoolTimeoutException. I wonder whether this must necessarily be considered a breaking change: Most users probably won't notice much of a difference and, as pointed out by @KingMob, they will already have to deal with exceptions due connection exhaustion in some manner anyway. And we could just keep around the old pool implementation (maybe move to a namespace which isn't loaded by default) so users who do rely on that specific behavior can still swap it back in. I filed https://github.com/clj-commons/aleph/issues/716 to track this idea.

As for status quo: Given all of the above, I think it's indeed fine to keep the behavior as-is. Closing but @KingMob feel free to re-open if you'd still like to see a change!