No idea what the root cause is based on the description. I would need a way to reproduce it to be able to investigate.
Unfortunately we have no way to reproduce it, because these timeouts are pretty rare. Let me try to set up some tracing and come back with my findings.
So I have set up some tracing on NimblePool using Rexbug and added some logging via the Finch telemetry events for request start and queue stop (emitted when checkout is done). I am attaching both files here. In the log file you can see two periods when NimblePool stops responding for a while (this is also when we receive the timeouts; unfortunately the error output is not in this file, but usually it is the first request ids that time out, which shows up in the logs as a missing "checkout completed" entry for that request id).

I could relate these periods to :DOWN events in the trace file, although they are not 1:1 (there are 3 :DOWN events, but 2 periods with timeouts). I will investigate further and try to understand NimblePool better (especially what happens on these :DOWN events, and what could make NimblePool so busy), but if this is already enough information to suggest what to look out for, I would welcome it.
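For reference, here is a minimal sketch of that instrumentation in Elixir; the handler id, the log format, and the Rexbug match spec and limits are illustrative choices, not the exact setup used:

```elixir
# Log the two Finch telemetry events mentioned above: request start and
# queue stop (the latter fires once a connection checkout completes).
:telemetry.attach_many(
  "finch-pool-debugging",
  [[:finch, :request, :start], [:finch, :queue, :stop]],
  fn event, measurements, metadata, _config ->
    IO.inspect({event, measurements, metadata}, label: "finch telemetry")
  end,
  nil
)

# Trace calls into NimblePool itself, including return values.
Rexbug.start("NimblePool :: return", time: 300_000, msgs: 10_000)
```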
What is the throughput of this pool? Maybe check whether increasing the pool count would help.
On high-throughput pools NimblePool can become a serialization bottleneck, and Finch's pool count option is useful in situations like this.
Another important point to note is that Finch's pool is lazy. From the Finch docs https://hexdocs.pm/finch/Finch.HTTP1.PoolMetrics.html#summary:
A given number X of available_connections does not mean that there currently exist X connections to the server sitting on the pool. Because Finch uses a lazy strategy for worker initialization, every pool starts with its size as available connections even if they are not started yet. In practice this means that available_connections may be connections sitting on the pool or available space on the pool for a new one if required.
I bring this up because what could be happening is some sort of problem on the upstream service that leads to failures during worker (connection) initialization, which may look like a NimblePool problem on checkout.
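To illustrate the pool count suggestion, here is a minimal sketch of a Finch child spec with several pools per host; the application name, host, and numbers are placeholders rather than the reporter's actual configuration:

```elixir
# Several smaller pools instead of one big one; NimblePool serializes
# checkouts per pool, so count > 1 spreads that work across pool processes.
{Finch,
 name: MyApp.Finch,
 pools: %{
   "https://upstream.example.com" => [size: 50, count: 4],
   :default => [size: 50, count: 1]
 }}
```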
What is the throughput of this pool? Maybe check whether increasing the pool count would help.
On high-throughput pools NimblePool can become a serialization bottleneck, and Finch's pool count option is useful in situations like this.
Throughput is not that big, because the requests are pretty long-running (anywhere from tens of seconds to a couple of minutes).
Another important point to note is that Finch's pool is lazy. From the Finch docs https://hexdocs.pm/finch/Finch.HTTP1.PoolMetrics.html#summary:
A given number X of available_connections does not mean that there currently exist X connections to the server sitting on the pool. Because Finch uses a lazy strategy for worker initialization, every pool starts with its size as available connections even if they are not started yet. In practice this means that available_connections may be connections sitting on the pool or available space on the pool for a new one if required.
I bring this up because what could be happening is some sort of problem on the upstream service that leads to failures during worker (connection) initialization, which may look like a NimblePool problem on checkout.
I can easily imagine problems with the called service, it has always been a bit unstable, but why does this affect NimblePool so heavily? I feel that problems with connections or unexpected responses should not affect the pool this way (big slowdowns).
I have investigated further and found that it could be related to a Mint socket; I am attaching some traces demonstrating the behaviour.
I have followed the trail, and all the data suggests the problem is with closing the SSL connection, more precisely in :ssl_gen_statem. Any ideas why closing might take so long (~2 minutes, when a 5 second timeout is specified), how to improve it, or who to contact?
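For anyone wanting to reproduce that observation, here is a minimal sketch of the kind of trace that can surface it; the match spec and the time/message limits are my own choices, not the ones used above:

```elixir
# Capture calls to :ssl.close and their return values; slow returns here
# point at the :ssl_gen_statem shutdown path mentioned above.
Rexbug.start(":ssl.close :: return", time: 300_000, msgs: 1_000)
```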
I can easily imagine problems with the called service, it has always been a bit unstable, but why does this affect NimblePool so heavily? I feel that problems with connections or unexpected responses should not affect the pool this way (big slowdowns).
Some pool callbacks are executed inside the pool process, which may lead to unresponsiveness. If you are having problems with worker termination (closing the connection), this may be the root cause of the pool unresponsiveness, since terminate_worker may be called inside the pool process.
I think the quick way to solve this on your side is, instead of closing the connection in terminate_worker/3, to just spawn a process that does it. But I think the root cause is the connection-closing timeouts.
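A minimal sketch of that suggestion, assuming a NimblePool callback module whose worker state is a Mint connection (as in Finch's HTTP/1 pool); the module name is made up, and whether the socket can be closed cleanly from another process without first transferring control is an assumption here:

```elixir
defmodule MyApp.PoolConn do
  @behaviour NimblePool

  # Only the callback relevant to this discussion is shown; the other
  # NimblePool callbacks (init_worker/1, handle_checkout/4, ...) are omitted.
  @impl NimblePool
  def terminate_worker(_reason, conn, pool_state) do
    # Hand the potentially slow SSL close off to a throwaway process so the
    # pool process is not blocked while :ssl_gen_statem winds the socket down.
    Task.start(fn -> Mint.HTTP.close(conn) end)

    {:ok, pool_state}
  end
end
```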
Either way, IMHO the issue is related more to Finch/Mint than to NimblePool itself. Maybe we could offer an async worker termination option, but clients would have to opt in anyway.
Couldn't Finch in this case spawn a separate process for termination, as it prefers, instead of depending on NimblePool?
I agree that the issue is more related to Finch/Mint, or maybe even something deeper (like Erlang's :ssl), but I will add more tracing and investigate further to find exactly what is causing it. Maybe having a separate process makes sense in NimblePool or Finch, but we will use it only if we are sure there is no other solution and that we are not hiding a different problem in our use case.
Couldn't Finch in this case spawn a separate process for termination, as it prefers, instead of depending on NimblePool?
Yes! I think this is indeed the easiest thing to do. Spawning a process is so easy that I think clients can already achieve this behaviour without any changes to NimblePool.
Perfect, thanks for the help @oliveigah. Closing this then. I recommend adding an option to Finch for async termination so people can give it a try.
Hello, we are using Tesla with Finch for our HTTP requests. Finch uses NimblePool for connection pooling. We have observed that for one type of request we frequently get "out of connections" errors. After analyzing further, improving the configuration, and adding more observability, we reduced these errors considerably, but we still see the problem occasionally (only for this request type). I will provide some log information, which we gather using Telemetry and Sentry:
I have replaced some less relevant values with variable names and removed the sensitive parts. As I interpret it, the pool still has around 190 connections available, but for some reason checkout still times out (configured with [receive_timeout: 180000, pool_timeout: 10000]). Do you have any ideas what might cause this and how we can improve it?
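For context, a minimal sketch of how such a setup is typically wired up; the Finch name, pool size, and empty middleware list are assumptions, only the two timeout values come from this report:

```elixir
# Finch pool started in the supervision tree (size 200 is a guess based on
# the ~190 available connections mentioned above).
children = [
  {Finch, name: MyApp.Finch, pools: %{default: [size: 200, count: 1]}}
]

# Tesla client using the Finch adapter with the timeouts from this report.
client =
  Tesla.client(
    [],
    {Tesla.Adapter.Finch,
     name: MyApp.Finch, receive_timeout: 180_000, pool_timeout: 10_000}
  )
```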