benoitc / hackney

simple HTTP client in Erlang

max_connections is not respected and connections pool grows until connections are closed by idle timeout #648

Closed DmitryKakurin closed 4 months ago

DmitryKakurin commented 4 years ago

Versions:

Hackney: 1.16.0
HTTPoison: 1.7.0
Elixir/Erlang:

Erlang/OTP 21 [erts-10.3] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]
Elixir 1.10.3 (compiled with Erlang/OTP 21)

Simplest Repro:

Set max pool size to 1 and hit 2 different servers.

Interactive Elixir (1.10.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :hackney_pool.start_pool(:default, max_connections: 1)
:ok

iex(2)> HTTPoison.get("example.com")
{:ok,  %HTTPoison.Response{ …

iex(3)> HTTPoison.get("example.net")
{:ok,  %HTTPoison.Response{ …

iex(4)> :hackney_pool.get_stats(:default)
[name: :default, max: 1, in_use_count: 0, free_count: 2, queue_count: 0]

free_count is 2, but it should not be greater than max, which is 1.
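
To make the growth more visible, here is a variation of the same repro (the third host is arbitrary, and the expected stats are extrapolated from the run above rather than captured output): each additional distinct host adds another idle connection, so free_count keeps growing past max.

:hackney_pool.start_pool(:default, max_connections: 1)

for host <- ["example.com", "example.net", "example.org"] do
  # each request succeeds and its connection is checked back into the pool
  {:ok, _response} = HTTPoison.get("http://" <> host)
end

:hackney_pool.get_stats(:default)
# expected, given the behavior above: free_count == 3 even though max == 1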

Suspected bug location

I suspect the bug was introduced in this commit: ec5b90c5d8d8e4e5ecc6e6f0978c416df84940ca. Note how the check if PoolSize =< MaxConn was not carried over from handle_call({checkin, ...}) into the deliver_socket function for the empty case.

The new check dict:size(Clients) >= MaxConn in handle_call({checkout, ...}) ensures that no more than MaxConn clients will be waiting for a connection at the same time, but it is not sufficient to limit the total pool size.

benoitc commented 4 years ago

Do the requests return to the pool? What is the Response thing that was returned? Does it contain the body?

DmitryKakurin commented 4 years ago

@benoitc the code repro above is literally runnable (example.com and .net are real existing domains). Both requests return 200. Given free_count: 2 I assume yes, connections are returned to the pool.

jdppettit commented 3 years ago

I'm running into this problem as well - any updates on where things are on this?

benoitc commented 3 years ago

There will be a release this weekend including a fix for it.

sorliem commented 3 years ago

I was bitten pretty severely by this bug a couple of days ago and had to disable pooling entirely. Is there a release coming out that contains the fix?

benoitc commented 3 years ago

What happened? How did you reproduce the issue?

There is a release that will land today that changes the way connections are handled, but the problem above shouldn't trigger anything severe if connections are released correctly to the pool.

sorliem commented 3 years ago

I have code in my system using hackney that calls 3rd-party services to get data. I saw :checkout_timeout errors explode, added some metrics to monitor the in_use value returned from :hackney_pool.get_stats(:default), and redeployed.
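
The monitoring itself is nothing fancy; a simplified sketch of that kind of stats polling (module name and interval are made up, it only relies on :hackney_pool.get_stats/1 shown earlier):

defmodule PoolStatsReporter do
  use GenServer

  # poll the pool every 30 seconds (interval is arbitrary)
  @interval 30_000

  def start_link(pool \\ :default),
    do: GenServer.start_link(__MODULE__, pool, name: __MODULE__)

  @impl true
  def init(pool) do
    schedule()
    {:ok, pool}
  end

  @impl true
  def handle_info(:report, pool) do
    stats = :hackney_pool.get_stats(pool)
    # stats is a keyword list, e.g.
    # [name: :default, max: 200, in_use_count: 3, free_count: 5, queue_count: 0];
    # in a real system these values would be pushed to a metrics library
    IO.inspect(Keyword.take(stats, [:in_use_count, :free_count, :queue_count]),
      label: "hackney pool #{inspect(pool)}")

    schedule()
    {:noreply, pool}
  end

  defp schedule, do: Process.send_after(self(), :report, @interval)
end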

The following screenshot (2020-12-11-hackney_pool_3) shows a new k8s deploy of 10 pods, each running one default hackney pool. The lines going steadily up are each pod reporting its in_use connections; all of them rise to the 200 max connections I configured. Shortly after this point the checkout_timeout errors started showing up.

benoitc commented 3 years ago

Does your code always make sure to read the body? That said, I am working on replacing the pool. The code will be ready for consumption tomorrow. It took more time than expected...

dbhobbs commented 3 years ago

Is reading the body required? I don't see that mentioned in any of the documentation. :thinking:

benoitc commented 3 years ago

Yes. To release the socket you need to either read the body (which is always done when using the with_body option in the request) or skip it using the skip_body function or similar. Closing the request also works. If a response is not read completely and the process is still active, hackney currently considers it an active session.
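
In other words (a minimal sketch using hackney directly; the URL is a placeholder and error handling is omitted), each of these paths releases the connection:

# 1. Read the body explicitly, or let with_body: true do it for you.
{:ok, _status, _headers, ref} = :hackney.get("http://example.com/", [], "", [])
{:ok, _body} = :hackney.body(ref)   # reading the whole body releases the socket

{:ok, _status, _headers, _body} =
  :hackney.get("http://example.com/", [], "", with_body: true)

# 2. Skip the body when only the status/headers matter.
{:ok, _status, _headers, ref} = :hackney.get("http://example.com/", [], "", [])
:ok = :hackney.skip_body(ref)

# 3. Close the request; the connection is given up rather than reused.
{:ok, _status, _headers, ref} = :hackney.get("http://example.com/", [], "", [])
:ok = :hackney.close(ref)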

benoitc commented 3 years ago

That should indeed be mentioned in the doc...

sorliem commented 3 years ago

Oh, ok. We may have some locations where we are not reading the body. Would you be able to point me to the right location in the documentation that says that?

benoitc commented 3 years ago

That's not written explicitly in the doc, I think. A connection is removed from the pool automatically if the process is closed before releasing it or if the whole body has been read; otherwise (logically, I would say) there is no way to know whether the connection has to be released.

alexgleason commented 3 years ago

https://github.com/benoitc/hackney/blob/master/NEWS.md

1.17.0

fix memory leak in connection pool

Is this fixed now?

We downgraded hackney to 1.15.2 and wondering if it's safe to upgrade again: https://git.pleroma.social/pleroma/pleroma/-/issues/2101

benoitc commented 3 years ago

Watch for the release this weekend. Hackney will be bumped to 2.0. This will be announced next week also :)

kuznetskikh commented 2 years ago

Hello @benoitc! Any update on this issue?

I'm facing it on httpoison 1.8.0 and hackney 1.18.0. Hackney still doesn't respect max_connections, as in @DmitryKakurin's example.

benoitc commented 2 years ago

@kuznetskikh does httpoison read the body? The body needs to be read or skipped to release the connection.

kuznetskikh commented 2 years ago

Hello @benoitc. I'm sorry for the late response; it's working fine for me now. I had just incorrectly configured the hackney pool on application start. Now all is fine; max_connections is taken into account using:

:hackney_pool.child_spec(:main_api_pool, timeout: 15_000, max_connections: 1)

And only 1 connection is available.
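
For anyone else wiring this up, a minimal sketch of that setup (the application module and URL are illustrative; the pool options are the ones from the child_spec above): start the pool under the application's supervisor and point requests at it through the hackney options.

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # named hackney pool with a hard cap of 1 connection
      :hackney_pool.child_spec(:main_api_pool, timeout: 15_000, max_connections: 1)
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

# requests opt into that pool explicitly
HTTPoison.get("https://example.com/", [], hackney: [pool: :main_api_pool])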

Thank you!

benoitc commented 4 months ago

closing the ticket since the issue was resolved. Thank you @kuznetskikh for the feedback and sorry for the late resolution.