benoitc / hackney

simple HTTP client in Erlang

spawn_link in hackney_pool issue #680

Closed leonardb closed 3 years ago

leonardb commented 3 years ago

With the change in https://github.com/benoitc/hackney/pull/674 to spawn_link, the caller process will receive the exit message.

If the caller is trapping exits this can be quite unexpected and problematic.

This was quite a surprise when I upgraded, as I was only expecting exit signals from a potential crash in RMQ connections handled by the same server, in which case the gen_server would re-initialize.

Since the caller has no access to the Pid of the spawned process, it's not possible to cleanly handle this in the caller, and the EXIT signal will always propagate.
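
To illustrate the surprise (a minimal sketch, not hackney code; the module name and the RMQ detail are hypothetical, taken from the scenario above): a gen_server that traps exits suddenly starts seeing 'EXIT' messages from a pid it never created or linked itself.

-module(exit_surprise_sketch).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
  gen_server:start_link(?MODULE, [], []).

init([]) ->
  %% Trap exits because we expect links to e.g. an RMQ connection.
  process_flag(trap_exit, true),
  {ok, #{}}.

handle_call(_Msg, _From, State) ->
  %% Somewhere in here the server calls hackney; with spawn_link the
  %% pool's checkout process is now linked to us as well.
  {reply, ok, State}.

handle_cast(_Msg, State) ->
  {noreply, State}.

handle_info({'EXIT', _Pid, _Reason}, State) ->
  %% Expected: an exit from a connection we linked ourselves, in which
  %% case we would re-initialize. Unexpected: an exit from the checkout
  %% process spawned inside hackney_pool. Since its pid was never
  %% exposed to us, the two cases cannot be told apart here.
  {noreply, State};
handle_info(_Other, State) ->
  {noreply, State}.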

A possible solution may be to use a monitor instead of a link:

checkout(Host, Port, Transport, Client) ->
  Requester = self(),
  Ref = make_ref(),
  Fun =
    fun() ->
      Result =
        try
          do_checkout(Requester, Host, Port, Transport, Client)
        catch _:_ ->
          {error, checkout_failure}
        end,
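      %% Carry the result inside the exit reason so the 'DOWN' message
      %% below doubles as the reply; no separate message send is needed.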
      exit({normal, {checkout, Ref, Result}})
    end,
  {ReqPid, ReqRef} = spawn_monitor(Fun),
  receive
    {'DOWN', ReqRef, process, ReqPid, {normal, {checkout, Ref, Result}}} ->
      Result;
    {'DOWN', ReqRef, process, ReqPid, _Error} ->
      {error, checkout_failure}
  end.
benoitc commented 3 years ago

Using spawn_monitor will not solve the issue the change to spawn_link was intended to fix.

I think an appropriate change for this version would be going back to using spawn and handling the race condition during the checkout. I will have a look ASAP.

leonardb commented 3 years ago

@benoitc Maybe I'm misunderstanding the entire issue.

Why would it not solve the problem? The receive statement should catch any exit where the spawned process crashes, as per the description of the issue in https://github.com/benoitc/hackney/issues/675.

In the case of a non-normal exit, this is matched by the second clause of the receive, allowing a clean error return to the caller.

The only real difference is that you're not allowing a spurious exit message to propagate out of checkout.

Happy to be educated here.

Edit

I went back and re-read the initial issue, and it seems I totally misunderstood. The specific case was the request crashing, not an issue with checkout.

benoitc commented 3 years ago

@leonardb correct, the spawn_link is intended to fix the case where the requester crashes. The linked process doing the checkout exits when that happens.
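
For context, roughly the race the link addresses (a simplified sketch, not the actual hackney_pool code): with a plain spawn, nothing ties the checkout process to the requester's lifetime, so a connection taken out of the pool can be stranded.

%% Simplified sketch of the race, not the real hackney_pool code.
checkout_race_sketch(Requester, Ref, DoCheckout) ->
  spawn(fun() ->
    Result = DoCheckout(),              %% takes a connection out of the pool
    Requester ! {checkout, Ref, Result}
    %% If Requester crashed in the meantime, nobody receives this message
    %% and nobody checks the connection back in, so in_use_count keeps
    %% growing. spawn_link makes this process exit together with the
    %% requester instead, at the cost of the exit signal described above.
  end).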

benoitc commented 3 years ago

@leonardb #681 should fix your issue.

@ronaldwind can you also test this patch and let me know how it behaves on your side? This should also fix the issue you intended to solve with spawn_link in #674.

benoitc commented 3 years ago

1.17.3 has been released.

ronaldwind commented 3 years ago

@benoitc Too bad, this change reintroduces the problem I had. I've upgraded to 1.17.3 and ran the snippet from my issue at https://github.com/benoitc/hackney/issues/675, and the in_use_count once again shows increasing numbers.

benoitc commented 3 years ago

It should decrease, though. Or maybe it is only the counter that is not being reduced?

ronaldwind commented 3 years ago

The in_use_count does not decrease as time passes. I've waited for 10+ minutes and the count is still the same.

I've also tried filling up the pool to the point where in_use_count equals max. At that point no new requests go through and I simply receive a :checkout_timeout:

iex(1)> :hackney.request(:get, "https://github.com/benoitc/hackney", [], "", [])
{:error, :checkout_timeout}

In short, it does not seem that the counter is just reporting wrong numbers, and it looks like the exact same behaviour as in 1.17.0.

benoitc commented 3 years ago

@ronaldwind can you try this patch

diff --git src/hackney_pool.erl src/hackney_pool.erl
index c913998..ba3a739 100644
--- src/hackney_pool.erl
+++ src/hackney_pool.erl
@@ -75,8 +75,8 @@ checkout(Host, Port, Transport, Client) ->
           Requester ! {checkout, Ref, Result};
         false ->
           case Result of
-            {ok, SocketRef, Socket} ->
-              checkin(SocketRef, Socket);
+            {ok, {_Name, ConnRef, Connection, Owner, Transport}, Socket} ->
+              gen_server:call(Owner, {checkin, ConnRef, Connection, Socket, Transport}, infinity);
             _Error ->
               ok
           end

Also, can you send me a trace? You can send it to me privately by email if you want.

benoitc commented 3 years ago

@ronaldwind I will run a test later today, but if you have the chance to do it, please let me know :)

leonardb commented 3 years ago

This patch appears to resolve the issue.

Just for testing, I altered the patch so that the caller process being dead is guaranteed (no timing issues).

diff --git a/src/hackney_pool.erl b/src/hackney_pool.erl
index c913998..e0c7101 100644
--- a/src/hackney_pool.erl
+++ b/src/hackney_pool.erl
@@ -70,13 +70,14 @@ checkout(Host, Port, Transport, Client) ->
         catch _:_ ->
           {error, checkout_failure}
         end,

and tested as follows:

Erlang/OTP 23 [erts-11.1.8] [source] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [hipe]

Eshell V11.1.8 (abort with ^G)
1> application:ensure_all_started(hackney).
{ok,[unicode_util_compat,idna,mimerl,certifi,syntax_tools,
     parse_trans,ssl_verify_fun,metrics,hackney]}
2> hackney_pool:start_pool(test_pool, [{checkout_timeout, 100}, {timeout, 1000}, {max_connections, 10}]).
ok
3> [begin
3>    P = spawn(
3>          fun() ->
3>            Res = case hackney:request(get, "https://github.com/benoitc/hackney", [], "", [{pool, test_pool}]) of
3>                    {ok, _, _, _} ->
3>                      ok;
3>                    Err ->
3>                      Err
3>                  end,
3>            io:format("~p : ~p~n", [X, Res])
3>          end)
3>  end || X <- lists:seq(1,20)].
[<0.429.0>,<0.430.0>,<0.431.0>,<0.432.0>,<0.433.0>,
 <0.434.0>,<0.435.0>,<0.436.0>,<0.437.0>,<0.438.0>,<0.439.0>,
 <0.440.0>,<0.441.0>,<0.442.0>,<0.443.0>,<0.444.0>,<0.445.0>,
 <0.446.0>,<0.447.0>,<0.448.0>]
4> timer:sleep(1000).
ok
5> hackney_pool:get_stats(test_pool).
[{name,test_pool},
 {max,10},
 {in_use_count,0},
 {free_count,12},
 {queue_count,0}]

Aside: I'm still not certain this is the correct way to handle the issue, as isn't there still a possibility that the Requester dies before receiving the connection, or fails to return it? I'm wondering if maybe the pool should be monitoring the Requester instead and reclaim the connection if the Requester dies without it being returned (a rough sketch of that idea follows). I'm also a bit confused, as the code in hackney_pool seems to indicate that the Requester is being monitored in some fashion.
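
Sketched out (hypothetical pool gen_server clauses and state shape, not the existing hackney_pool code): the pool monitors the requester at checkout time and reclaims the connection on 'DOWN' if it was never checked back in.

%% Hypothetical pool clauses, not the existing hackney_pool code.
%% State is a map #{free := [Conn], in_use := #{MonitorRef => Conn}};
%% queueing when no free connection is available is omitted.
handle_call({checkout, Requester}, _From,
            #{free := [Conn | Free], in_use := InUse} = State) ->
  MRef = erlang:monitor(process, Requester),
  {reply, {ok, MRef, Conn}, State#{free := Free, in_use := InUse#{MRef => Conn}}};
handle_call({checkin, MRef, Conn}, _From,
            #{free := Free, in_use := InUse} = State) ->
  erlang:demonitor(MRef, [flush]),
  {reply, ok, State#{free := [Conn | Free], in_use := maps:remove(MRef, InUse)}}.

handle_info({'DOWN', MRef, process, _Requester, _Reason},
            #{free := Free, in_use := InUse} = State) ->
  %% Requester died without returning the connection: reclaim it so
  %% in_use_count drops back down instead of leaking.
  case maps:take(MRef, InUse) of
    {Conn, InUse1} -> {noreply, State#{free := [Conn | Free], in_use := InUse1}};
    error          -> {noreply, State}
  end.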

ronaldwind commented 3 years ago

@benoitc I can also confirm the patch works fine! Thanks @leonardb for testing it too.

aboroska commented 3 years ago

@benoitc Is it an oversight that hackney_pool:monitor_client/3 does not actually call erlang:monitor/2?

https://github.com/benoitc/hackney/blob/039db6033b9f81d75c9b0b672f9f8b96e9111575/src/hackney_pool.erl#L628-L630

Where is the monitor placed on the Requester?

benoitc commented 3 years ago

thanks @ronaldwind @leonardb I will check it myself. Expect a new release later today.

@leonardb so there is a new mechanism for handling the connections that will come in the next major version. But right now the connections are indeed monitored via hackney_manager. The current patch is about fixing a race condition that happens when the requester dies before it's managed. All of this system is overly complicated and will change soon. One of its advantages, however, is that it lets you pass control of a connection to another process if needed.
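
For readers unfamiliar with that ownership idea: at the plain-socket level this is what OTP's controlling_process does. A minimal gen_tcp sketch of the concept only (not hackney's own API; host and request are placeholders):

%% Plain gen_tcp sketch of handing a connection to another process;
%% this only illustrates the ownership concept, not hackney's API.
handover_sketch() ->
  {ok, Socket} = gen_tcp:connect("example.com", 80, [binary, {active, false}]),
  Worker = spawn(fun() ->
    receive
      {go, S} ->
        ok = gen_tcp:send(S, <<"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n">>),
        {ok, _Resp} = gen_tcp:recv(S, 0),
        gen_tcp:close(S)
    end
  end),
  %% Transfer socket ownership before the worker starts using it.
  ok = gen_tcp:controlling_process(Socket, Worker),
  Worker ! {go, Socket},
  ok.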

benoitc commented 3 years ago

@benoitc Is it an oversight that hackney_pool:monitor_client/3 does not actually call erlang:monitor/2?

https://github.com/benoitc/hackney/blob/039db6033b9f81d75c9b0b672f9f8b96e9111575/src/hackney_pool.erl#L628-L630

Where is the monitor placed on the Requester?

https://github.com/benoitc/hackney/blob/master/src/hackney_connect.erl#L224

aboroska commented 3 years ago

@benoitc You mean hackney_manager:update_state/1? That does not place any monitor. I am still confused.

I only found two places where a monitor is placed. Here: https://github.com/benoitc/hackney/blob/039db6033b9f81d75c9b0b672f9f8b96e9111575/src/hackney_stream.erl#L28

and here:

https://github.com/benoitc/hackney/blob/039db6033b9f81d75c9b0b672f9f8b96e9111575/src/hackney_manager.erl#L554

benoitc commented 3 years ago

I don't use a monitor; instead I link the processes and trap exits in the manager:

https://github.com/benoitc/hackney/blob/master/src/hackney_manager.erl#L570

aboroska commented 3 years ago

Oh, I see. Then, in case of a Requester exit, the manager sends an explicit 'DOWN' message to the pool process: https://github.com/benoitc/hackney/blob/master/src/hackney_manager.erl#L387 That is how the client is removed and in_use_count is finally reduced in hackney_pool.
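
In other words, the pattern is roughly this (a simplified sketch, not the actual hackney_manager code; the state shape is hypothetical):

%% Simplified sketch of the mechanism described above, not the actual
%% hackney_manager code. The manager links to each requester, traps
%% exits, and turns a requester exit into an explicit 'DOWN' message
%% for the pool, which then releases the connection.
init([]) ->
  process_flag(trap_exit, true),
  {ok, #{owners => #{}}}.   %% hypothetical: Owner pid => {PoolPid, Ref}

handle_info({'EXIT', Owner, Reason}, #{owners := Owners} = State) ->
  case maps:take(Owner, Owners) of
    {{PoolPid, Ref}, Owners1} ->
      %% Synthesized 'DOWN', as if the pool had monitored the owner itself.
      PoolPid ! {'DOWN', Ref, process, Owner, Reason},
      {noreply, State#{owners := Owners1}};
    error ->
      {noreply, State}
  end.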

Thank you @benoitc. I'm adding this here so that others wondering about the same thing might find it useful.

benoitc commented 3 years ago

1.17.4 has been published @aboroska @ronaldwind