
Erlang/OTP
http://erlang.org
Apache License 2.0

ERL-622: Deadlock when SSL sockets are simultaneously sending/receiving data and buffers are full #3420

Closed OTP-Maintainer closed 3 years ago

OTP-Maintainer commented 6 years ago

Original reporter: jmf_erl
Affected versions: OTP-20.3, OTP-20.3.6
Fixed in version: OTP-21.1
Component: ssl
Migrated from: https://bugs.erlang.org/browse/ERL-622


I found an issue when running RabbitMQ servers with inter-node SSL.
The discussion about the issue is in https://groups.google.com/forum/#!search/RabbitMQ$20Inter-Node$20SSL$20problems/rabbitmq-users/Ig28xQbj91Y/uIYxIPslBAAJ

A user eventually recommended "asking this on erlang-questions since RabbitMQ does not implement inter-node communication nor does it implement TLS or their combination."

All the setup used and some logs are in the discussion, but I ended up finding some interesting information about what happens when the RabbitMQ nodes lose connectivity to each other and report a network partition.

This can be seen in the status of connections, as reported by _netstat_:
*rabbit 172.17.0.3*
{code}
Proto  Recv-Q   Send-Q   Local Address     Foreign Address    State        PID/Program name
tcp    4341308  3012664  172.17.0.3:25672  172.17.0.4:43579   ESTABLISHED  15317/beam.smp
{code}

*rabbit 172.17.0.4*
{code}
Proto  Recv-Q   Send-Q   Local Address      Foreign Address   State        PID/Program name
tcp    4965212  4006904  172.17.0.4:43579   172.17.0.3:25672  ESTABLISHED  15545/beam.smp
{code}

By running _rabbit_diagnostics:maybe_stuck()_, I get these processes (there's another one which I'm excluding because it also shows up in normal operation):
{code}
2018-05-09 10:32:18 [{pid,<9322.614.0>},
                     {registered_name,[]},
                     {current_stacktrace,
                         [{prim_inet,send,3,[]},
                          {ssl_connection,write_application_data,3,
                              [{file,"ssl_connection.erl"},{line,390}]},
                          {ssl_connection,connection,4,
                              [{file,"ssl_connection.erl"},{line,920}]},
                          {gen_statem,call_state_function,5,
                              [{file,"gen_statem.erl"},{line,1633}]},
                          {gen_statem,loop_event_state_function,6,
                              [{file,"gen_statem.erl"},{line,1023}]},
                          {tls_connection,init,1,
                              [{file,"tls_connection.erl"},{line,119}]},
                          {proc_lib,init_p_do_apply,3,
                              [{file,"proc_lib.erl"},{line,247}]}]},
                     {initial_call,{proc_lib,init_p,5}},
                     {message_queue_len,1},
                     {links,[<9322.53.0>,#Port<9322.29167>]},
                     {monitors,[{process,<9322.615.0>}]},
                     {monitored_by,[<9322.615.0>]},
                     {heap_size,75113}]
2018-05-09 10:32:18 [{pid,<9322.615.0>},
                     {registered_name,[]},
                     {current_stacktrace,
                         [{gen,do_call,4,[{file,"gen.erl"},{line,169}]},
                          {gen_statem,call_dirty,4,
                              [{file,"gen_statem.erl"},{line,591}]},
                          {ssl_connection,call,2,
                              [{file,"ssl_connection.erl"},{line,1961}]},
                          {ssl_tls_dist_proxy,loop_conn,2,
                              [{file,"ssl_tls_dist_proxy.erl"},{line,364}]}]},
                     {initial_call,{erlang,apply,2}},
                     {message_queue_len,18},
                     {links,[<9322.59.0>]},
                     {monitors,[{process,<9322.614.0>}]},
                     {monitored_by,[<9322.614.0>]},
                     {heap_size,233}]
{code}

I checked _ssl_connection.erl_ and noticed the following comment on line 916, right before the call to _write_application_data_ on line 920:
{code}
    %% We should look into having a worker process to do this to 
    %% parallize send and receive decoding and not block the receiver
    %% if sending is overloading the socket.
{code}
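To illustrate the idea that comment describes, here is a minimal, hypothetical sketch of the pattern (module and function names are made up for illustration; this is not any particular patch): socket writes are handed to a dedicated process, so a full send buffer can only ever block that helper, never the process that reads and decodes incoming records.
{code}
%% Hypothetical sketch only: decouple socket writes from the reader.
-module(decoupled_sender).
-export([start_link/1, send/2]).

start_link(Socket) ->
    {ok, spawn_link(fun() -> loop(Socket) end)}.

%% Asynchronous from the caller's point of view: the connection process
%% queues the data and immediately goes back to reading/decoding.
send(Sender, Data) ->
    Sender ! {send, Data},
    ok.

loop(Socket) ->
    receive
        {send, Data} ->
            %% Only this process can get stuck in a blocking send.
            case gen_tcp:send(Socket, Data) of
                ok              -> loop(Socket);
                {error, Reason} -> exit({send_failed, Reason})
            end
    end.
{code}
The trade-off is that send results become asynchronous, so errors have to surface through the link (or a monitor) rather than the return value.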

It seems this blocking of the receiver while sending really is the cause...
After porting a patch from [https://github.com/basho/otp/commit/e237383fe002af4f63f8419b35bac6f435856a38] into the latest Erlang/OTP 20.3.6, the network partition only happened once after I sent A LOT of big messages (and even then the network connections didn't get stalled with data anymore).

I'm far from being an expert on Erlang, so I don't know if that patch is adequate or not; it just helped me to find the (very likely) cause of the issue.
Is there any chance of this commit (or something more adequate, if needed) being integrated into version 20.3 and newer?

----
[~ingela] in ERL-561 you mentioned
{quote}It looks like a dead lock that can happen when the same process is sending and receiving. I do not remember all the details but mainly prim_inet:send hangs as the receive buffer is full.
[...]
We actually have a ticket to deal with this that has not been prioritized.
{quote}
Which ticket is that? Do you think ERL-561 is in any way related to the current issue?
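For what it's worth, here is a rough, hypothetical sketch of the traffic pattern that quote describes, reduced to a single VM: both peers keep sending large records and neither drains its socket, so each connection process eventually blocks in prim_inet:send and stops decoding incoming data. Certificate paths are placeholders, ssl:handshake/1 is the OTP 21 name (ssl:ssl_accept/1 on OTP 20), and this is not my actual RabbitMQ setup.
{code}
%% Hypothetical reproduction sketch: two TLS peers in one VM, both only
%% sending, neither ever reading.
-module(tls_flood).
-export([run/0]).

run() ->
    {ok, _} = application:ensure_all_started(ssl),
    ServerOpts = [binary,
                  {certfile, "cert.pem"},   %% placeholder
                  {keyfile,  "key.pem"},    %% placeholder
                  {reuseaddr, true}],
    {ok, Listen} = ssl:listen(9999, ServerOpts),
    spawn_link(fun() ->
                       {ok, T} = ssl:transport_accept(Listen),
                       {ok, S} = ssl:handshake(T),
                       flood(S)
               end),
    {ok, C} = ssl:connect("localhost", 9999, [binary], 5000),
    flood(C).

%% Push 5 MB records forever without ever receiving. Once the TCP send
%% and receive buffers fill in both directions, each ssl connection
%% process sits in prim_inet:send/3 and no longer reads its socket, so
%% neither side can make progress.
flood(Sock) ->
    ok = ssl:send(Sock, binary:copy(<<0>>, 5 * 1024 * 1024)),
    flood(Sock).
{code}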
OTP-Maintainer commented 6 years ago

ingela said:

The ticket is internal, that is, it has never been a reported problem. We are aware that you can solve the problem by having two different processes doing send and receive. However, the implications on the handling of cipher states and renegotiation must be considered, or a proxy process introduced. Some concerns were raised that the proxy solution would create latency. So investigation of how to best solve it is needed, and this was never prioritized.

So, looking further into this, we think that the proposed patch only treats the symptom, but we have another idea on how to solve the problem properly.
OTP-Maintainer commented 6 years ago

ingela said:

I have now created a branch with what we think is the proper solution. However, it is not release-ready yet, but I could publish it for testing. Would you be able to recreate the problem so that you could verify the patch? We will be working on some kind of test too, but will be preoccupied with Code BEAM the rest of this week.
OTP-Maintainer commented 6 years ago

jmf_erl said:

Thank you [~ingela], if you point me to that branch I can quickly test it!
OTP-Maintainer commented 6 years ago

ingela said:

https://github.com/IngelaAndin/otp/tree/ingela/ssl/dead-lock/ERL-622
OTP-Maintainer commented 6 years ago

ingela said:

I had forgotten to export gen_tcp:send/3, force-pushed my branch just now!
OTP-Maintainer commented 6 years ago

jmf_erl said:

When I looked at the branch yesterday, I noticed it was based on erlang:master.
I would like to check if the problem is fixed by running an end-to-end system test, which requires a running RabbitMQ server.
RabbitMQ requires Erlang/OTP v20.x; it doesn't seem to be compatible with v21 yet...
I will try to apply your changes to the erlang:maint-20 branch - do you think it is possible and easy to do?
If I'm not able to rebase those changes onto 20.x, would it take much time for you to do it?

Thank you in advance!
OTP-Maintainer commented 6 years ago

ingela said:

Will look into it on Monday! Spent the end of the week at Code BEAM Stockholm.
OTP-Maintainer commented 6 years ago

ingela said:

Here is a maint-20 based branch, ingela/maint/ssl/dead-lock/ERL-622 (that is, based on the latest 20 version), in the same repo as last time.
OTP-Maintainer commented 6 years ago

jmf_erl said:

Thank you!

Ok, I initially used RabbitMQ 3.6.14 with Erlang 20.1 to make sure that the problem was still occurring on my current machine.
After confirming that, I compiled the version in your maint-20 branch and restarted RabbitMQ.

Having confirmed that the version of Erlang reported was now 20.3.6, I sent a couple of messages like before.
Unfortunately, after a couple of seconds the RabbitMQ management interface reported a network partition, and I confirmed with netstat the same problem as before:
{code}
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State   PID/Program name
tcp   761444 374672 172.17.0.4:50277            172.17.0.3:25672            ESTABLISHED 26323/beam.smp
{code}
The TCP queues are stuck for _beam.smp_ and will stay like that indefinitely...

It gets stuck in (roughly) the same places as before:
{code}
2018-06-04 19:16:41 [{pid,<11297.88.0>},
                     {registered_name,[]},
                     {current_stacktrace,
                         [{gen,do_call,4,[{file,"gen.erl"},{line,169}]},
                          {gen_statem,call_dirty,4,
                              [{file,"gen_statem.erl"},{line,591}]},
                          {ssl_connection,call,2,
                              [{file,"ssl_connection.erl"},{line,1966}]},
                          {ssl_tls_dist_proxy,loop_conn,2,
                              [{file,"ssl_tls_dist_proxy.erl"},{line,364}]}]},
                     {initial_call,{erlang,apply,2}},
                     {message_queue_len,12},
                     {links,[#Port<11297.3914>,<11297.51.0>]},
                     {monitors,[{process,<11297.90.0>}]},
                     {monitored_by,[<11297.90.0>]},
                     {heap_size,233}]
2018-06-04 19:16:41 [{pid,<11297.90.0>},
                     {registered_name,[]},
                     {current_stacktrace,
                         [{prim_inet,send,3,[]},
                          {tls_socket,send,4,
                              [{file,"tls_socket.erl"},{line,48}]},
                          {tls_connection,send_application_data,5,
                              [{file,"tls_connection.erl"},{line,383}]},
                          {ssl_connection,write_application_data,3,
                              [{file,"ssl_connection.erl"},{line,392}]},
                          {ssl_connection,connection,4,
                              [{file,"ssl_connection.erl"},{line,925}]},
                          {gen_statem,call_state_function,5,
                              [{file,"gen_statem.erl"},{line,1633}]},
                          {gen_statem,loop_event_state_function,6,
                              [{file,"gen_statem.erl"},{line,1023}]},
                          {tls_connection,init,1,
                              [{file,"tls_connection.erl"},{line,119}]}]},
                     {initial_call,{proc_lib,init_p,5}},
                     {message_queue_len,35728},
                     {links,[<11297.53.0>,#Port<11297.3682>]},
                     {monitors,[{process,<11297.88.0>}]},
                     {monitored_by,[<11297.88.0>]},
                     {heap_size,514838}]
{code}

Tell me if you need more information....
OTP-Maintainer commented 6 years ago

ingela said:

Humm ... I did not know it was the Erlang distribution over TLS you were using. Alas, I think this patch will not be enough for the distribution in OTP 20. I think it will work in OTP 21, when the distribution is rewritten. But in 20 it is more complicated, with more sockets involved, which may be the underlying cause of the problem.
OTP-Maintainer commented 6 years ago

ingela said:

After trying to write a test case for this, I alas think that my proposed solution will not work. So we will not include it in 21. Of course we will continue trying to find a real solution to this problem.
OTP-Maintainer commented 6 years ago

jmf_erl said:

Thanks for checking this issue, I've been doing some testing around it too.

First of all, I checked the possibility of using your patch on version 21, but it seems version 21 isn't supported yet by RabbitMQ and some of the libraries it requires.

I also created a test replicating the problematic scenario:
* 4 RabbitMQ servers, 2 on each of 2 datacenters, all connections configured with TLS encryption (all servers in Docker containers);
* 5 publishers in each datacenter; each publisher sends 10 messages (one big 5 MB message + 9 messages of 100 KB each);
* each publisher sends its messages at 1-second intervals; all publishers work independently (we can have 10 messages being sent simultaneously).
If you need more information for your own testing, I can provide you the details.
The versions used were RabbitMQ 3.6.14 and Erlang 20.3.7.
This test was used to compare Erlang with and without the patch I mentioned in the description for this issue (from Basho, which they used in Riak).

First I ran the test 200 times: without the patch it failed 194 times, with the patch it failed 2 times.
As I also mentioned, _netstat_ reports that without the patch the TCP send/receive queues get stuck indefinitely, but with the patch the data from the queues is eventually handled.
I increased the kernel net_ticktime from 60 (default) to 120 seconds, to give the servers some time to distribute the big message and then send the net ticks.
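(For reference, a minimal sketch of how that tick time can be raised, assuming it is passed through the standard Erlang VM configuration; where exactly RabbitMQ picks this up from may differ in your setup.)
{code}
%% As an emulator flag:
%%   erl -kernel net_ticktime 120
%%
%% or as a kernel application setting in a sys.config-style file:
[{kernel, [{net_ticktime, 120}]}].
{code}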

Then I ran the test again without the patch 300 times; it failed 272 times.
I also ran it again 600 times with the patch, and there were no failures.

Given these results, do you think it is worth the effort to take another look at that patch?
It may not be the best solution, but it appears to work, and perhaps it's the best that can be done for now in Erlang v20 (where I guess there won't be any major changes in SSL/TLS/distribution code that might address this problem another way).
OTP-Maintainer commented 6 years ago

ingela said:

Humm ... my patch was based on the assumption that I would get {error, busy} if the send call would block and no data would be sent, but I do not think this assumption holds. Maybe my patch changes the timing, making the problem less likely to occur. I might have another idea though... If you could use our coverage tool during your test to see if {error, busy} is ever returned, that could be interesting. I cannot get that behaviour. Get back to you later.
OTP-Maintainer commented 6 years ago

ingela said:

I have almost finished a solution, but as it is vacation time, please be patient.
OTP-Maintainer commented 6 years ago

ingela said:

Hello, thank you for your patience.

I now have a branch that I hope you can test. It is based on the maint branch, that is, OTP-21.
We may backport it to 20 if it works out as we hope.

https://github.com/IngelaAndin/otp/tree/ingela/ssl/send-recv-dead-lock/ERL-622
OTP-Maintainer commented 6 years ago

ingela said:

Made a few adjustments during the day, so make sure to download the latest version. The top commit should be 0c2e71f7f4f046afb04b597e04e9e24ea0a458ee
OTP-Maintainer commented 6 years ago

ingela said:

Actually, dialyzer triggered another update; new sha 72e77c58ae614e6a368590f445ed7c4e669037a0
OTP-Maintainer commented 6 years ago

urbanserj said:

Our [project|https://github.com/dcos/dcos-net] has been suffering from this issue a couple of times every week with OTP 18+. In our case, all nodes in the cluster send and receive 10MB+ binary blobs through the Erlang distribution protocol. As a workaround for that bug, I've recently added a [killer|https://github.com/dcos/dcos-net/blob/7ad311b/apps/dcos_net/src/dcos_net_killer.erl] for stuck processes.

> We are aware that you can solve the problem by having two different processes doing send and receive.

I'm not sure that it will resolve the issue completely. From my experience, a TCP client socket can get stuck even if the TCP server doesn't send any data at all (by protocol design). My suggestion is to set proper timeouts for all TCP sockets: gen_tcp:connect/4 with a Timeout, {send_timeout, Timeout} and, optionally, {send_timeout_close, boolean()}.
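A minimal sketch of that option combination (timeout values are examples only): the connect itself is bounded by the fourth argument, a peer that stops reading makes gen_tcp:send/2 return {error, timeout} instead of blocking forever, and send_timeout_close also closes the socket when that happens.
{code}
%% Sketch of the suggested socket options; values are examples.
connect_with_timeouts(Host, Port) ->
    gen_tcp:connect(Host, Port,
                    [binary,
                     {send_timeout, 30000},        %% send gives up after 30 s
                     {send_timeout_close, true}],  %% and the socket is closed
                    15000).                        %% connect timeout (ms)
{code}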
OTP-Maintainer commented 6 years ago

ingela said:

Granted, this solution will not prevent the sender process from hanging; however, it will prevent the deadlock, and the blocked sender can be killed by the reader process if it, for example, does not receive ticks. It would be valuable if you could test my branch and see how it works out for you. If you like, you could perhaps still have your killer and observe whether it is invoked or not. We think that my branch also has other benefits for the ssl application, especially for Erlang distribution over TLS, and we will release it as is. We will consider gen_tcp timeouts for future improvements if there seems to be a need and you could further explain the problem; we are not sure that we understand it fully.
OTP-Maintainer commented 6 years ago

ingela said:

Well, after thinking about this further I decided to rework the solution a little; that is, I am no longer assuming that the inet driver will be graceful. So I pushed a new version of my branch with a new close-down handling commit. Notice that the two commits before it are a little reworked too. This is looking good in our builds and will most probably make 21.1.

Hope you will like it too!
OTP-Maintainer commented 6 years ago

ingela said:

This has now been merged, as it is at least an improvement. It solves the deadlock and makes sure that the tls_sender process always dies if the connection is closed. If it should turn out you still have some problem, you are welcome to file a new report. I gather you have a workaround for 20, so currently we have no plans to patch OTP 20.
OTP-Maintainer commented 6 years ago

mikael pettersson said:

We had been hoping to run Erlang distribution over TLS with OTP-20, but I gather from this ticket that doing so wouldn't be advisable?
OTP-Maintainer commented 6 years ago

ingela said:

No, it would not be advisable. We might consider back-porting it to 20 if it is verified that the above problem is solved by the patch. I think it should be, but I do not have any test setup for it. Of course we have tests that verify the new code is working, and we think it has benefits regardless of the deadlock problem.
OTP-Maintainer commented 6 years ago

jmf_erl said:

Hi @Ingela
Thanks for checking this out! I've just returned from vacation; I'll try to find some time during this week to check if this patch solves the problem I described.
I'll test it and post a comment here with the results.
OTP-Maintainer commented 6 years ago

jesperes said:

Our strategy to migrate our cluster to TLS has relied on the proxy-based solution employed prior to OTP 21. We have modified the {{inet_tls_dist}} module to allow both TLS and non-TLS connections by exposing the internal proxy socket as well as the TLS socket. This solution would, as far as I can tell, not be immediately portable to OTP 21, as there is no longer a proxy to expose. Will a backport of this fix to OTP 20 cause OTP 20 to be proxy-less as well?
OTP-Maintainer commented 6 years ago

ingela said:

[~jesperes] Yes, if we backport this solution we would have to backport that change too; otherwise it would be like implementing a new solution for OTP 20, which Ericsson has no interest in. Ericsson has an interest in networks using both TLS and non-TLS connections, although we do not yet have a ready solution.
OTP-Maintainer commented 6 years ago

jesperes said:

We would certainly be interested in testing or giving feedback on any such solution as early as possible.
OTP-Maintainer commented 6 years ago

urbanserj said:

Hi [~ingela],

I have been testing disterl over TLS with the {{maint}} branch (sha: d6a0884) and it looks like I hit another bug.

I start a node with a very simple {{tls.config}} (see below) and two remote shells:

{code}
erl -proto_dist inet_tls -ssl_dist_optfile ./tls.config -setcookie minuteman -name navstar@127.0.0.1
erl -proto_dist inet_tls -ssl_dist_optfile ./tls.config -setcookie minuteman -name remsh-a@127.0.0.1 -remsh navstar@127.0.0.1 -hidden
erl -proto_dist inet_tls -ssl_dist_optfile ./tls.config -setcookie minuteman -name remsh-b@127.0.0.1 -remsh navstar@127.0.0.1 -hidden
{code}

After a few restarts of {{remsh-b}} both remsh nodes get the following error:

{code}
*** ERROR: Shell process terminated! (^G to start new job) ***
{code}

Here is my tls.config:

{code}
[
  {server,
    [
      {certfile, "/run/dcos/pki/tls/certs/dcos-net.crt"},
      {keyfile, "/run/dcos/pki/tls/private/dcos-net.key"},
      {cacertfile, "/run/dcos/pki/CA/ca-bundle.crt"}
    ]
  },
  {client,
    [
      {certfile, "/run/dcos/pki/tls/certs/dcos-net.crt"},
      {keyfile, "/run/dcos/pki/tls/private/dcos-net.key"},
      {cacertfile, "/run/dcos/pki/CA/ca-bundle.crt"}
    ]
  }
].
{code}

I reverted {{ingela/ssl/send-recv-dead-lock/ERL-622}} (sha: 96aa684) from {{maint}} and it works just fine. 

I would also like to share that my workaround of killing stuck processes doesn't work well: the killer terminates stuck processes, but it doesn't close their file descriptors (I still don't know why). netstat shows tons of CLOSE_WAIT localhost-to-localhost connections; both source and destination ports of all connections are ephemeral. Our current (and very bad) workaround is killing the whole VM if the killer detects any processes that got stuck.

May I kindly ask you to take a look at that issue and back port the patch to OTP 20? I’d really appreciate it.

Thank you in advance.
OTP-Maintainer commented 6 years ago

ingela said:

I think it is a link that was created in the wrong place; could you try this (it works for me):

{code}
diff --git a/lib/ssl/src/tls_connection.erl b/lib/ssl/src/tls_connection.erl
index 6c7511d..2fde17a 100644
--- a/lib/ssl/src/tls_connection.erl
+++ b/lib/ssl/src/tls_connection.erl
@@ -118,6 +118,7 @@ start_link(Role, Sender, Host, Port, Socket, Options, User, CbInfo) ->

 init([Role, Sender, Host, Port, Socket, {SslOpts, _, _} = Options,  User, CbInfo]) ->
     process_flag(trap_exit, true),
+    link(Sender),
     case SslOpts#ssl_options.erl_dist of
         true ->
             process_flag(priority, max);
diff --git a/lib/ssl/src/tls_sender.erl b/lib/ssl/src/tls_sender.erl
index 007fd34..db67d7d 100644
--- a/lib/ssl/src/tls_sender.erl
+++ b/lib/ssl/src/tls_sender.erl
@@ -67,9 +67,9 @@
 %%  same process is sending and receiving 
 %%--------------------------------------------------------------------
 start() ->
-    gen_statem:start_link(?MODULE, [], []).
+    gen_statem:start(?MODULE, [], []).
 start(SpawnOpts) ->
-    gen_statem:start_link(?MODULE, [], SpawnOpts).
+    gen_statem:start(?MODULE, [], SpawnOpts).

 %%--------------------------------------------------------------------
 -spec initialize(pid(), map()) -> ok. 
{code}
OTP-Maintainer commented 6 years ago

urbanserj said:

[~ingela] Thank you very much for the prompt fix! I've patched the {{maint}} branch and it works fine.

Would it be possible to back port tls_sender to OTP 20?
OTP-Maintainer commented 6 years ago

ingela said:

Thank you for testing. We still have some work to do on our automated tests for TLS distribution; it is a work in progress that is non-trivial.

Everything is possible, it is just a question of priorities. And in this case it would require back-porting some other commits too, as the new sender process will not fit into the old proxy distribution solution. In this case we think it is probably worth it, as it increases usability a lot. I cannot promise when; OTP-21.1 has the highest priority, of course.
OTP-Maintainer commented 5 years ago

jmf_erl said:

@Ingela
Thank you for your support, I have finally managed to test the fix.
Using the latest release, Erlang/OTP 21.1, I tested RabbitMQ 3.7.8 configured for inter-node SSL, and the problem I described in the ticket doesn't occur anymore :)
OTP-Maintainer commented 5 years ago

ingela said:

Hello [~jmf_erl], it is very nice that you have confirmed my patch :) It was quite a big rewrite, and in OTP-21.2 there will be a little fix for the patch as, alas, there was a small error in it, ERL-770. I have looked into back-porting the patch to OTP 20, but alas the distribution part cannot be back-ported to 20 as it uses new runtime system features of OTP 21. So fixing this in OTP 20 would require an entirely different implementation for the distribution part, and I am afraid upgrading is your best alternative.