J0 / phoenix_gen_socket_client

Socket client behaviour for phoenix channels

Disconnected from Phoenix channel when sending multiple large messages in a small time frame #25

Closed: jwazne closed this issue 5 years ago

jwazne commented 6 years ago

Hello,

First of all I'd like to say good job for this client, we are using it to communicate between Phoenix apps.

All is working fine but we are now facing performance issues...

### Context

Environment

Broker app:

Client app:

### Problem

When sending multiple messages over the channel, for example in a loop or with Enum.each, only a few messages make it to the broker channel's handle_in function. Then the client crashes with the following error:

21:45:58.027 [error] State machine #PID<0.471.0> terminating Last message in was {:send_frame, {:binary, <<31, 139, 8, 0, 0, 0, 0, 0, 0, 3, 188, 187, 231, 206, 235, 74, 146, 45, 248, 42, 131, 51, 63, 217, 183, 232, 93, 253, 163, 104, 69, 47, 209, 115, 106, 80, 160, 247, 222, 179, 113, 223, 125, 184, 247, 62, 231, 84, ...>>}} When State == :connected Data == {:context, {:websocket_req, :wss, 'broker.dexyde.com', 443, '/socket/websocket?user_token=SFMyNTY.g3QAAAACZAAEZGF0YW4HAIClCS9hXAVkAAZzaWduZWRuBgC9YkVUXwE.ZRbhkBm4BxzGEXLynPpfXDuGTcuZQMdhgcJszmZd3zo', 30000, :undefined, 1, {:sslsocket, {:gen_tcp, #Port<0.20422>, :tls_connection, :undefined}, #PID<0.476.0>}, {:transport, :ssl, :ssl, :ssl_closed, :ssl_error, [mode: :binary, active: true, verify: :verify_none, packet: 0]}, "12krm8DaxFlr/ZuaS8p6Uw==", 139669, 1, 2, :undefined, :undefined}, {:transport, :ssl, :ssl, :ssl_closed, :ssl_error, [mode: :binary, active: true, verify: :verify_none, packet: 0]}, [], {:wss, 'broker.dexyde.com', 443, '/socket/websocket?user_token=SFMyNTY.g3QAAAACZAAEZGF0YW4HAIClCS9hXAVkAAZzaWduZWRuBgC9YkVUXwE.ZRbhkBm4BxzGEXLynPpfXDuGTcuZQMdhgcJszmZd3zo'}, {Phoenix.Channels.GenSocketClient.Transport.WebSocketClient, %{keepalive: 30000, socket: #PID<0.469.0>}}, <<31, 139, 8, 0, 0, 0, 0, 0, 0, 3, 188, 187, 231, 206, 235, 90, 118, 45, 248, 42, 141, 211, 63, 105, 155, 57, 213, 63, 138, 81, 204, 18, 51, 187, 26, 5, 230, 156, 51, 141, 251, 238, 151, ...>>, false, :undefined, 0} Reason for termination = {:error, :closed}


We are using binary messages with a custom serializer on the broker side. The observed size of the payload, before zipping it, is between 500K and 1M characters.
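For illustration, here is a rough sketch of the kind of loop that triggers the crash, assuming the push/4 API of Phoenix.Channels.GenSocketClient and a transport handle received in the client callbacks; the topic, event, and variable names are made up, not taken from our code:

```elixir
alias Phoenix.Channels.GenSocketClient

Enum.each(payloads, fn payload ->
  # each payload is roughly 500K-1M characters before compression;
  # the custom serializer turns it into a gzipped binary frame
  {:ok, _ref} = GenSocketClient.push(transport, "data:sync", "large_msg", payload)
end)
```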

### Tests done
- Adding a :timer.sleep(3000) between each push from the client (sketched below) works: the messages then arrive in the broker's handle_in function and it can reply to the client's handle_reply.
- Reducing the size of the message payload also seems to work.
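The first workaround, applied to the hypothetical loop sketched above: throttle the pushes so each message can be handled and replied to before the next one goes out.

```elixir
Enum.each(payloads, fn payload ->
  {:ok, _ref} = GenSocketClient.push(transport, "data:sync", "large_msg", payload)
  :timer.sleep(3000)
end)
```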

### Our conclusion...so far
It seems that when a client app sends multiple large messages over the channel in a small time frame, it gets disconnected from the broker channel.

### Questions

- Has anybody already experienced something similar when sending many large messages in a small time frame?
- Are there any known channel or socket limitations to be aware of?

Any thoughts or remarks on this subject are most welcome.

Many thanks for your help.

Regards,

Julien
sasa1977 commented 6 years ago

It's hard to say without a reproducible example. Can you somehow make a simple project which demonstrates the problem? If yes, we'd be happy to debug.

Without an example, and based on your description, my first hunch is that a netsplit occurs because of long deserialization time on the Phoenix side. At our company, we actually have a similar use case to yours. We occasionally send some large messages, and originally we used a custom JSON/gzip serializer. At some point we noticed various problems with very large messages, though I can't recall if the symptom was exactly the same as yours.

The thing is that, on the Phoenix side, messages are deserialized in the socket process, so long deserialization of large messages blocks the entire communication, which might provoke a netsplit. Our solution boiled down to two things:

- With our new approach, compression/decompression is not done in the socket process, which vastly reduced the decoding time in the socket process.
- Since the payload is compressed, we went a step further and switched the serializer to be based on :erlang.term_to_binary and :erlang.binary_to_term, which work pretty fast, even for larger payloads.
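To make the second point concrete, here is a minimal sketch of the encode/decode idea. It deliberately ignores the exact Phoenix serializer behaviour (its callbacks differ between Phoenix versions), and the module name and function heads are illustrative, not our actual code. The important part is that the websocket frame is encoded with :erlang.term_to_binary/1 while the payload inside stays gzipped.

```elixir
defmodule MyApp.BinarySerializer do
  # message map -> binary frame sent over the websocket;
  # the payload is already gzipped by whoever built the message
  def encode(%{topic: _, event: _, payload: _gzipped} = message) do
    :erlang.term_to_binary(message)
  end

  # binary frame -> message map; the payload inside is still gzipped
  # (consider the [:safe] option of binary_to_term/2 for untrusted input)
  def decode(frame) when is_binary(frame) do
    :erlang.binary_to_term(frame)
  end
end
```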

So on the Phoenix side, message processing now goes as follows:

  1. The serializer decodes the message (using :binary_to_term).
  2. The payload of the message is still encoded (gzipped), but the message is now forwarded to the channel process.
  3. The channel process can now unzip the message without blocking the socket process.
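A hedged sketch of step 3, assuming the compressed payload holds a term_to_binary-encoded term; the channel module, event name, and payload shape are illustrative:

```elixir
defmodule MyApp.DataChannel do
  use Phoenix.Channel

  def join(_topic, _params, socket), do: {:ok, socket}

  # the socket process has already decoded the envelope; only the channel
  # process pays the cost of unzipping the large payload
  def handle_in("large_msg", %{payload: gzipped}, socket) when is_binary(gzipped) do
    data = gzipped |> :zlib.gunzip() |> :erlang.binary_to_term()
    # ... work with `data` ...
    {:reply, :ok, socket}
  end
end
```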

We actually went a step further, and uncompressed the message in a separate GenServer. We did this to keep the channel responsive to other messages.
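For that last step, something along these lines could work; this is a rough sketch with illustrative names, not the implementation we use. The channel hands the compressed binary to the GenServer and later receives the decompressed term in handle_info:

```elixir
defmodule MyApp.Decompressor do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # called from the channel; reply_to is typically self() (the channel pid)
  def decompress(server, gzipped, reply_to) do
    GenServer.cast(server, {:decompress, gzipped, reply_to})
  end

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_cast({:decompress, gzipped, reply_to}, state) do
    data = gzipped |> :zlib.gunzip() |> :erlang.binary_to_term()
    send(reply_to, {:decompressed, data})
    {:noreply, state}
  end
end
```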

With those changes, we were able to significantly reduce processing time in the socket process, and in the channel process. As a result, these processes were much more responsive.

I'm not sure if this would fix your problems, but you could give it a try.