J0 / phoenix_gen_socket_client

Socket client behaviour for phoenix channels

Disconnected from Phoenix channel when sending multiple large messages in a small time frame #25

Closed: jwazne closed this issue 5 years ago

jwazne commented 6 years ago

Hello,

First of all I'd like to say good job for this client, we are using it to communicate between Phoenix apps.

All is working fine but we are now facing performance issues...

### Context

Environment

Broker app:

Client app:

### Problem

When sending multiple messages over the channel, for example in a loop or with Enum.each, only a few messages make it to the broker channel's handle_in function. Then the client crashes with the following error:

21:45:58.027 [error] State machine #PID<0.471.0> terminating Last message in was {:send_frame, {:binary, <<31, 139, 8, 0, 0, 0, 0, 0, 0, 3, 188, 187, 231, 206, 235, 74, 146, 45, 248, 42, 131, 51, 63, 217, 183, 232, 93, 253, 163, 104, 69, 47, 209, 115, 106, 80, 160, 247, 222, 179, 113, 223, 125, 184, 247, 62, 231, 84, ...>>}} When State == :connected Data == {:context, {:websocket_req, :wss, 'broker.dexyde.com', 443, '/socket/websocket?user_token=SFMyNTY.g3QAAAACZAAEZGF0YW4HAIClCS9hXAVkAAZzaWduZWRuBgC9YkVUXwE.ZRbhkBm4BxzGEXLynPpfXDuGTcuZQMdhgcJszmZd3zo', 30000, :undefined, 1, {:sslsocket, {:gen_tcp, #Port<0.20422>, :tls_connection, :undefined}, #PID<0.476.0>}, {:transport, :ssl, :ssl, :ssl_closed, :ssl_error, [mode: :binary, active: true, verify: :verify_none, packet: 0]}, "12krm8DaxFlr/ZuaS8p6Uw==", 139669, 1, 2, :undefined, :undefined}, {:transport, :ssl, :ssl, :ssl_closed, :ssl_error, [mode: :binary, active: true, verify: :verify_none, packet: 0]}, [], {:wss, 'broker.dexyde.com', 443, '/socket/websocket?user_token=SFMyNTY.g3QAAAACZAAEZGF0YW4HAIClCS9hXAVkAAZzaWduZWRuBgC9YkVUXwE.ZRbhkBm4BxzGEXLynPpfXDuGTcuZQMdhgcJszmZd3zo'}, {Phoenix.Channels.GenSocketClient.Transport.WebSocketClient, %{keepalive: 30000, socket: #PID<0.469.0>}}, <<31, 139, 8, 0, 0, 0, 0, 0, 0, 3, 188, 187, 231, 206, 235, 90, 118, 45, 248, 42, 141, 211, 63, 105, 155, 57, 213, 63, 138, 81, 204, 18, 51, 187, 26, 5, 230, 156, 51, 141, 251, 238, 151, ...>>, false, :undefined, 0} Reason for termination = {:error, :closed}


We are using binary messages with a custom serializer on the broker side. The observed size of the payload, before zipping it, is between 500K and 1M characters.
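For illustration, here is a rough sketch of the kind of loop that triggers the crash, assuming the push/4 API of Phoenix.Channels.GenSocketClient and a transport handle received in the client callbacks; the topic, event, and variable names are made up, not taken from our code:

```elixir
alias Phoenix.Channels.GenSocketClient

Enum.each(payloads, fn payload ->
  # each payload is roughly 500K-1M characters before compression;
  # the custom serializer turns it into a gzipped binary frame
  {:ok, _ref} = GenSocketClient.push(transport, "data:sync", "large_msg", payload)
end)
```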

### Tests done
- Adding a :timer.sleep(3000) between each push from the client (sketched below) works: the messages then arrive in the broker's handle_in function and it can reply to the client's handle_reply.
- Reducing the size of the message payload also seems to work.
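The first workaround, applied to the hypothetical loop sketched above: throttle the pushes so each message can be handled and replied to before the next one goes out.

```elixir
Enum.each(payloads, fn payload ->
  {:ok, _ref} = GenSocketClient.push(transport, "data:sync", "large_msg", payload)
  :timer.sleep(3000)
end)
```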

### Our conclusion...so far
It seems that when a client app sends multiple large messages over the channel in a small time frame, it gets disconnected from the broker channel.

### Questions

- Has anybody already experienced something similar when sending many large messages in a small time frame?
- Are there any known channel or socket limitations to be aware of?

Any thoughts or remarks on this subject are most welcome.

Many thanks for your help.

Regards,

Julien
sasa1977 commented 6 years ago

It's hard to say without a reproducible example. Can you somehow make a simple project which demonstrates the problem? If yes, we'd be happy to debug.

Without an example, and based on your description, my first hunch is that a netsplit occurs because of long deserialization time on the Phoenix side. At our company, we actually have a similar use case to yours. We occasionally send some large messages, and originally we used a custom JSON/gzip serializer. At some point we noticed various problems with very large messages, though I can't recall if the symptom was exactly the same as yours.

The thing is that, on the Phoenix side, messages are deserialized in the socket process, so long deserialization of large messages blocks the entire communication, which might provoke a netsplit. Our solution boiled down to two things:

- With our new approach, compression/decompression is not done in the socket process, which vastly reduced the decoding time in the socket process.
- Since the payload is compressed, we went a step further and switched the serializer to be based on :erlang.term_to_binary and :erlang.binary_to_term, which work pretty fast, even for larger payloads.
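To make the second point concrete, here is a minimal sketch of the encode/decode idea. It deliberately ignores the exact Phoenix serializer behaviour (its callbacks differ between Phoenix versions), and the module name and function heads are illustrative, not our actual code. The important part is that the websocket frame is encoded with :erlang.term_to_binary/1 while the payload inside stays gzipped.

```elixir
defmodule MyApp.BinarySerializer do
  # message map -> binary frame sent over the websocket;
  # the payload is already gzipped by whoever built the message
  def encode(%{topic: _, event: _, payload: _gzipped} = message) do
    :erlang.term_to_binary(message)
  end

  # binary frame -> message map; the payload inside is still gzipped
  # (consider the [:safe] option of binary_to_term/2 for untrusted input)
  def decode(frame) when is_binary(frame) do
    :erlang.binary_to_term(frame)
  end
end
```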

So on the Phoenix side, message processing now goes as follows:

  1. The serializer decodes the message (using :binary_to_term).
  2. The payload of the message is still encoded (gzipped), but the message is now forwarded to the channel process.
  3. The channel process can now unzip the message without blocking the socket process.
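A hedged sketch of step 3, assuming the compressed payload holds a term_to_binary-encoded term; the channel module, event name, and payload shape are illustrative:

```elixir
defmodule MyApp.DataChannel do
  use Phoenix.Channel

  def join(_topic, _params, socket), do: {:ok, socket}

  # the socket process has already decoded the envelope; only the channel
  # process pays the cost of unzipping the large payload
  def handle_in("large_msg", %{payload: gzipped}, socket) when is_binary(gzipped) do
    data = gzipped |> :zlib.gunzip() |> :erlang.binary_to_term()
    # ... work with `data` ...
    {:reply, :ok, socket}
  end
end
```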

We actually went a step further, and uncompressed the message in a separate GenServer. We did this to keep the channel responsive to other messages.
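For that last step, something along these lines could work; this is a rough sketch with illustrative names, not the implementation we use. The channel hands the compressed binary to the GenServer and later receives the decompressed term in handle_info:

```elixir
defmodule MyApp.Decompressor do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # called from the channel; reply_to is typically self() (the channel pid)
  def decompress(server, gzipped, reply_to) do
    GenServer.cast(server, {:decompress, gzipped, reply_to})
  end

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_cast({:decompress, gzipped, reply_to}, state) do
    data = gzipped |> :zlib.gunzip() |> :erlang.binary_to_term()
    send(reply_to, {:decompressed, data})
    {:noreply, state}
  end
end
```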

With those changes, we were able to significantly reduce processing time in the socket process, and in the channel process. As a result, these processes were much more responsive.

I'm not sure if this would fix your problems, but you could give it a try.