NFIBrokerage / slipstream

A slick WebSocket client for Phoenix Channels
https://hex.pm/packages/slipstream
Apache License 2.0
155 stars 18 forks source link

Slipstream.Connection.Impl.decode_message/2 crashes Slipstream in OTP 26 #57

Closed guillego closed 1 year ago

guillego commented 1 year ago

I have a small toy application to test connectivity from a Slipstream client to a Phoenix channel.

(Using Slipstream 1.1.0)

The app works perfectly in OTP 25:

$ iex -S mix
Erlang/OTP 25 [erts-13.2.2.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

Compiling 1 file (.ex)

23:23:42.051 [info] Requesting connection to socket ws://localhost:4000/device_socket/websocket
Interactive Elixir (1.15.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)>
23:23:42.169 [info] Connected to channel, joining topic device:general

23:23:42.190 [info] Successfully joined device:general, sending HELLO

However, if I switch my environment to OTP 26, I get this crashing error that seems to happen when Slipstream attempts to decode a tcp message:

$ iex -S mix

23:30:05.621 [info] Requesting connection to socket ws://localhost:4000/device_socket/websocket
Erlang/OTP 26 [erts-14.0.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

Interactive Elixir (1.15.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)>
23:30:05.732 [info] Connected to channel, joining topic device:general

23:30:05.739 [error] GenServer #PID<0.196.0> terminating
** (FunctionClauseError) no function clause matching in Slipstream.Connection.Impl.decode_message/2
    (slipstream 1.1.0) lib/slipstream/connection/impl.ex:98: Slipstream.Connection.Impl.decode_message({:error, {:malformed_reserved, <<0::size(3)>>}}, %Slipstream.Connection.State{connection_id: "ad0861bdd01fedb0", trace_id: "6971b906fd5bd2013fd8b7b2e8ff7ad9", client_pid: #PID<0.194.0>, client_ref: #Reference<0.3755692760.2161901571.81591>, config: %Slipstream.Configuration{uri: %URI{scheme: "ws", authority: "localhost:4000", userinfo: nil, host: "localhost", port: 4000, path: "/device_socket/websocket", query: nil, fragment: nil}, heartbeat_interval_msec: 30000, headers: [], serializer: Slipstream.Serializer.PhoenixSocketV2Serializer, json_parser: Jason, reconnect_after_msec: [10, 50, 100, 150, 200, 250, 500, 1000, 2000, 5000], rejoin_after_msec: [100, 500, 1000, 2000, 5000, 10000], mint_opts: [protocols: [:http1]], extensions: [], test_mode?: false}, conn: %Mint.HTTP1{host: "localhost", port: 4000, request: nil, streaming_request: nil, socket: #Port<0.4>, transport: Mint.Core.Transport.TCP, mode: :active, scheme_as_string: "http", requests: {[], []}, state: :open, buffer: "", proxy_headers: [], private: %{scheme: :ws, mode: :active, extensions: [], sec_websocket_key: "InnXaRXzD6Kjl48WKxOuYg==", websockets: [#Reference<0.3755692760.2161901572.79796>]}, log: false}, websocket: %Mint.WebSocket{extensions: [], fragment: nil, private: %{}, buffer: ""}, request_ref: #Reference<0.3755692760.2161901572.79796>, join_params: nil, heartbeat_timer: {:interval, #Reference<0.3755692760.2161901572.79812>}, heartbeat_ref: nil, metadata: %{connection_id: "ad0861bdd01fedb0", state: %Slipstream.Connection.State{connection_id: "ad0861bdd01fedb0", trace_id: "6971b906fd5bd2013fd8b7b2e8ff7ad9", client_pid: #PID<0.194.0>, client_ref: #Reference<0.3755692760.2161901571.81591>, config: %Slipstream.Configuration{uri: %URI{scheme: "ws", authority: "localhost:4000", userinfo: nil, host: "localhost", port: 4000, path: "/device_socket/websocket", query: nil, fragment: nil}, heartbeat_interval_msec: 30000, headers: [], serializer: Slipstream.Serializer.PhoenixSocketV2Serializer, json_parser: Jason, reconnect_after_msec: [10, 50, 100, 150, 200, 250, 500, 1000, 2000, 5000], rejoin_after_msec: [100, 500, 1000, 2000, 5000, 10000], mint_opts: [protocols: [:http1]], extensions: [], test_mode?: false}, conn: nil, websocket: nil, request_ref: nil, join_params: nil, heartbeat_timer: nil, heartbeat_ref: nil, metadata: nil, status: :opened, joins: %{}, leaves: %{}, current_ref: 0, current_ref_str: "0"}, trace_id: "6971b906fd5bd2013fd8b7b2e8ff7ad9", start_time: ~U[2023-06-19 21:30:05.693241Z], start_time_monotonic: -576460751472882083}, status: :connected, joins: %{"device:general" => "1"}, leaves: %{}, current_ref: 1, current_ref_str: "1"})
    (slipstream 1.1.0) lib/slipstream/connection/pipeline.ex:122: anonymous fn/2 in Slipstream.Connection.Pipeline.decode_message/1
    (elixir 1.15.0) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
    (slipstream 1.1.0) lib/slipstream/connection/pipeline.ex:121: Slipstream.Connection.Pipeline.decode_message/1
    (slipstream 1.1.0) lib/slipstream/connection/pipeline.ex:34: anonymous fn/1 in Slipstream.Connection.Pipeline.handle/2
    (slipstream 1.1.0) lib/slipstream/connection/telemetry.ex:29: anonymous fn/2 in Slipstream.Connection.Telemetry.span/2
    (telemetry 1.2.1) /Users/guille/personal/channels_mvp/slipstream_client/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
    (slipstream 1.1.0) lib/slipstream/connection/telemetry.ex:25: Slipstream.Connection.Telemetry.span/2
Last message: {:tcp, #Port<0.4>, <<129, 68, 91, 34, 49, 34, 44, 34, 49, 34, 44, 34, 100, 101, 118, 105, 99, 101, 58, 103, 101, 110, 101, 114, 97, 108, 34, 44, 34, 112, 104, 120, 95, 114, 101, 112, 108, 121, 34, 44, 123, 34, 114, 101, 115, 112, 111, ...>>}
State: %Slipstream.Connection.State{connection_id: "ad0861bdd01fedb0", trace_id: "6971b906fd5bd2013fd8b7b2e8ff7ad9", client_pid: #PID<0.194.0>, client_ref: #Reference<0.3755692760.2161901571.81591>, config: %Slipstream.Configuration{uri: %URI{scheme: "ws", authority: "localhost:4000", userinfo: nil, host: "localhost", port: 4000, path: "/device_socket/websocket", query: nil, fragment: nil}, heartbeat_interval_msec: 30000, headers: [], serializer: Slipstream.Serializer.PhoenixSocketV2Serializer, json_parser: Jason, reconnect_after_msec: [10, 50, 100, 150, 200, 250, 500, 1000, 2000, 5000], rejoin_after_msec: [100, 500, 1000, 2000, 5000, 10000], mint_opts: [protocols: [:http1]], extensions: [], test_mode?: false}, conn: %Mint.HTTP1{host: "localhost", port: 4000, request: nil, streaming_request: nil, socket: #Port<0.4>, transport: Mint.Core.Transport.TCP, mode: :active, scheme_as_string: "http", requests: {[], []}, state: :open, buffer: "", proxy_headers: [], private: %{scheme: :ws, mode: :active, extensions: [], sec_websocket_key: "InnXaRXzD6Kjl48WKxOuYg==", websockets: [#Reference<0.3755692760.2161901572.79796>]}, log: false}, websocket: %Mint.WebSocket{extensions: [], fragment: nil, private: %{}, buffer: ""}, request_ref: #Reference<0.3755692760.2161901572.79796>, join_params: nil, heartbeat_timer: {:interval, #Reference<0.3755692760.2161901572.79812>}, heartbeat_ref: nil, metadata: %{connection_id: "ad0861bdd01fedb0", state: %Slipstream.Connection.State{connection_id: "ad0861bdd01fedb0", trace_id: "6971b906fd5bd2013fd8b7b2e8ff7ad9", client_pid: #PID<0.194.0>, client_ref: #Reference<0.3755692760.2161901571.81591>, config: %Slipstream.Configuration{uri: %URI{scheme: "ws", authority: "localhost:4000", userinfo: nil, host: "localhost", port: 4000, path: "/device_socket/websocket", query: nil, fragment: nil}, heartbeat_interval_msec: 30000, headers: [], serializer: Slipstream.Serializer.PhoenixSocketV2Serializer, json_parser: Jason, reconnect_after_msec: [10, 50, 100, 150, 200, 250, 500, 1000, 2000, 5000], rejoin_after_msec: [100, 500, 1000, 2000, 5000, 10000], mint_opts: [protocols: [:http1]], extensions: [], test_mode?: false}, conn: nil, websocket: nil, request_ref: nil, join_params: nil, heartbeat_timer: nil, heartbeat_ref: nil, metadata: nil, status: :opened, joins: %{}, leaves: %{}, current_ref: 0, current_ref_str: "0"}, trace_id: "6971b906fd5bd2013fd8b7b2e8ff7ad9", start_time: ~U[2023-06-19 21:30:05.693241Z], start_time_monotonic: -576460751472882083}, status: :connected, joins: %{"device:general" => "1"}, leaves: %{}, current_ref: 1, current_ref_str: "1"}

Not sure if this is a known issue from OTP 26 or if there is some extra configuration that I'm missing. Any pointers?

the-mikedavis commented 1 year ago

Does the server you're testing against use any websocket extensions, for example per-message-deflate? (Maybe something changed in :zlib?) I haven't been able to reproduce this yet against the slipstream testsuite running on OTP26

guillego commented 1 year ago

Thanks for taking a look at it! I was testing against a fresh phoenix project with just a generated socket and channel, this is the Mix file:

defmodule ChannelServer.MixProject do
  use Mix.Project

  def project do
    [
      app: :channel_server,
      version: "0.1.0",
      elixir: "~> 1.14",
      elixirc_paths: elixirc_paths(Mix.env()),
      start_permanent: Mix.env() == :prod,
      aliases: aliases(),
      deps: deps()
    ]
  end

  # Configuration for the OTP application.
  #
  # Type `mix help compile.app` for more information.
  def application do
    [
      mod: {ChannelServer.Application, []},
      extra_applications: [:logger, :runtime_tools]
    ]
  end

  # Specifies which paths to compile per environment.
  defp elixirc_paths(:test), do: ["lib", "test/support"]
  defp elixirc_paths(_), do: ["lib"]

  # Specifies your project dependencies.
  #
  # Type `mix help deps` for examples and options.
  defp deps do
    [
      {:phoenix, "~> 1.7.6"},
      {:phoenix_html, "~> 3.3"},
      {:phoenix_live_reload, "~> 1.2", only: :dev},
      {:phoenix_live_view, "~> 0.19.0"},
      {:floki, ">= 0.30.0", only: :test},
      {:phoenix_live_dashboard, "~> 0.8.0"},
      {:esbuild, "~> 0.7", runtime: Mix.env() == :dev},
      {:tailwind, "~> 0.2.0", runtime: Mix.env() == :dev},
      {:swoosh, "~> 1.3"},
      {:finch, "~> 0.13"},
      {:telemetry_metrics, "~> 0.6"},
      {:telemetry_poller, "~> 1.0"},
      {:gettext, "~> 0.20"},
      {:jason, "~> 1.2"},
      {:plug_cowboy, "~> 2.5"}
    ]
  end

  # Aliases are shortcuts or tasks specific to the current project.
  # For example, to install project dependencies and perform other setup tasks, run:
  #
  #     $ mix setup
  #
  # See the documentation for `Mix` for more info on aliases.
  defp aliases do
    [
      setup: ["deps.get", "assets.setup", "assets.build"],
      "assets.setup": ["tailwind.install --if-missing", "esbuild.install --if-missing"],
      "assets.build": ["tailwind default", "esbuild default"],
      "assets.deploy": ["tailwind default --minify", "esbuild default --minify", "phx.digest"]
    ]
  end
end

And .tool-versions:

erlang 25.3.2.2
elixir 1.15.0-otp-25
guillego commented 1 year ago

Hi @the-mikedavis! I recorded a little terminal session to demo the problem: https://asciinema.org/a/XoY10nHaTYAI5548QyMrRgRJC

You can also find the phoenix test server that I'm using in this repo.

Let me know if there's anything else I can help with to try to find the issue!

the-mikedavis commented 1 year ago

Hmm strangely I can't reproduce with https://github.com/guillego/slipstream_client and https://github.com/guillego/phoenix_channels_server with

erlang 26.0.1
elixir 1.15.0-otp-26

and the same steps as the asciicast. I'll tinker with the projects some more but could you make a packet capture with something like wireshark? If you have wireshark installed you can use tshark -i lo -w capture.pcap

guillego commented 1 year ago

That's so strange! :/ I tried reinstalling erlang/otp 26 but got the same error. Here is the network capture

the-mikedavis commented 1 year ago

The error message is kind of breaking my brain actually 😅

That malformed_reserved can only come from this block: https://github.com/elixir-mint/mint_web_socket/blob/f44a35ef7883008d435091c96e06a41d60849601/lib/mint/web_socket/frame.ex#L378-L384 which is guarded on reserved != <<0::size(3)>> but the error message says:

Slipstream.Connection.Impl.decode_message({:error, {:malformed_reserved, <<0::size(3)>>}}, ..)

And looking at the packet capture, the reserved is definitely <<0::size(3)>>

frame

I wonder if there's some bug in OTP for bitstring matching in guards? I know they added a few new bitstring matching instructions in OTP26 and maybe the JIT definitions for those instructions have a bug? Are you running on a machine with an x86 CPU or ARM (for example the Macs with the Apple CPUs)? I'm on x86 but I can try reproducing this on ARM.

the-mikedavis commented 1 year ago

Ah ok I can reproduce this on a macbook! I'll dig into this a little within OTP and try to fix it or at least make a nice report there.

Edit: https://github.com/erlang/otp/issues/7433

guillego commented 1 year ago

Amazing! Should have mentioned I'm in Apple silicon

guillego commented 1 year ago

Thanks for investigating and opening the issue in otp @the-mikedavis! Looks like they fixed it 🙌 Do you know how often they do patch releases?

the-mikedavis commented 1 year ago

Hmm I'm not sure. I think it will be sooner rather than later since there are a bunch of new fixes on the maint branch. My experience has been a new patch release every few weeks after a major release to iron out bugs like this.

guillego commented 1 year ago

Perfect! Thanks a lot 😄

the-mikedavis commented 1 year ago

I tried out the new 26.0.2 patch release and I no longer see the crash :tada: https://github.com/erlang/otp/releases/tag/OTP-26.0.2

guillego commented 1 year ago

It works perfectly! 🎆 Thanks a lot @the-mikedavis this was suuper helpful! 💛