Kraigie / nostrum

Elixir Discord Library
https://kraigie.github.io/nostrum/
MIT License

User permanently stops receiving events if HEARTBEAT_ACK is not received in time #57

Closed skippi closed 6 years ago

skippi commented 6 years ago
04:01:43.677 [warn]  HEARTBEAT_ACK not received in time, disconnecting
04:03:31.167 [warn]  Unhandled websocket message: {:DOWN, #Reference<0.2512857106.1813512198.243387>, :process, #PID<0.232.0>, {:shutdown, :nxdomain}}

If the user loses internet access for a couple of minutes and a HEARTBEAT_ACK is not received in time, the user stops receiving events. This persists even after the connection is restored.

Expected: Even after losing internet access for hours, the library user starts receiving events again once the connection is restored.

Observed: The library user receives no further events after the gun process shuts down, even after reconnecting to the internet.

Steps to reproduce:

  1. Start the nostrum application in developer mode.
  2. Once the gun process starts, disconnect from the internet.
  3. Wait until the aforementioned errors pop up.
  4. Reconnect to the internet.
  5. End result is that no events are received.

I think the likely cause is that we instruct gun to shut down if we don't receive a HEARTBEAT_ACK in time. As far as I can tell, we also never ask gun to reopen its connection.
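
To illustrate the suspected gap, here is a minimal sketch of a process that schedules a reconnect after a missed heartbeat ack instead of staying disconnected. It is not nostrum's actual code or the eventual fix: `ReconnectSketch`, `connect_fun`, the `:heartbeat_timeout` message, and the backoff numbers are all hypothetical. In a real client, `connect_fun` would presumably wrap the websocket library's open and upgrade calls (for gun, something like `:gun.open/3`).

```elixir
defmodule ReconnectSketch do
  use GenServer

  # `connect_fun` is a hypothetical zero-arity function returning
  # {:ok, conn} or {:error, reason}. It stands in for the real
  # websocket open/upgrade calls, which are not shown here.
  def start_link(connect_fun), do: GenServer.start_link(__MODULE__, connect_fun)

  def status(pid), do: GenServer.call(pid, :status)

  @impl true
  def init(connect_fun) do
    # Connect asynchronously so init/1 never blocks on the network.
    send(self(), :connect)
    {:ok, %{connect_fun: connect_fun, conn: nil, attempt: 0}}
  end

  @impl true
  def handle_info(:connect, state) do
    case state.connect_fun.() do
      {:ok, conn} ->
        {:noreply, %{state | conn: conn, attempt: 0}}

      {:error, _reason} ->
        # Keep retrying with a capped delay instead of giving up.
        Process.send_after(self(), :connect, backoff(state.attempt))
        {:noreply, %{state | conn: nil, attempt: state.attempt + 1}}
    end
  end

  # Hypothetical message for "HEARTBEAT_ACK not received in time":
  # drop the old connection and immediately queue a reconnect.
  def handle_info(:heartbeat_timeout, state) do
    send(self(), :connect)
    {:noreply, %{state | conn: nil}}
  end

  @impl true
  def handle_call(:status, _from, state) do
    {:reply, {state.conn, state.attempt}, state}
  end

  # Exponential backoff in milliseconds, capped at 30 seconds.
  def backoff(attempt) do
    min(1_000 * Integer.pow(2, min(attempt, 5)), 30_000)
  end
end
```

The key point of the sketch is that every path that drops the connection also queues a `:connect` message, so the process keeps retrying until the network comes back.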

Credit to barkerja#9999 for bringing up the issue on the nostrum DAPI channel.

Kraigie commented 6 years ago

This has been fixed with 2a75a2fe86feda016c94be9a97cb15b2bc803ffa. Thanks for the report!

barkerja commented 6 years ago

Updated this morning, and did another quick test by simply shutting off WAN access to the bot. Received this error:

09:00:08.059 [error] GenServer #PID<0.691.0> terminating
** (KeyError) key :zlib_buffer not found
    (nostrum) lib/nostrum/shard/session.ex:169: Nostrum.Shard.Session.handle_info/2
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:gun_down, #PID<0.693.0>, :ws, :closed, [], []}

Kraigie commented 6 years ago

Do you have zlib_stream set to true in your config? I can see how this would be an issue if that is set to false. I'll take care of it when I'm around a desktop later today.
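
As a rough illustration of how such a `KeyError` can arise (an assumption, not the actual code at `session.ex:169`): if the session state only carries a `:zlib_buffer` key when compression is enabled, a strict map update on that key fails when it is disabled. The module and function names below are hypothetical.

```elixir
defmodule ZlibStateSketch do
  # Hypothetical state builder: :zlib_buffer exists only when
  # compression is enabled.
  def init_state(zlib_stream?) do
    base = %{seq: nil}
    if zlib_stream?, do: Map.put(base, :zlib_buffer, <<>>), else: base
  end

  # Strict update syntax raises KeyError when the key is absent.
  def reset_strict(state), do: %{state | zlib_buffer: <<>>}

  # Tolerant variant: reset the buffer only if the key is present.
  def reset_safe(%{zlib_buffer: _} = state), do: %{state | zlib_buffer: <<>>}
  def reset_safe(state), do: state
end
```

With `state = ZlibStateSketch.init_state(false)`, calling `reset_strict(state)` raises a `KeyError` for `:zlib_buffer`, while `reset_safe(state)` returns the state unchanged.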

barkerja commented 6 years ago

> Do you have zlib_stream set to true in your config?

Negative. All I have is:

config :nostrum,
  token: "xxxxx",
  num_shards: :auto

barkerja commented 6 years ago

Setting that config to true actually produces a different error:

** (Mix) Could not start application nostrum: Nostrum.start(:normal, []) returned an error: shutdown: failed to start child: Nostrum.Shard.Supervisor
    ** (EXIT) shutdown: failed to start child: 0
        ** (EXIT) shutdown: failed to start child: Nostrum.Shard.Session
            ** (EXIT) an exception was raised:
                ** (ErlangError) Erlang error: :bad_windowbits
                    :zlib.arg_bitsz/1
                    :zlib.inflateInit/3
                    (nostrum) lib/nostrum/shard/session.ex:68: Nostrum.Shard.Session.init/1
                    (stdlib) gen_server.erl:365: :gen_server.init_it/2
                    (stdlib) gen_server.erl:333: :gen_server.init_it/6
                    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

Kraigie commented 6 years ago

I just pushed a fix for the former issue. For the latter, it looks like there's some fuckery afoot. You can check the Discord channel if you want to see my struggle. Can I ask what OS you're on?

barkerja commented 6 years ago

I'm running macOS 10.13.4.

❯ elixir -v
Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Elixir 1.6.4 (compiled with OTP 20)

Also, I am on the Discord channel, I'm the reporter of the original issue. :)

barkerja commented 6 years ago

Both issues appear to be fixed 👍: the initial issue of not reconnecting and the secondary issue, key :zlib_buffer not found.

Thank you!