Kraigie / nostrum

Elixir Discord Library
https://kraigie.github.io/nostrum/
MIT License
595 stars 128 forks source link

nostrum crashes on receiving particular event types #24

Closed ndac-todoroki closed 6 years ago

ndac-todoroki commented 7 years ago
  1. iex -S mix
  2. let bot join server
  3. kick bot out of server
  4. application exits!

sample log (from where bot gets kicked): gist

In dispatch.ex 's GUILD_DELETE pattern, it tries to get a p.unavailable boolean. But from the Discord document it says unavailable is true or not set. (hey is that what we call a boolean!?)

whether the guild is unavailable, should always be true. if not set, this signifies that the user was removed from the guild

So the dot-access method might be the very first problem here.

ndac-todoroki commented 7 years ago

It also dies when receiving a :MESSAGE_REACTION_REMOVE_ALL, but this time with a (FunctionClauseError) no function clause matching in Nostrum.Shard.Dispatch.handle_event/3. This maybe is because a MESSAGE_REACTION_REMOVE_ALL match is not present, and, this else line has wrong argument numbers.

And then it says [warn] Failed to create new guild process: {:already_registered, #PID<0.xxx.0>}, so maybe we need to do some kind of a cleanup before terminating GenStage? (using GenStage.terminate/2?)

ndac-todoroki commented 7 years ago

I tried @cookkkie 's branch and it worked, though now sometimes a new error occurs:

[debug] HEARTBEAT_ACK    # <- Kicked here
[debug] GUILD_ROLE_DELETE
[debug] GUILD_DELETE
[debug] GUILD_MEMBER_REMOVE
[error] GenServer #PID<0.340.0> terminating
** (stop) exited in: GenServer.call(#PID<0.382.0>, {:delete, :member, 346473476296540161, %{avatar: nil, bot: true, discriminator: "5185", id: 340932764952035330, username: "AqmBot"}}, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir) lib/gen_server.ex:737: GenServer.call/3
    (nostrum) lib/nostrum/shard/dispatch.ex:135: Nostrum.Shard.Dispatch.handle_event/3
    (nostrum) lib/nostrum/shard/dispatch.ex:20: Nostrum.Shard.Dispatch.handle/1
    (elixir) lib/enum.ex:1229: Enum."-map/2-lists^map/1-0-"/2
    (nostrum) lib/nostrum/shard/stage/cache.ex:22: Nostrum.Shard.Stage.Cache.handle_events/3
    (gen_stage) lib/gen_stage.ex:2543: GenStage.consumer_dispatch/7
    (gen_stage) lib/gen_stage.ex:2688: GenStage.take_pc_events/3
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_consumer", {#PID<0.339.0>, #Reference<0.0.3.4647>}, [{%{d: %{guild_id: 346473476296540161, user: %{avatar: nil, bot: true, discriminator: "5185", id: 340932764952035330, username: "Aqm Bot"}}, op: 0, s: 28, t: :GUILD_MEMBER_REMOVE}, %{gateway: 'gateway.discord.gg', gun_pid: #PID<0.341.0>, heartbeat_ack: true, producer_pid: #PID<0.339.0>, reconnect_attempts: 0, seq: 27, session: "7a17a1d5d7fc137a0fd5aac4a100b552", shard_num: 0, shard_pid: #PID<0.338.0>, token: "MzQwOTMyNzY0OTUyMDM1MzMw.DGlNlg.0Jdlj33F6bCs-VaNKKOerAh7fFM"}}]}
State: []

Discord seems to send events not always in order, and when :GUILD_MEMBER_REMOVE:GUILD_DELETE it will succeed, but if GUILD_MEMBER_REMOVE comes afterwards it leads to the crash above.

cookkkie commented 7 years ago

This looks like a race condition. Seems like the guild registry didn't have the time to process this particular guild exit signal. What I'd suggest is to drop the use of GenServer.call in the GuildState module and just use send/2. The advantage of send is that even if the process isn't alive it'll not throw or something.

But if you actually want to completely get rid off that race condition, the Session process shouldn't ask the guild registry (since it's async). But it should keep a map guild_id -> guild_pid in its state and remove the pid from the map when it receive a GUILD_DELETE. That way will make "the guild discovery synchronous" and prevent RCs.