Kraigie / nostrum

Elixir Discord Library
https://kraigie.github.io/nostrum/
MIT License

Handle case of non-matching gun conn pids when receiving a gun_down message #611

Closed Th3-M4jor closed 3 months ago

Th3-M4jor commented 3 months ago

Occasionally, when my bot has to reconnect, I would see two ready events in quick succession. I noticed from the stacktrace below that the PIDs for the gun connections didn't match, which caused the Shard to blow up.

Stacktrace:

```erlang
2024-06-23 14:07:30.419 pid=<0.2610.0> [error] ** State machine <0.2610.0> terminating
** Last event = {info,{gun_down,<0.563597.0>,ws,normal,
                       [#Ref<0.1002467860.1931214849.138526>]}}
** When server state = {connected,
       #{stream => #Ref<0.1002467860.1931214849.143966>, seq => 30,
         '__struct__' => 'Elixir.Nostrum.Struct.WSState',
         session => <<"e7c5b7a3c54504b725b76c1ffa68496a">>,
         conn => <0.564181.0>,
         gateway => <<"gateway-us-east1-d.discord.gg">>,
         resume_gateway => <<"wss://gateway-us-east1-d.discord.gg">>,
         shard_num => 0,total_shards => 1,
         heartbeat_ack => true,
         heartbeat_interval => 41250,
         last_heartbeat_ack => #{microsecond => {108488,6}, second => 21,
                                 calendar => 'Elixir.Calendar.ISO', month => 6,
                                 '__struct__' => 'Elixir.DateTime',
                                 utc_offset => 0,std_offset => 0,
                                 year => 2024,hour => 14,day => 23,
                                 zone_abbr => <<"UTC">>,minute => 7,
                                 time_zone => <<"Etc/UTC">>},
         conn_pid => <0.2610.0>,
         last_heartbeat_send => #{microsecond => {164837,6}, second => 34,
                                  calendar => 'Elixir.Calendar.ISO', month => 6,
                                  '__struct__' => 'Elixir.DateTime',
                                  utc_offset => 0,std_offset => 0,
                                  year => 2024,hour => 14,day => 23,
                                  zone_abbr => <<"UTC">>,minute => 6,
                                  time_zone => <<"Etc/UTC">>},
         compress_ctx => #Ref<0.1002467860.1927938051.205417>}}
** Reason for termination = error:function_clause
** Callback modules = ['Elixir.Nostrum.Shard.Session']
** Callback mode = [state_functions,state_enter]
** Stacktrace =
** [{'Elixir.Nostrum.Shard.Session',connected,
     [info,
      {gun_down,<0.563597.0>,ws,normal,[#Ref<0.1002467860.1931214849.138526>]},
      #{stream => #Ref<0.1002467860.1931214849.143966>,seq => 30,
        '__struct__' => 'Elixir.Nostrum.Struct.WSState',
        session => <<"e7c5b7a3c54504b725b76c1ffa68496a">>,
        conn => <0.564181.0>,
        gateway => <<"gateway-us-east1-d.discord.gg">>,
        resume_gateway => <<"wss://gateway-us-east1-d.discord.gg">>,
        shard_num => 0,total_shards => 1,heartbeat_ack => true,
        heartbeat_interval => 41250,
        last_heartbeat_ack => #{microsecond => {108488,6}, second => 21,
                                calendar => 'Elixir.Calendar.ISO',month => 6,
                                '__struct__' => 'Elixir.DateTime',utc_offset => 0,
                                std_offset => 0,year => 2024,hour => 14,day => 23,
                                zone_abbr => <<"UTC">>,minute => 7,
                                time_zone => <<"Etc/UTC">>},
        conn_pid => <0.2610.0>,
        last_heartbeat_send => #{microsecond => {164837,6}, second => 34,
                                 calendar => 'Elixir.Calendar.ISO',month => 6,
                                 '__struct__' => 'Elixir.DateTime',utc_offset => 0,
                                 std_offset => 0,year => 2024,hour => 14,day => 23,
                                 zone_abbr => <<"UTC">>,minute => 6,
                                 time_zone => <<"Etc/UTC">>},
        compress_ctx => #Ref<0.1002467860.1927938051.205417>}],
     [{file,"lib/nostrum/shard/session.ex"},{line,324}]},
    {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1395}]},
    {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]
** Time-outs: {1,[{state_timeout,send_heartbeat}]}
```
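The `function_clause` error suggests the existing `connected/3` clause pattern-matches on the pid of the connection currently stored in the state, so a `:gun_down` from an older, already-replaced connection has no clause to land in. Below is a minimal, hypothetical gen_statem sketch of the idea behind this change; the module name, clause shapes, and the `:reconnect` internal event are assumptions for illustration, not nostrum's actual `Nostrum.Shard.Session` code.

```elixir
defmodule MyBot.ShardSessionSketch do
  # Simplified sketch: only react to a :gun_down whose pid matches the
  # connection we currently hold; drop messages from stale connections.
  @behaviour :gen_statem
  require Logger

  @impl true
  def callback_mode, do: [:state_functions, :state_enter]

  @impl true
  def init(state), do: {:ok, :connected, state}

  # Pid matches the live connection: hand off to normal reconnect handling.
  def connected(:info, {:gun_down, conn, _proto, _reason, _streams}, %{conn: conn} = state) do
    {:keep_state, state, {:next_event, :internal, :reconnect}}
  end

  # Pid belongs to an already-replaced connection: safe to ignore.
  def connected(:info, {:gun_down, stale, _proto, _reason, _streams}, _state) do
    Logger.debug("Ignoring :gun_down from stale connection #{inspect(stale)}")
    :keep_state_and_data
  end

  def connected(:internal, :reconnect, state) do
    # Real reconnect logic lives in Nostrum.Shard.Session; elided here.
    {:keep_state, state}
  end

  def connected(:enter, _old_state, state), do: {:keep_state, state}
end
```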
jchristgit commented 3 months ago

This works, but I feel like we're solving it a bit too high in the stack. Shouldn't we instead forcibly `:gun.close` and `:gun.flush` any previous connection when we go down? We already have the `:reconnect` internal event for the connected state.
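For reference, a hedged sketch of what that alternative could look like; `close_old_connection/1` and the module name are invented for illustration and are not nostrum's actual code.

```elixir
defmodule MyBot.GunCleanup do
  # Hypothetical helper: before opening a new gun connection, brutally close
  # the old one and flush any of its messages already sitting in the calling
  # process's mailbox, so a stale :gun_down can never reach the new
  # connected state. :gun.flush/1 only clears the caller's own mailbox, so
  # this must run in the process that owns the connection.
  def close_old_connection(nil), do: :ok

  def close_old_connection(old_conn) when is_pid(old_conn) do
    :ok = :gun.close(old_conn)
    :ok = :gun.flush(old_conn)
  end
end
```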

Th3-M4jor commented 3 months ago

Looking at the code, we do brutally close and then flush the connection whenever it goes down or we otherwise get disconnected.

What could theoretically be happening is a race condition, where we close and flush the connection on our end at the same time as it is closed on the other end, and for some odd reason the `:gun_down` message doesn't end up in the process mailbox until after we've reconnected. Not sure how else this could be happening.
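To make the suspected interleaving concrete, here is the hypothesised timeline as comments. This ordering is an assumption, not something confirmed from traces.

```elixir
# Hypothesised interleaving (assumption, not confirmed):
#
#   shard session process             old gun connection
#   ---------------------             ------------------
#   :gun.close(old_conn)
#   :gun.flush(old_conn)              still tearing down
#                                     emits {:gun_down, old_conn, :ws, :normal, streams}
#   opens new conn, session resumes   (message now queued in our mailbox)
#   receives the stale :gun_down
#   -> no clause matches the old pid, so the gen_statem exits with :function_clause
```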

jchristgit commented 3 months ago

> What could theoretically be happening is a race condition, where we close and flush the connection on our end at the same time as it is closed on the other end, and for some odd reason the `:gun_down` message doesn't end up in the process mailbox until after we've reconnected. Not sure how else this could be happening.

Hm, from my understanding, flush should take care of it; otherwise it sounds like an upstream bug. Or maybe we are misusing it.

To debug this, having traces from `:sys.trace` would be really useful to know what the ratelimiter was cooking up in the meantime.
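A hedged sketch of enabling that tracing from an attached IEx session; how you locate the shard session pid is an assumption here, adjust to however you find the process in your deployment.

```elixir
# Turn on :sys tracing for a shard session so every gen_statem event it
# receives (including any :gun_down) is printed to the shell.
# The pid below is the one from the crash report above; replace it with the
# pid of your own shard session process.
shard_session_pid = :c.pid(0, 2610, 0)
:sys.trace(shard_session_pid, true)

# ...reproduce the disconnect, then turn tracing back off:
:sys.trace(shard_session_pid, false)
```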

Ultimately I think the best solution is to write our own WebSocket implementation to fully understand what is going on. I am about half joking.

Th3-M4jor commented 3 months ago

I've got logs for my bot going back to early May, and after looking again, it seems this specific error only happened to me once 🤔. So it's likely this was just one of those weird flukes that won't happen again for a long time.