Closed Th3-M4jor closed 3 months ago
This works but I feel like we're solving it a bit too high in the stack.
Shouldn't we instead forcibly :gun.close
and :gun.flush
for any
previous connection when we go down? We already have the :reconnect
internal event for connected
state.
Looking at the code, we do brutally close then flush the connection whenever when it goes down or otherwise be disconnected.
What could theoretically be happening is a race condition? Where we close and flush the connection on our end at the same time as it being closed on the other end and for some odd reason it doesn't end up in the process mailbox until after we've reconnected. Not sure how else this could be happening.
What could theoretically be happening is a race condition? Where we close and flush the connection on our end at the same time as it being closed on the other end and for some odd reason it doesn't end up in the process mailbox until after we've reconnected. Not sure how else this could be happening.
Hm, from my understanding flush should take care of it, otherwise it sounds like an upstream bug. Or maybe we are misusing it.
To debug this stuff having traces from :sys.trace
would really be
useful to know what the ratelimiter was cooking up in the meantime.
Ultimately I think the best solution is to write our own WebSocket implementation to fully understand what is going on. I am about half joking.
I've got logs for my bot going back to early May and after looking again, it seems this specific error only happened once to me 🤔. So its likely this was just one of those weird flukes, that likely wouldn't happen again for a long time.
Occasionally when my bot has to reconnect I would see 2 ready events in quick succession. I noticed from the below stacktrace that the PIDs for the gun connections didn't match and that caused the Shard to blow up.
Stacktrace
```erlang 2024-06-23 14:07:30.419 pid=<0.2610.0> [error] ** State machine <0.2610.0> terminating ** Last event = {info,{gun_down,<0.563597.0>,ws,normal, [#Ref<0.1002467860.1931214849.138526>]}} ** When server state = {connected,#{stream => #Ref<0.1002467860.1931214849.143966>, seq => 30, '__struct__' => 'Elixir.Nostrum.Struct.WSState', session => <<"e7c5b7a3c54504b725b76c1ffa68496a">>, conn => <0.564181.0>, gateway => <<"gateway-us-east1-d.discord.gg">>, resume_gateway => <<"wss://gateway-us-east1-d.discord.gg">>, shard_num => 0,total_shards => 1, heartbeat_ack => true, heartbeat_interval => 41250, last_heartbeat_ack => #{microsecond => {108488,6}, second => 21, calendar => 'Elixir.Calendar.ISO', month => 6, '__struct__' => 'Elixir.DateTime', utc_offset => 0,std_offset => 0, year => 2024,hour => 14,day => 23, zone_abbr => <<"UTC">>,minute => 7, time_zone => <<"Etc/UTC">>}, conn_pid => <0.2610.0>, last_heartbeat_send => #{microsecond => {164837,6}, second => 34, calendar => 'Elixir.Calendar.ISO', month => 6, '__struct__' => 'Elixir.DateTime', utc_offset => 0,std_offset => 0, year => 2024,hour => 14,day => 23, zone_abbr => <<"UTC">>,minute => 6, time_zone => <<"Etc/UTC">>}, compress_ctx => #Ref<0.1002467860.1927938051.205417>}} ** Reason for termination = error:function_clause ** Callback modules = ['Elixir.Nostrum.Shard.Session'] ** Callback mode = [state_functions,state_enter] ** Stacktrace = ** [{'Elixir.Nostrum.Shard.Session',connected, [info, {gun_down,<0.563597.0>,ws,normal, [#Ref<0.1002467860.1931214849.138526>]}, #{stream => #Ref<0.1002467860.1931214849.143966>,seq => 30, '__struct__' => 'Elixir.Nostrum.Struct.WSState', session => <<"e7c5b7a3c54504b725b76c1ffa68496a">>, conn => <0.564181.0>, gateway => <<"gateway-us-east1-d.discord.gg">>, resume_gateway => <<"wss://gateway-us-east1-d.discord.gg">>, shard_num => 0,total_shards => 1,heartbeat_ack => true, heartbeat_interval => 41250, last_heartbeat_ack => #{microsecond => {108488,6}, second => 21,calendar => 'Elixir.Calendar.ISO',month => 6, '__struct__' => 'Elixir.DateTime',utc_offset => 0, std_offset => 0,year => 2024,hour => 14,day => 23, zone_abbr => <<"UTC">>,minute => 7, time_zone => <<"Etc/UTC">>}, conn_pid => <0.2610.0>, last_heartbeat_send => #{microsecond => {164837,6}, second => 34,calendar => 'Elixir.Calendar.ISO',month => 6, '__struct__' => 'Elixir.DateTime',utc_offset => 0, std_offset => 0,year => 2024,hour => 14,day => 23, zone_abbr => <<"UTC">>,minute => 6, time_zone => <<"Etc/UTC">>}, compress_ctx => #Ref<0.1002467860.1927938051.205417>}], [{file,"lib/nostrum/shard/session.ex"},{line,324}]}, {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1395}]}, {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}] ** Time-outs: {1,[{state_timeout,send_heartbeat}]} ```