andersfylling / disgord

Go module for interacting with the documented Discord's bot interface; Gateway, REST requests and voice
BSD 3-Clause "New" or "Revised" License
496 stars 70 forks source link

Heartbeat process causes disconnection for bots with large guild membership #415

Open pravus opened 3 years ago

pravus commented 3 years ago

Describe the bug

When connecting to Discord with a large guild membership the heartbeat process will continually reconnect due to not receiving heartbeat ACKs. This appears to be caused by Discord prioritizing GUILD_CREATE events which are sent as a synchronization process after the READY event.

Expected behavior

Disgord should not trigger a reconnect attempt during guild synchronization even if heartbeat ACKs are delayed.

Error messages

This appears as normal heartbeat timeout errors after the initial connection setup is performed and the READY event has been received.

INFO[0000] Connecting to discord Gateway                
DEBU[0000] shard 0 intents: { Guilds | GuildMembers | GuildBans | GuildEmojis | GuildIntegrations | GuildWebhooks | GuildInvites | GuildVoiceStates | GuildMessages | GuildMessageReactions | GuildMessageTyping } 
DEBU[0000] shard 0 rejects events: [PRESENCE_UPDATE GUILD_MEMBER_UPDATE GUILD_MEMBER_REMOVE PRESENCES_REPLACE] 
DEBU[0000] shard 1 intents: { Guilds | GuildMembers | GuildBans | GuildEmojis | GuildIntegrations | GuildWebhooks | GuildInvites | GuildVoiceStates | GuildMessages | GuildMessageReactions | GuildMessageTyping } 
DEBU[0000] shard 1 rejects events: [PRESENCE_UPDATE GUILD_MEMBER_UPDATE GUILD_MEMBER_REMOVE PRESENCES_REPLACE] 
DEBU[0000] shard 2 intents: { Guilds | GuildMembers | GuildBans | GuildEmojis | GuildIntegrations | GuildWebhooks | GuildInvites | GuildVoiceStates | GuildMessages | GuildMessageReactions | GuildMessageTyping } 
DEBU[0000] shard 2 rejects events: [PRESENCE_UPDATE GUILD_MEMBER_UPDATE GUILD_MEMBER_REMOVE PRESENCES_REPLACE] 
DEBU[0000] shard 3 intents: { Guilds | GuildMembers | GuildBans | GuildEmojis | GuildIntegrations | GuildWebhooks | GuildInvites | GuildVoiceStates | GuildMessages | GuildMessageReactions | GuildMessageTyping } 
DEBU[0000] shard 3 rejects events: [PRESENCE_UPDATE GUILD_MEMBER_UPDATE GUILD_MEMBER_REMOVE PRESENCES_REPLACE] 
DEBU[0000] shard 4 intents: { Guilds | GuildMembers | GuildBans | GuildEmojis | GuildIntegrations | GuildWebhooks | GuildInvites | GuildVoiceStates | GuildMessages | GuildMessageReactions | GuildMessageTyping } 
DEBU[0000] shard 4 rejects events: [PRESENCE_UPDATE GUILD_MEMBER_UPDATE GUILD_MEMBER_REMOVE PRESENCES_REPLACE] 
DEBU[0000] shard 5 intents: { Guilds | GuildMembers | GuildBans | GuildEmojis | GuildIntegrations | GuildWebhooks | GuildInvites | GuildVoiceStates | GuildMessages | GuildMessageReactions | GuildMessageTyping } 
DEBU[0000] shard 5 rejects events: [PRESENCE_UPDATE GUILD_MEMBER_UPDATE GUILD_MEMBER_REMOVE PRESENCES_REPLACE] 
DEBU[0000] [ws-e,s:1,shard:3]trying to connect          
DEBU[0000] [shardSync]shard3is waiting to identify      
DEBU[0000] [ws-e,s:2,shard:3]waiting to send identify/resume 
DEBU[0000] [ws-e,s:4,shard:3]Ready to receive operation codes... 
DEBU[0000] [ws-e,s:3,shard:3]starting emitter           
DEBU[0000] [ws-e,s:5,shard:3]starting receiver          
DEBU[0000] [ws-e,s:6,shard:3]sendIdentityPacket is acquiring once channel 
DEBU[0000] [ws-e,s:7,shard:3]writing to once channel0xc00007eb40 
DEBU[0000] [ws-e,s:9,shard:3]finished writing to once channel0xc00007eb40 
DEBU[0000] [ws-e,s:8,shard:3]heartbeat ACK ok           
DEBU[0000] [ws-e,s:11,shard:3]sent heartbeat            
DEBU[0000] [ws-e,s:10,shard:3]sent identify/resume      
DEBU[0000] [shardSync]shard3waited and finished execution after171.316543ms 
DEBU[0000] [ws-e,s:12,shard:3]establishing connection succeeded 
DEBU[0000] [ws-e,s:1,shard:4]trying to connect          

...

INFO[0031] Connected                                    
DEBU[0041] [ws-e,s:13,shard:3]heartbeat ACK ok          
DEBU[0041] [ws-e,s:14,shard:3]sent heartbeat            
DEBU[0047] [ws-e,s:13,shard:4]heartbeat ACK ok          
DEBU[0047] [ws-e,s:14,shard:4]sent heartbeat            
DEBU[0053] [ws-e,s:13,shard:5]heartbeat ACK ok          
DEBU[0053] [ws-e,s:14,shard:5]sent heartbeat            
DEBU[0059] [ws-e,s:13,shard:0]heartbeat ACK ok          
DEBU[0059] [ws-e,s:14,shard:0]sent heartbeat            
DEBU[0067] [ws-e,s:13,shard:1]heartbeat ACK ok          
DEBU[0067] [ws-e,s:14,shard:1]sent heartbeat            
DEBU[0073] [ws-e,s:13,shard:2]heartbeat ACK ok          
DEBU[0073] [ws-e,s:14,shard:2]sent heartbeat            
INFO[0083] [ws-e,s:15,shard:3]heartbeat ACK was not received, forcing reconnect 
DEBU[0083] [ws-e,s:16,shard:3]stopping pulse            
DEBU[0083] [ws-e,s:17,shard:3]is reconnecting           
DEBU[0083] [ws-e,s:18,shard:3]closing operations handler 
DEBU[0083] [ws-e,s:19,shard:3]closing emitter           
INFO[0083] [ws-e,s:20,shard:3]disconnected              
DEBU[0083] [ws-e,s:21,shard:3]reconnecting failed: failed to close WebSocket: failed to write control frame opClose: failed to write frame: WebSocket closed: read timed out: context canceled 
DEBU[0083] [ws-e,s:22,shard:3]trying to connect         
DEBU[0083] [ws-e,s:23,shard:3]closing receiver          

...

Desktop (please complete the following information):

Additional context

I did some testing with event flow and found that Discord will block heartbeat ACKs during the guild synchronization phase of a new connection which leads to the timeout issues described above.

What appears to be happening is that a stream of GUILD_CREATE events will be received after the READY event. During this time a heartbeat request will be sent but no ACK will be received which triggers the reconnect logic. If the reconnect logic is disabled, multiple heartbeat ACKs are received after the guild synchronization is complete indicating that they are delayed.

I tested some possible solutions but didn't find any of them satisfactory:

1) Set lastHeartbeatAck for every GUILD_CREATE event. This resolves the issue by keeping the ACK timestamp fresh as events are received during guild synchronization but is philosophically wrong.

2) Set a flag that will ignore missed heartbeats after the READY event until the first ACK is received. This would effectively turn off the reconnect logic until after guild synchronization is complete but offers no connection tracking during this process.

3) Use a timer and set of flags to detect the guild synchronization process and only turn on heartbeats after it is complete. This is a more complex solution and technically still suffers from the drawback in solution 2.

One other issue is that you cannot turn heartbeats off completely or you will get read errors from the receiver which also trigger a reconnect. I haven't had the time to trace it down, but I'm assuming that Discord is dropping the websocket connection if no traffic is sent periodically. This means that even if heartbeat ACKs are ignored, heartbeat requests still need to be sent to prevent this error from occurring.

So I'm kind of at a loss as to what a proper solution should be. I'd be happy to post a PR for this if something reasonable could be suggested. I thought for now the first step would be to report it and go from there.

andersfylling commented 3 years ago

I think I'm going to push for the new gateway code instead of trying to patch this up, hoping that will clean up any issues: