andersfylling / disgord

Go module for interacting with Discord's documented bot interface: Gateway, REST requests, and voice
BSD 3-Clause "New" or "Revised" License

LFUImmutable Cache deadlocks on CHANNEL_UPDATE #410

Closed · pravus closed this issue 3 years ago

pravus commented 3 years ago

Describe the bug
It appears that the default LFUImmutable Cache driver takes multiple locks when performing a channel update, blocking the event queue. Heartbeat events are then missed, which eventually triggers a reconnect loop similar to the issue described in https://github.com/andersfylling/disgord/issues/390.

Expected behavior
LFUImmutable cache updates should never block the event queue, and events received from Discord should continue to be processed normally.

Error messages
Similar to the issue linked above, here are the Disgord log messages from when I was able to detect the lock condition:

DEBU[0041] [ws-e,s:13,shard:0]heartbeat ACK ok          
DEBU[0041] [ws-e,s:14,shard:0]sent heartbeat            
INFO[0083] [ws-e,s:15,shard:0]heartbeat ACK was not received, forcing reconnect 
DEBU[0083] [ws-e,s:16,shard:0]stopping pulse            
DEBU[0083] [ws-e,s:17,shard:0]is reconnecting           
DEBU[0083] [ws-e,s:18,shard:0]closing emitter           
DEBU[0083] [ws-e,s:19,shard:0]closing receiver          
INFO[0083] [ws-e,s:20,shard:0]disconnected              
DEBU[0083] [ws-e,s:21,shard:0]trying to connect         
DEBU[0083] [shardSync]shard0is waiting to identify      
DEBU[0083] [ws-e,s:22,shard:0]waiting to send identify/resume 
DEBU[0083] [ws-e,s:23,shard:0]starting emitter          
DEBU[0083] [ws-e,s:24,shard:0]Ready to receive operation codes... 
DEBU[0083] [ws-e,s:25,shard:0]starting receiver         
DEBU[0138] [ws-e,s:26,shard:0]closing receiver after read error 
ERRO[0263] [ws-e,s:27,shard:0]discord timeout during connect (3 minutes). No idea what went wrong.. 
DEBU[0263] [shardSync]shard0waited and finished execution after3m0.23654129s 
ERRO[0263] [ws-e,s:28,shard:0]establishing connection failed: websocket connected but was not able to send identify packet within 3 minutes 
INFO[0263] [ws-e,s:29,shard:0]next connection attempt in 3s 
DEBU[0266] [ws-e,s:30,shard:0]trying to connect         
ERRO[0266] [ws-e,s:31,shard:0]establishing connection failed: cannot Connect while a connection already exist 
INFO[0266] [ws-e,s:32,shard:0]next connection attempt in 7s 
DEBU[0273] [ws-e,s:33,shard:0]trying to connect         
ERRO[0273] [ws-e,s:34,shard:0]establishing connection failed: cannot Connect while a connection already exist 
INFO[0273] [ws-e,s:35,shard:0]next connection attempt in 11s 


Additional context
I was able to debug this by adding tracing statements to various parts of the client. What appears to happen is this: Discord sends a CHANNEL_UPDATE packet, which passes through operationHandlers() in internal/gateway/client.go; via the action dispatch, this calls onDiscordEvent() in internal/gateway/eventclient.go. That method sends an Event struct over c.eventChan, which is read by the demultiplexer() method in reactor.go. Once an event is read, it is passed to cacheDispatcher(), which invokes the appropriate type handler for the event on the Cache implementation.
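To make the failure mode concrete, below is a minimal, self-contained sketch of why a blocking cache handler on this path stalls everything behind it. All names here are assumptions for illustration, not disgord's actual internals: a single goroutine drains the event channel, so one stuck dispatch starves every later event, including heartbeat handling.

```go
package main

import (
	"fmt"
	"time"
)

type event struct{ name string }

// demultiplexer mimics a single-goroutine event loop: it reads one
// event at a time and dispatches it synchronously. If the dispatch
// blocks, no further events are ever read from the channel.
func demultiplexer(eventChan <-chan event, cacheDispatch func(event)) {
	for evt := range eventChan {
		cacheDispatch(evt) // a deadlock here stops all further reads
	}
}

func main() {
	eventChan := make(chan event)
	go demultiplexer(eventChan, func(evt event) {
		if evt.name == "CHANNEL_UPDATE" {
			select {} // stand-in for the cache deadlock
		}
		fmt.Println("handled", evt.name)
	})

	eventChan <- event{"CHANNEL_UPDATE"} // dispatcher now blocks forever
	select {
	case eventChan <- event{"HEARTBEAT_ACK"}:
		fmt.Println("heartbeat processed")
	case <-time.After(time.Second):
		fmt.Println("event queue is stuck; the heartbeat is never processed")
	}
}
```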

In the ChannelUpdate() method defined for CacheLFUImmutable there is a closure named updateChannel which takes a lock on c.Channels for a given channel ID. If the channel already exists in the cache, this causes a deadlock, because another lock has already been acquired before the closure is called.
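For illustration, here is a minimal sketch of that double-lock pattern. It is not the actual disgord code (the struct and field names are assumptions), but it shows why the pattern hangs: Go's sync.Mutex is not reentrant, so a second Lock() on a mutex the goroutine already holds never returns.

```go
package main

import "sync"

// cache is a stand-in for CacheLFUImmutable: one lock guarding a
// channel map. The real cache is more elaborate, but the locking
// pattern is the same.
type cache struct {
	mu       sync.Mutex
	channels map[uint64]string
}

func (c *cache) ChannelUpdate(id uint64, name string) {
	// The closure takes the lock itself, mirroring updateChannel.
	updateChannel := func() {
		c.mu.Lock() // second acquisition: blocks forever
		defer c.mu.Unlock()
		c.channels[id] = name
	}

	c.mu.Lock() // first acquisition, e.g. to check for an existing entry
	defer c.mu.Unlock()
	if _, exists := c.channels[id]; exists {
		updateChannel() // deadlock: sync.Mutex is not reentrant
	}
}

func main() {
	c := &cache{channels: map[uint64]string{1: "general"}}
	c.ChannelUpdate(1, "renamed") // Go's runtime reports a fatal deadlock
}
```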

I have found two workarounds for this:

1) Specify a CacheNop struct as the Cache handler when constructing a client instance (see the sketch after this list).

2) Remove the mutex from the updateChannel closure in ChannelUpdate() in the CacheLFUImmutable implementation. While this change fixes my specific issue, I am unsure whether it is a proper solution.
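As a rough sketch of workaround 1, something like the following constructs a client that never touches a cache lock. The Cache field name in disgord.Config is an assumption here and may differ between disgord versions, so verify it against the documentation for your release:

```go
package main

import (
	"os"

	"github.com/andersfylling/disgord"
)

func main() {
	// CacheNop discards all cache writes, so no cache lock can ever
	// block the event queue. The Cache field name is assumed; check
	// your disgord version's Config struct.
	client := disgord.New(disgord.Config{
		BotToken: os.Getenv("DISCORD_BOT_TOKEN"),
		Cache:    &disgord.CacheNop{},
	})
	_ = client // connect and register handlers as usual
}
```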

andersfylling commented 3 years ago

Awesome bug report! I have completely removed the LFUImmutable cache and replaced it with a much simpler cache system as the default. The new cache has over 70% test coverage and can be found in the develop branch. I'm going to make a release right now, but be aware that it's a breaking upgrade.

andersfylling commented 3 years ago

Made a release. Feel free to ask any questions if you have migration issues.

andersfylling commented 3 years ago

@pravus just a checkup, did this change work for you?

pravus commented 3 years ago

Hi, @andersfylling.

Yes, it does appear to be working. I bumped my project to the new release, and the problem I was seeing is now gone. I know you said it was a potentially breaking release, but I didn't have to change anything other than the version of this module. Thanks for getting this out so quickly!