Closed schnapster closed 7 years ago
Resuming does not use up the rate limit FYI. Jake explicitly pointed that out
JDA does both, resuming and a full login. They told me the full thing happens when theres a (Re)Loading
logged, and so far logs are also right about - when we are restarting shards, and JDA (Re)loads
some of them too, we run into the discord rate limit.
Here's another log from today:
Watchdog starts a few revive tasks, then while those are going JDA (Re)loads
a few shards and ratelimits happen:
[05:22:13] [ WARN ] [ShardWatchdogAgent] Reviving shard [26 / 309] after 60 seconds of no events. Last event received was UserOnlineStatusUpdateEvent
[05:22:13] [ INFO ] [FredBoatBot] Reviving shard 25
[05:22:13] [ WARN ] [ShardWatchdogAgent] Reviving shard [28 / 309] after 60 seconds of no events. Last event received was UserOnlineStatusUpdateEvent
[05:22:13] [ INFO ] [FredBoatBot] Reviving shard 26
[05:22:13] [ WARN ] [ShardWatchdogAgent] Reviving shard [29 / 309] after 60 seconds of no events. Last event received was MessageEmbedEvent
[05:22:13] [ INFO ] [FredBoatBot] Reviving shard 28
[05:22:13] [ WARN ] [ShardWatchdogAgent] Reviving shard [30 / 309] after 60 seconds of no events. Last event received was UserOnlineStatusUpdateEvent
[05:22:13] [ INFO ] [FredBoatBot] Reviving shard 29
[05:22:13] [ WARN ] [ShardWatchdogAgent] Reviving shard [31 / 309] after 60 seconds of no events. Last event received was UserGameUpdateEvent
[05:22:13] [ INFO ] [FredBoatBot] Reviving shard 30
[05:22:13] [ INFO ] [FredBoatBot] Reviving shard 31
[05:22:14] [ WARN ] [ShardWatchdogAgent] Reviving shard [186 / 309] after 61 seconds of no events. Last event received was UserGameUpdateEvent
[05:22:14] [ INFO ] [FredBoatBot] Reviving shard 186
[05:22:14] [ WARN ] [ShardWatchdogAgent] Reviving shard [198 / 309] after 61 seconds of no events. Last event received was UserOnlineStatusUpdateEvent
[05:22:14] [ INFO ] [FredBoatBot] Reviving shard 198
[05:22:17] [ INFO ] [JDA] Finished (Re)Loading!
[05:22:18] [ INFO ] [FredBoat] Coin for shard 186
[05:22:18] [ INFO ] [JDA] Finished (Re)Loading!
[05:22:20] [ INFO ] [JDA] Finished Loading!
[05:22:22] [ INFO ] [JDA] Finished (Re)Loading!
[05:22:22] [ INFO ] [JDA] Finished Loading!
[05:22:23] [ INFO ] [FredBoat] Coin for shard 5
[05:22:28] [ INFO ] [JDA] Finished Loading!
[05:22:28] [ INFO ] [FredBoat] Coin for shard 29
[05:22:30] [ INFO ] [JDA] Finished Loading!
[05:22:34] [ INFO ] [FredBoat] Coin for shard 26
[05:22:35] [ ERROR] [FredBoatBot] Got rate limited while building bot JDA instance! Retrying...
[05:22:35] [ INFO ] [JDA] Finished Loading!
[05:22:39] [ INFO ] [FredBoat] Coin for shard 8
[05:22:40] [ ERROR] [FredBoatBot] Got rate limited while building bot JDA instance! Retrying...
[05:22:41] [ INFO ] [JDA] Finished (Re)Loading!
[05:22:44] [ INFO ] [FredBoat] Coin for shard 216
[05:22:47] [ INFO ] [JDA] Finished (Re)Loading!
[05:22:49] [ INFO ] [FredBoat] Coin for shard 30
[05:22:50] [ ERROR] [FredBoatBot] Got rate limited while building bot JDA instance! Retrying...
[05:22:52] [ INFO ] [JDA] Finished (Re)Loading!
It's better (probably by a lot and good enough), but not fixed entirely. If a bunch of shards were to be manually revived while jda is (re)loading too, then it may hit the rate limit.
So the coin system works ok, but JDA itself does some reconnecting too which uses up the rates.
Find a way to work together with JDA (the hardcore option would be to turn off the reconnects and handle them fully ourselves). We might also discuss whether the original issue what we added the watchdog for has been resolved (or maybe a misunderstanding on our side about a lot of shards taking up a lot of time to get revived by JDA?).
Below is a problematic log from today showing how both JDA and our own system try to start shards resulting in rate limit hits (I think I grepped all relevant actors there): https://hastebin.com/fayanayaze.css JDA uses up a login when it logs
(Re)Loading
The first batch (at 10:16 log time) was triggered by a stop the world garbage collection as there are a few seconds missing in the log:and there is a corresponding STW event in the gc log. So the Watchdog started killing the shards from shard 0 onwards, but stopped killing them as they started getting events again.
The second one appears to be a network issue where some of the shards could be resumed by JDA without a login, some were resumed using up a log in, and some were killed by the Watchdog. The problem here probably is that reviving so many shards takes up so much time for JDA (who appears to rate limits logins itself, probably by adhering to what discord reports back as left ratelimits when you make a request), that the shards get picked up as inactive by the watchdog.
Solutions to explore: