Open bkuhl opened 4 years ago
I'm not very familiar with golang, so I'm not going to submit a PR. But in Discord API's Discord server Dr. Abal
had some input:
Dr. Abal: 99.99% sure your issue is https://github.com/bwmarrin/discordgo/blob/b3cc83b/wsapi.go#L860 needs to be FormatCloseMessage(4000, "")
for OP7
Dr. Abal: the Close()
call at https://github.com/bwmarrin/discordgo/blob/b3cc83b/wsapi.go#L501
Does the 4000
value come from Gateway Close Event Codes defined by the Discord API?
If I am reading this right, I believe that 4000
event code would be something we receive from Discord, but the discordgo
code reads as though we're sending Discord a close message. Additionally, the close code that discordgo
is sending is a standard defined in RFC 6455, section 11.7.
The discordgo
code from the more up-to-date develop
branch:
Close
: https://github.com/bwmarrin/discordgo/blob/develop/wsapi.go#L865onEvent
: https://github.com/bwmarrin/discordgo/blob/develop/wsapi.go#L519The close code that discordgo
is sending:
I should admit that I have little experience with implementing websockets personally, but this looked like an interesting issue to dig into.
Also to note, I run a sharded bot that didn't seem to have this issue so far (only 10 shards though).
Yeah I am suspicious about about the close code, rather it's the issue or not. One way to test would be for someone to change it to 4000
on their bot and see if they continue to have an issue.
There's 3 situations here regarding closing/reconnecting, only 2 of them would the Close func be called.
A network connection error occurs, a discord server crashes, or something in the middle crashes, and the websocket is closed unexpectedly. In this case DiscordGo would not send a close packet at all. It would just try to reconnect.
Discord sends an Op9 asking us to reconnect. In this case, DiscordGo sends a close packet with a normal close status. There's no error condition, so why would we send an Unknown error
(4000) status? Maybe we don't even need to send a close packet in this case?
Bot is shutting down (CTRL-C, or otherwise) and has a call to Close() function. In this case there's no error and we should send a normal close status.
I changed the error code to 4000
on my bot with 100+ shards and it resolves my issue with resetting bot token. Right now the bot is up and running for 35 hours. Before the change, it was less than 24 hours.
You can find my changes here I can open PR if you want: https://github.com/lukasz-horonziak/discordgo/commit/1afd34e862b65a234b624b651436a68b09fd3226
@lukasz-horonziak thanks for the feedback, get us posted on how it goes over the next couple days.
Can I check if the fallout of this issue looks like this in the logging:
2020/04/09 01:20:12 [DG0] voice.go:357:wsListen() voice endpoint eu-central211.discord.media:80 websocket closed unexpectantly, websocket: close 4014: Disconnected.
2020/04/09 01:20:12 [DG0] voice.go:197:Close() error closing websocket, websocket: close sent
2020/04/09 01:20:13 [DG0] voice.go:206:Close() error closing websocket, write tcp 192.168.0.34:58593->162.159.130.235:443: i/o timeout
Recently I've been getting this on my bot which uses long running voice sessions, currently resulting in the bot hard failing since there doesn't appear to be any recovery/reconnect path in play.
That close 4014
looks interesting to me and it may be defined here:
Code | Description | Explanation |
---|---|---|
4014 | Disallowed intent(s) | You sent a disallowed intent for a Gateway Intent. You may have tried to specify an intent that you have not enabled or are not whitelisted for. |
Does that make sense in your case?
I'm not sure it does, the bot isn't doing anything special at the time.
It's outputting to the voice channel and sending/updating a message now and then to a single text channel. All of this works for between 20 minutes or several hours, but eventually ends with those messages logged, the bot leaving the voice channel suddenly and (I think) a broken session. It's hard to pin down, but seems to have started fairly recently based on user feedback. (Could be a false correlation, the user in question may have only started running it long term recently).
The code 4014
implies the bot attempted to do something that it's not allowed to based on it's configured intent, but I've not coded it do a lot outside of the above mentioned usage, which works (until it doesn't).
I'll be honest though, I don't know a lot about the discord API internals, I've largely just used discordgo
for a pet project and not really looked into how it works in much depth.
So I managed to replicate the issue, which may be related to this topic or not - if not I'll move to a new issue. After about 2 hours of the bot running with debug logging enabled, I got the following output:
2020/04/17 12:19:15 [DG3] wsapi.go:548:onEvent() got heartbeat ACK
2020/04/17 12:19:56 [DG3] wsapi.go:295:heartbeat() sending gateway websocket heartbeat seq 77
2020/04/17 12:19:56 [DG3] wsapi.go:498:onEvent() Op: 11, Seq: 0, Type: , Data: null
2020/04/17 12:19:56 [DG3] wsapi.go:548:onEvent() got heartbeat ACK
2020/04/17 12:20:37 [DG3] wsapi.go:295:heartbeat() sending gateway websocket heartbeat seq 77
2020/04/17 12:20:37 [DG3] wsapi.go:498:onEvent() Op: 11, Seq: 0, Type: , Data: null
2020/04/17 12:20:37 [DG3] wsapi.go:548:onEvent() got heartbeat ACK
2020/04/17 12:21:18 [DG3] wsapi.go:295:heartbeat() sending gateway websocket heartbeat seq 77
2020/04/17 12:21:20 [DG1] wsapi.go:229:listen() error reading from gateway wss://gateway.discord.gg/?v=6&encoding=json websocket, websocket: close 1006 (abnormal closure): unexpected EOF
2020/04/17 12:21:20 [DG2] wsapi.go:845:Close() called
2020/04/17 12:21:20 [DG2] wsapi.go:851:Close() closing listening channel
2020/04/17 12:21:20 [DG2] wsapi.go:861:Close() sending close frame
2020/04/17 12:21:21 [DG2] wsapi.go:874:Close() closing gateway websocket
2020/04/17 12:21:21 [DG2] wsapi.go:885:Close() emit disconnect event
2020/04/17 12:21:21 [DG2] wsapi.go:237:listen() calling reconnect() now
2020/04/17 12:21:21 [DG2] wsapi.go:845:Close() called
2020/04/17 12:21:21 [DG2] wsapi.go:885:Close() emit disconnect event
2020/04/17 12:21:21 [DG2] wsapi.go:845:Close() called
2020/04/17 12:21:21 [DG2] wsapi.go:885:Close() emit disconnect event
2020/04/17 12:21:21 [DG2] wsapi.go:845:Close() called
...
Note that the repeated lines continued around 580,000 times until I killed the bot having spotted my log file rapidly getting bigger.
I didn't see the 4014
error, but that is perhaps because I killed the bot, perhaps if I left it hammer away with whatever it was doing it would eventually hit 4014
having spammed the gateway too much?
Any thoughts on this? It's killing my streaming bots since they have only 30 minutes to a couple of hours of uptime before Discord triggers something that makes it enter this disconnect loop. I tried debugging it but since I have to wait up to a few hours before it triggers I sort of left it in the end.
@drunlade I'd suggest posting in the DiscordAPI Discord. There's a channel there for this library where you might be able to get more feedback/input on this.
@bkuhl I'll give that a go, thanks
The PR https://github.com/bwmarrin/discordgo/pull/761 was merged to fix this issue. Are you on the latest dev version @drunlade ? If not, first try updating to that and see if you're still having any problems. If you are - then let us know so we can revisit it.
@bwmarrin I did pull the latest develop when I saw that PR merge and rebuild with it, and I still had the issue, but I'll go back and make really sure I'm building with latest develop, just in case I managed to pull something that was outside of my build path or something stupid. Thanks.
@bwmarrin Well I grabbed the latest a bit back and built my project with it and still suffers the same issue - after 20 min or so the bot drops from the voice channel and goes into the infinite reconnect cycle. I'll have to create a mini project to replicate and share here I guess.
@bwmarrin Ok, so I literally stripped everything out of my bot untilI was left basically nothing much - honestly it may as well be a copy of the example bot from your own documentation!
Anyway, this bot will join the first voice channel it sees that has a user in and just sit there.
The result after a few (roughly 4 in this example, but the time varies) minutes:
I0825 17:31:43.118346 7804 main.go:20] Bot starting up
I0825 17:31:43.712170 7804 main.go:37] Bot is now running. Press CTRL-C to exit.
I0825 17:31:43.712170 7804 bot.go:69] Joining guild: dBotTestServer
I0825 17:31:43.712666 7804 bot.go:73] Joining voice channel
I0825 17:31:44.713471 7804 bot.go:88] Joined: 622758820153655310
2020/08/25 17:35:48 [DG0] wsapi.go:302:heartbeat() error sending heartbeat to gateway wss://gateway.discord.gg/?v=6&encoding=json, websocket: close sent
And one of my CPU cores getting maxed out 100%.
Could you try running with debug logging on (set your session's LogLevel
to LogDebug
)? The snippet you posted doesn't seem to really contain anything going wrong; a failure to send a heartbeat is recoverable and the recovery process should be detailed in the debug logging.
Additionally, I think you've caused an infinite loop with your Disconnect
handler. A heartbeat failure calls Close
, which calls CloseWithCode
, which emits a disconnect event; your handler then receives that event and calls Close
, which emits a disconnect event, which calls your handler...and so on.
@CarsonHoffman ... Oh well đŸ’© so I have, if it turns out that's the cause of the issue I'm going to be very annoyed with my self, since it's actually kinda obvious :(
The Problem Changes to Discord's socket handling this past Thursday caused any bots using this package at scale (may apply to lower usage as well, I'm not sure 100+) get their bot tokens forcibly reset by Discord due to too many authentication requests.
What Happened & The Fix This past Thursday, Discord rolled out a change to their API where shards should be expected to reconnect to the gateway in response to an event to enable the ability for Discord to more gracefully handling transferring active sessions to new hosts and such.
Here's the announcement from Discord:
Dont panic
Discord is rolling out a software update to our gateways. We anticipate no issues, and have had a canary of the new code running for a while now with no issue. This update introduces only one user visible change.
Your gateway connection will regularly be asked to reconnect using opcode: https://discordapp.com/developers/docs/topics/gateway#reconnect
You can simply reconnect and resume as you would tolerate the gateway connection being interrupted via other causes. We are doing this to allow for organic rebalancing of long-lived connections between nodes as nodes eventually start to auto-scale in/out.