cronokirby / alchemy

A discord library for Elixir
MIT License

Events stop being received randomly until restart #62

Closed OvermindDL1 closed 4 years ago

OvermindDL1 commented 6 years ago

Every once in a while, seemingly at random (it could be after a few hours of uptime or after a few weeks), events stop being passed to my callbacks/cogs. I can still send data to Discord and it appears in the channels fine, but commands stop being called and the on-message hooks get no callbacks; nothing at all arrives until I restart. I'm not finding any exceptions in the logs either.

OvermindDL1 commented 5 years ago

Are there any updates to this project? I keep running into this issue on long-running bots.

cronokirby commented 5 years ago

It's maintained, and I'm spending some time trying to bring things up to speed with the newer versions of libraries, so there'll be a minor release in the upcoming week or so.

That being said, with the long running issues you seem to be having, it's difficult to try and solve them without more information... I'm not really sure where to even start on these because the problems seem hard to reproduce.

OvermindDL1 commented 5 years ago

> so there'll be a minor release in the upcoming week or so.

Ooo nice. My main issue right now is that queries to map a name to an ID number (so I can map a @name to <@123456789>) keep dying. That happens after a few weeks of uptime, though my own cache seems to deal with it well enough that people don't notice anymore. The main issue is when Discord messages just stop arriving altogether: I no longer receive messages, yet I can still 'send' messages and commands fine. That seems to happen randomly, anywhere from a few days to a month in. I work around it with an admin-only command that kills and reloads the Discord supervision tree in full, but that's still not at all clean.

> I'm not really sure where to even start on these because the problems seem hard to reproduce.

Generally I'm quite good at debugging library related things, but everything I try on this seems to come down to some failure in the library<->discord communication, and I have no clue how to even really start there because the socket connection still seems active and all... A discord bug doesn't seem unlikely but the fact I'm not seeing it reported elsewhere either... :-/

cronokirby commented 5 years ago

Sending stuff and receiving stuff are independent: sending is just hitting a REST API, whereas receiving requires a websocket connection and is push-based on Discord's end, i.e. Discord sends you stuff whenever it happens. So if you stop receiving things after a certain point, it has to do with the websocket connection in some way. It's possible that the websocket library is at fault, but this would be hard to test. The problem with a bug like this is that you simply don't receive anything over the websocket after a certain point, for some reason...

OvermindDL1 commented 5 years ago

> The problem with a bug like this is that you simply don't receive anything over the websocket after a certain point, for some reason...

Which is what I've seen, and why I lean toward a Discord bug, but I'm not finding reports of it elsewhere online (perhaps people just don't have bots up as long as I do?).

Is there some way to 'ping' the websocket connection to get a pong back on occasion to confirm it is still communicating, and if not then restart it?

cronokirby commented 5 years ago

https://github.com/cronokirby/alchemy/blob/master/lib/Discord/Gateway/gateway.ex#L46

Normally this is actually part of the websocket connection, which is perplexing. You have to send heartbeats back to Discord at regular intervals, as well as whenever Discord pings you on their side. Discord also acknowledges those heartbeats with heartbeat acks, see https://github.com/cronokirby/alchemy/blob/master/lib/Discord/Gateway/protocol.ex#L41

An interesting test would be to see whether Discord stops acknowledging our heartbeats across the gateway. You could test this by logging all the heartbeat acks in a longer-running bot.
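
As a rough illustration of that logging test, here is a minimal, language-neutral sketch in Python. It is not Alchemy's code: `AckLog` and `observe` are hypothetical names, and it assumes the gateway frames arrive as uncompressed JSON text. The only fact taken from Discord's gateway documentation is that a heartbeat ack is a payload with opcode 11.

```python
import json
import logging
import time

# Opcode for Heartbeat ACK, per Discord's gateway documentation.
OP_HEARTBEAT_ACK = 11

logger = logging.getLogger("gateway.heartbeat")


class AckLog:
    """Hypothetical helper: records when heartbeat acks arrive."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock      # injectable so the gap logic can be tested
        self.last_ack = None    # clock reading of the most recent ack
        self.count = 0          # number of acks seen so far

    def observe(self, raw_frame):
        """Inspect one raw JSON gateway frame; return True if it was an ack."""
        payload = json.loads(raw_frame)
        if payload.get("op") != OP_HEARTBEAT_ACK:
            return False
        now = self.clock()
        gap = None if self.last_ack is None else now - self.last_ack
        self.last_ack = now
        self.count += 1
        logger.info("heartbeat ack #%d (gap since previous: %s)", self.count, gap)
        return True
```

If events stop arriving while these log lines keep appearing at the usual interval, the connection is still alive and the problem is elsewhere; if the log lines stop too, the connection itself has gone quiet.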

Atm there's no logic to terminate the websocket connection if we don't receive a heartbeat ack in a timely fashion, but we could include that. That's something we should look into after confirming that your issue involves not receiving acks after a while.
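
For reference, the missing logic could look roughly like the sketch below, again in Python rather than Elixir and not taken from Alchemy. It follows the rule in Discord's gateway documentation: if no heartbeat ack arrives between two heartbeat attempts, treat the connection as failed ("zombied"), close it, and reconnect. `HeartbeatWatchdog` and its method names are hypothetical; actually closing and reopening the websocket is left to the caller.

```python
class HeartbeatWatchdog:
    """Hypothetical helper: tracks whether the last heartbeat was acked."""

    def __init__(self):
        # No heartbeat is outstanding yet, so start in the "acked" state.
        self.acked = True

    def on_ack(self):
        """Call when an opcode 11 (Heartbeat ACK) payload arrives."""
        self.acked = True

    def on_heartbeat_due(self):
        """Call each time the heartbeat interval elapses.

        Returns "send" if the next heartbeat should be sent, or
        "reconnect" if the previous heartbeat was never acked.
        """
        if not self.acked:
            return "reconnect"
        # Mark the heartbeat we are about to send as outstanding.
        self.acked = False
        return "send"
```

On "reconnect" the caller would drop the socket, open a new one, and start over with a fresh watchdog. Checking only at heartbeat time keeps the logic to a single flag and needs no extra timer.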

OvermindDL1 commented 5 years ago

> An interesting test to do would be to see if discord stops heartbeating us across the gateway, you could test this by logging all the heartbeat acks in a longer running bot.

Hmm, next time it happens I'll try to remember to hotswap in a logging line.

OvermindDL1 commented 4 years ago

I haven't seen it happen in a couple months now, and the code around it is unchanged... So maybe resolved? Could have just been discord themselves.

cronokirby commented 4 years ago

Closed for now, let's reopen if it pops up