discord-net / Discord.Net

An unofficial .Net wrapper for the Discord API (https://discord.com/)
https://discordnet.dev
MIT License
3.33k stars 736 forks source link

Discord.Net doesn't automatically reconnect after network disconnect #960

Closed JustArchi closed 2 years ago

JustArchi commented 6 years ago
2018-02-22 14:06:21|INFO|ArchiBoT|OnLog() Discord | Discord.Net v2.0.0-beta (API v6)
(...)
2018-02-22 15:26:11|WARN|ArchiBoT|OnLog() Gateway |
2018-02-22 15:26:11|WARN|ArchiBoT|OnLog() System.Exception: Server missed last heartbeat
   at Discord.ConnectionManager.<>c__DisplayClass28_0.<<StartAsync>b__0>d.MoveNext()
2018-02-22 15:26:11|INFO|ArchiBoT|OnLog() Gateway | Disconnecting
2018-02-22 15:26:12|INFO|ArchiBoT|OnLog() Gateway | Disconnected
2018-02-22 15:26:13|INFO|ArchiBoT|OnLog() Gateway | Connecting
2018-02-22 15:27:58|INFO|ArchiBoT|OnLog() Gateway | Disconnecting
2018-02-22 15:27:58|INFO|ArchiBoT|OnLog() Gateway | Disconnected

Library properly attempted to reconnect after missing last heartbeat, but it seems like it failed with timeout after those ~45 seconds and decided to stop further attempts. According to https://github.com/RogueException/Discord.Net/commit/73ac9d7886aa48b9d809c56e51945056f3b67232 - similar issue should be already solved for 2.0.0-beta, but it seems that this issue is still recent in one form or another.

It should definitely keep on trying until it succeeds, unless this is somehow intended (in this case how we should handle it ourselves?)

Thank you in advance.

RheaAyase commented 6 years ago

Not the first type of not-reconnecting issue... I just kill the whole application and restart it on service level when it disconnects to prevent the library ever failing to reconnect... :woman_shrugging:

JustArchi commented 6 years ago

It'd be really nice if those core issues could be solved for good. Doing restarts on application level completely defeats the purpose of using a library in the first place.

FiniteReality commented 6 years ago

Handling all the edge cases in the reconnect logic is hard - especially when discord outages are concerned, so I doubt we'll ever get all of them. This is definitely something we'll look into though.

foxbot commented 6 years ago

Regarding above, neither myself nor the other core contributors have ever been able to reproduce these connection deadlocks (with the exception of the discord outage deadlock), which makes it very difficult to diagnose.

I've looked over the ConnectionManager code myself numerous times and everything checks out to me - that being said, multiple eyes never hurts.

Unless you can create a set of circumstances that makes this issue reproducible in our environments, or identify where in the library the deadlock is sourcing from, there isn't really much more we can do about this issue until one of us finally encounters it.

JustArchi commented 6 years ago

I understand that and appreciate your time spent on this, since I guess everybody here knows how hard it is to debug an issue that you can't possibly reproduce in your own environment. Sadly there is nothing much I can help with personally, as I've never looked into discord.net code until now, so I wouldn't even know where to start looking.

Unless you can create a set of circumstances that makes this issue reproducible in our environments

Well, sometimes stupid ideas are the best ones. Have you tried to unplug the network cable and plug it back in a few times in various periods to check if library always reconnects properly? 😀

foxbot commented 6 years ago

I tried killing my local connection three times, and all three times the client attempted to reconnect, recognized a WinHttpException, and started exponential backoff.

For reference, what OS and .NET Core version are you running?

JustArchi commented 6 years ago

Running latest nightly on Debian Testing right now, but that issue also happened on 2.0 stable and all sort of other versions I've been testing to ensure it's not my own fault.

.NET Command Line Tools (2.1.300-preview2-008293)

Product Information:
 Version:            2.1.300-preview2-008293
 Commit SHA-1 hash:  68922e2a51

Runtime Environment:
 OS Name:     debian
 OS Version:
 OS Platform: Linux
 RID:         debian-x64
 Base Path:   /opt/dotnet/sdk/2.1.300-preview2-008293/

Host (useful for support):
  Version: 2.1.0-preview2-26227-01
  Commit:  86d1f92013

.NET Core SDKs installed:
  2.1.300-preview2-008293 [/opt/dotnet/sdk]

The.NET Core runtimes installed:
  Microsoft.AspNetCore.App 2.1.0-preview2-30171 [/opt/dotnet/shared]
  Microsoft.NETCore.App 2.1.0-preview2-26227-01 [/opt/dotnet/shared]
  Microsoft.AspNetCore.All 2.1.0-preview2-30171 [/opt/dotnet/shared]
vaindil commented 6 years ago

I get this problem every few days as well. I've been on the 2.0 beta, most recently on 2.0.0-beta2-00890. This is the usual stack trace:

System.Exception: WebSocket connection was closed ---> System.Net.WebSockets.WebSocketException: The remote party closed the WebSocket connection without completing the close handshake. ---> System.Net.WebSockets.WebSocketException: The remote party closed the WebSocket connection without completing the close handshake.
   at System.Net.WebSockets.ManagedWebSocket.<EnsureBufferContainsAsync>d__70.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Net.WebSockets.ManagedWebSocket.<ReceiveAsyncPrivate>d__61.MoveNext()
   --- End of inner exception stack trace ---
   at System.Net.WebSockets.ManagedWebSocket.<ReceiveAsyncPrivate>d__61.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Discord.Net.WebSockets.DefaultWebSocketClient.<RunAsync>d__33.MoveNext()
   --- End of inner exception stack trace ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Discord.ConnectionManager.<>c__DisplayClass28_0.<<StartAsync>b__0>d.MoveNext()
WorkingRobot commented 6 years ago

I can reproduce.

System.Exception: WebSocket connection was closed ---> System.Net.WebSockets.WebSocketException: The remote party closed the WebSocket connection without completing the close handshake.
   at System.Net.WebSockets.WebSocketBase.WebSocketOperation.<Process>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Net.WebSockets.WebSocketBase.<ReceiveAsyncCore>d__45.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Discord.Net.WebSockets.DefaultWebSocketClient.<RunAsync>d__33.MoveNext()
   --- End of inner exception stack trace ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Discord.ConnectionManager.<>c__DisplayClass28_0.<<StartAsync>b__0>d.MoveNext()

This happens once in a while, but it reconnects after 1-2 seconds.

vaindil commented 6 years ago

I've started getting this stack trace, and the bot does not reconnect when it happens.

Gateway: System.Exception: Server missed last heartbeat
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Discord.ConnectionManager.<>c__DisplayClass28_0.<<StartAsync>b__0>d.MoveNext()
RheaAyase commented 6 years ago

Server missed last heartbeat

The last time I didn't get this one a few times a day was 0.9x ;)

z0Kng commented 6 years ago

I can reproduce too:

01:56:09,735 [49] INFO - 01:56:09 Gateway     Failed to resume previous session
01:56:09,735 [49] INFO - 01:56:09 Gateway     Disconnecting
01:56:09,750 [87] INFO - 01:56:09 Gateway     Disconnected

nothing happend after Disconnected. I will try to get more information about the disconnect.

MrRockboy222 commented 6 years ago

I'm also experiencing this problem. Disconnect reasons can vary. Sometimes I get System.Exception: Server missed last heartbeat And the bot just disconnects with no attempt to reconnect.

Sometimes a WebSocketException occurs, which causes the bot to disconnect, then reconnect. However, sometimes if the bot can't resume the previous session, it'll just disconnect. The log for that is similar to the above comment

Disconnects are very inconsistent. Sometimes, the bot will successfully recover, sometimes it won't. My bot only runs on one server at the moment. I've noticed that it will disconnect sometimes if someone is using many commands, with a small time gap in between. I believe this has always been due to the heartbeat issue mentioned in the first paragraph

I'm running the bot on an AWS EC2 T2.Micro. The OS is a server version of Ubuntu. Here's some basic specs

Kernel: x86_64 Linux 4.4.0-1054-aws
Uptime: 3d 6h 6m
Packages: 523
Shell: bash 4.3.48
CPU: Intel Xeon CPU E5-2676 v3 @ 2.4GHz
RAM: 305MiB / 990MiB

I'd be happy to provide anymore information if needed

RheaAyase commented 6 years ago

Technical specification does not matter (neither does kernel,) if anything, it would be RID[1] which is a string that identifies the platform and .NET Core may behave differently... And .NET Core version. (Runtime version, versions of the different components are... different.)

[1] RID: https://docs.microsoft.com/en-us/dotnet/core/rid-catalog

MrRockboy222 commented 6 years ago

I ran the dotnet --info command, and it returned this:

.NET Command Line Tools (2.0.0)

Product Information:
 Version:            2.0.0
 Commit SHA-1 hash:  cdcd1928c9

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  16.04
 OS Platform: Linux
 RID:         ubuntu.16.04-x64
 Base Path:   /usr/share/dotnet/sdk/2.0.0/

Microsoft .NET Core Shared Framework Host

  Version  : 2.0.0
  Build    : e8b8861ac7faf042c87a5c2f9f2d04c98b69f28d

I am also using Discord.NET version 1.0.2

A couple of things to note here. Firstly, I need to update .NET Core. Not entirely sure why it wasn't updated in the first place, but I'll get on that. Secondly, there is a beta version of Discord.NET, which I will try out. I don't suspect doing both of these will help since other people have been reporting this issue with both updated. Maybe I'll get a nice surprise though.

RheaAyase commented 6 years ago

I need to update .NET Core. Not entirely sure why it wasn't updated in the first place, but I'll get on that.

If you install it from microsoft's website it won't be updated.

z0Kng commented 6 years ago

I got the issue again. This time, i logged the Disconnect event:

07:37:38,029 [15] INFO - 07:37:38 Gateway     Failed to resume previous session
07:37:38,029 [15] INFO - 07:37:38 Gateway     Disconnecting
07:37:38,031 [76] INFO - 07:37:38 Gateway     Disconnected
07:37:38,031 [76] ERROR - DiscordClient lost Connection: A task was canceled.
Exeption: 

Maybe i should do the same as RheaAyase and completely restart it when it disconnects.

RheaAyase commented 6 years ago

Maybe i should do the same as RheaAyase an completely restart it when it disconnects.

It's still a hack, a workaround (an ugly one at that.) The bug remains.

z0Kng commented 6 years ago

It's still a hack, a workaround (an ugly one at that.) The bug remains.

I know it's not a "clean" solution, but better than if the bot doesn't run properly for half a day.

RheaAyase commented 6 years ago

Well it's quite a bit of an issue for bigger bots - restart takes a while.

MrRockboy222 commented 6 years ago

Well look what happened: System.Exception: Server missed last heartbeat One interesting thing that happened this time is that it logged the disconnecting message, but never logged the disconnected message. This happened after I used several commands in a short amount of time.

HunterGraubard commented 6 years ago

I've just started using Discord.Net in a project and have immediately noticed this issue. I have a bot that starts up and runs LoginAsync() and then StartAsync(). It connects and works fine. However, after some number of hours, I notice the bot has gone offline on Discord, and my console log displays Gateway: Disconnected. That's it. There are no other messages about a heartbeat, and it literally never, ever, ever attempts to reconnect. It always disconnects after a while, and it never reconnects.

When this happens, the bot stays offline on Discord, and no longer responds to PMs. However, curiously, it is still able to send automated messages to a channel via SendMessageAsync() despite being offline.

My only solution to get the bot fully online again is to restart it completely. This is quite annoying to have to do every single day. Sometimes I have to do it multiple times per day.

RheaAyase commented 6 years ago

@HunterGraubard run it as a service, and in the disconnected event simply Exit() - the service will restart it again... an awful workaround that will do the trick...

FiniteReality commented 6 years ago

There's a known deadlock somewhere in our code, but we're not sure where it is. I'm not sure if we'll be able to fix this bug before 3.0, unfortunately.

XtremeOwnageDotCom commented 6 years ago

Make sure its a well documented bug... Since, it is a extremely significant instability issue.

MrSapps commented 6 years ago

if its a deadlock and you get a process dump won't that help to debug the issue?

foxbot commented 5 years ago

Is this still an issue? As of the latest release, I'm able to disconnect the network and reconnect fine

21:17:22 Gateway     Connecting
21:17:23 Gateway     Connected
21:17:23 Gateway     Ready
21:18:23 Gateway     System.Exception: WebSocket connection was closed // ... stacktrace omitted for brevity
21:18:23 Gateway     Disconnecting
21:18:23 Gateway     Disconnected
21:18:24 Gateway     Connecting
21:18:34 Gateway     Resumed previous session
21:18:34 Gateway     Connected
XtremeOwnageDotCom commented 5 years ago

I will.... remove the code which forces a full bot restart, and give it a try.

NovusTheory commented 5 years ago

I still have issues with deadlocking.

Whatever the websocket closure was. It could be server missed heartbeat, closed normally, or anything else. It will just stay permanently disconnected.

What's strange is that my bot will say it's disconnecting from the gateway, disconnect all the servers. But it stays online as if it was still connected and heartbeating. Ofc it can't respond to commands though as it's "disconnected". I don't believe this is an issue with my code as I don't get those "Blocking gateway" messages.

I still say there is a problem somewhere in the librarie's code. Sometimes it disconnects and reconnects fine sometimes it doesn't. More often then not I'm restarting it manually almost consistently every other day.

4egod commented 5 years ago

Hi. I've updated to 2.0.1 and got this again ( 02.27-22:46:10 DiscordBot Gateway Latency = 266 ms 02.27-22:46:48 DiscordBot Gateway Received Dispatch (PRESENCE_UPDATE) 02.27-22:46:51 DiscordBot Gateway Received Dispatch (PRESENCE_UPDATE) 02.27-22:46:51 DiscordBot Gateway Sent Heartbeat 02.27-22:46:52 DiscordBot Gateway Received HeartbeatAck 02.27-22:46:52 DiscordBot Gateway Latency = 281 ms 02.27-22:47:05 DiscordBot Gateway Heartbeat Stopped 02.27-22:47:05 DiscordBot Gateway

at Discord.ConnectionManager.<>c__DisplayClass29_0.<b__0>d.MoveNext()

02.27-22:47:05 DiscordBot Gateway Disconnecting 02.27-22:47:05 DiscordBot Gateway Disconnecting ApiClient 02.27-22:47:05 DiscordBot Gateway Waiting for heartbeater 02.27-22:47:05 DiscordBot Gateway Waiting for guild downloader 02.27-22:47:05 DiscordBot Gateway Clearing large guild queue 02.27-22:47:05 DiscordBot Gateway Raising virtual GuildUnavailables 02.27-22:47:05 DiscordBot Gateway Disconnected from Bitsum Test 02.27-22:47:05 DiscordBot Gateway Disconnected from Bitsum 02.27-22:47:05 DiscordBot WebSocket connection was closed 02.27-22:47:05 DiscordBot at Discord.ConnectionManager.<>c__DisplayClass29_0.<b__0>d.MoveNext() 02.27-22:47:06 DiscordBot Rest GET voice/regions: 672.66 ms 02.27-22:47:08 DiscordBot Gateway A Disconnected handler is blocking the gateway task.

Aviat0r93 commented 5 years ago

We also have this problem since a few days now. The bot was running fine for a few months without any issues. Then the log file was too big (a few gigabytes). So we deleted the log file and started with a new one.

I don't know if this is the problem, but in other issues I read, that logging makes some problems.

This is what is written in our log file before the disconnect happened.

21/04/2019 05:10:43 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:44 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:45 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:46 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:46 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:46 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:48 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:48 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:48 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:49 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:49 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:49 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:50 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:51 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:51 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:51 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:52 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:53 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:53 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:54 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:54 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:55 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:55 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:55 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:56 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:56 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:10:58 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:11:00 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:11:01 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 05:11:01 : Heartbeat Stopped
21/04/2019 05:11:01 : 
21/04/2019 05:11:01 : Disconnecting
21/04/2019 05:11:01 : Disconnecting ApiClient
21/04/2019 05:11:01 : Waiting for heartbeater
21/04/2019 05:11:01 : Waiting for guild downloader
21/04/2019 05:11:01 : Clearing large guild queue
21/04/2019 05:11:01 : Raising virtual GuildUnavailables
21/04/2019 05:11:01 : Disconnected from Server
21/04/2019 05:11:01 : Disconnected
21/04/2019 05:11:02 : Connecting
21/04/2019 05:11:02 : Connecting ApiClient
21/04/2019 05:11:02 : GET gateway: 59,15 ms

The reconnect never happened like it was the case at 2.49am. However, there were a lot of those PRESENCE_UPDATE log entries. In an update of our bot I'm ignoring them now.

image

This is the part of the log when the disconnect happened at 2.49am:

21/04/2019 02:49:01 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:01 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:01 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:02 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:02 : Heartbeat Stopped
21/04/2019 02:49:02 : 
21/04/2019 02:49:02 : Disconnecting
21/04/2019 02:49:02 : Disconnecting ApiClient
21/04/2019 02:49:02 : Waiting for heartbeater
21/04/2019 02:49:02 : Waiting for guild downloader
21/04/2019 02:49:02 : Clearing large guild queue
21/04/2019 02:49:02 : Raising virtual GuildUnavailables
21/04/2019 02:49:02 : Disconnected from Server
21/04/2019 02:49:02 : Disconnected
21/04/2019 02:49:03 : Connecting
21/04/2019 02:49:03 : Connecting ApiClient
21/04/2019 02:49:03 : 
21/04/2019 02:49:03 : Disconnecting
21/04/2019 02:49:03 : Disconnecting ApiClient
21/04/2019 02:49:03 : Waiting for heartbeater
21/04/2019 02:49:03 : Waiting for guild downloader
21/04/2019 02:49:03 : Clearing large guild queue
21/04/2019 02:49:03 : Raising virtual GuildUnavailables
21/04/2019 02:49:03 : Disconnected
21/04/2019 02:49:05 : Connecting
21/04/2019 02:49:05 : Connecting ApiClient
21/04/2019 02:49:05 : GET gateway: 43,7 ms
21/04/2019 02:49:05 : Received Hello
21/04/2019 02:49:05 : Heartbeat Started
21/04/2019 02:49:05 : Resuming
21/04/2019 02:49:05 : Sent Heartbeat
21/04/2019 02:49:05 : Sent Resume
21/04/2019 02:49:06 : Received HeartbeatAck
21/04/2019 02:49:06 : Latency = 109 ms
21/04/2019 02:49:06 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:06 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:06 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:06 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:06 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:06 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:06 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:06 : Received Dispatch (RESUMED)
21/04/2019 02:49:06 : Connected to Server
21/04/2019 02:49:06 : Resumed previous session
21/04/2019 02:49:06 : Sending Status
21/04/2019 02:49:06 : Sent StatusUpdate
21/04/2019 02:49:06 : Connected
21/04/2019 02:49:06 : Raising Event
21/04/2019 02:49:06 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:06 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:07 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:07 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:07 : Received Dispatch (PRESENCE_UPDATE)
21/04/2019 02:49:08 : Received Dispatch (PRESENCE_UPDATE)

I hope this helps somehow.

DementedMistahJ commented 5 years ago
2019-04-09 15:57:35.381 +00:00 [INF] 15:57:35 Gateway     Disconnecting
2019-04-09 15:57:35.403 +00:00 [INF] 15:57:35 Gateway     Disconnected
2019-04-09 15:57:36.403 +00:00 [INF] 15:57:36 Gateway     Connecting
2019-04-09 16:01:51.099 +00:00 [INF] 16:01:51 Gateway     System.Net.WebSockets.WebSocketException (2): Unable to connect to the remote server ---> System.Net.Http.HttpRequestException: Connection timed out ---> System.Net.Sockets.SocketException: Connection timed out
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.CreateConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.WaitForCreatedConnectionAsync(ValueTask`1 creationTask)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.WebSockets.WebSocketHandle.ConnectAsyncCore(Uri uri, CancellationToken cancellationToken, ClientWebSocketOptions options)
   at System.Net.WebSockets.WebSocketHandle.ConnectAsyncCore(Uri uri, CancellationToken cancellationToken, ClientWebSocketOptions options)
   at System.Net.WebSockets.ClientWebSocket.ConnectAsyncCore(Uri uri, CancellationToken cancellationToken)
   at Discord.Net.WebSockets.DefaultWebSocketClient.ConnectInternalAsync(String host)
   at Discord.Net.WebSockets.DefaultWebSocketClient.ConnectAsync(String host)
   at Discord.API.DiscordSocketApiClient.ConnectInternalAsync()
   at Discord.API.DiscordSocketApiClient.ConnectInternalAsync()
   at Discord.API.DiscordSocketApiClient.ConnectAsync()
   at Discord.WebSocket.DiscordSocketClient.OnConnectingAsync()
   at Discord.WebSocket.DiscordSocketClient.OnConnectingAsync()
   at Discord.ConnectionManager.ConnectAsync(CancellationTokenSource reconnectCancelToken)
   at Discord.ConnectionManager.<>c__DisplayClass29_0.<<StartAsync>b__0>d.MoveNext()
2019-04-09 16:01:51.102 +00:00 [INF] 16:01:51 Gateway     Disconnecting
2019-04-09 16:01:51.103 +00:00 [INF] 16:01:51 Gateway     Disconnected
2019-04-09 16:01:52.865 +00:00 [INF] 16:01:52 Gateway     Connecting
2019-04-09 16:04:05.282 +00:00 [INF] 16:04:05 Gateway     System.TimeoutException: The operation has timed out.
   at Discord.ConnectionManager.WaitAsync()
   at Discord.WebSocket.DiscordSocketClient.OnConnectingAsync()
   at Discord.WebSocket.DiscordSocketClient.OnConnectingAsync()
   at Discord.ConnectionManager.ConnectAsync(CancellationTokenSource reconnectCancelToken)
   at Discord.ConnectionManager.<>c__DisplayClass29_0.<<StartAsync>b__0>d.MoveNext()
2019-04-09 16:04:05.284 +00:00 [INF] 16:04:05 Gateway     Disconnecting
2019-04-09 16:04:05.284 +00:00 [INF] 16:04:05 Gateway     Disconnected
2019-04-09 16:04:08.845 +00:00 [INF] 16:04:08 Gateway     Connecting
2019-04-09 16:05:48.846 +00:00 [INF] 16:05:48 Gateway     Disconnecting
2019-04-09 16:05:48.847 +00:00 [INF] 16:05:48 Gateway     Disconnected

After the last disconnect it never reconnected. It has done this multiple times now.

4egod commented 5 years ago

my client works fine about 1 day and then I've got this one: 04.23-18:20:57 DiscordBot Gateway Received Dispatch (PRESENCE_UPDATE) 04.23-18:21:00 DiscordBot Gateway Received Dispatch (PRESENCE_UPDATE) 04.23-18:21:16 DiscordBot Gateway Sent Heartbeat 04.23-18:21:16 DiscordBot Gateway Received HeartbeatAck 04.23-18:21:16 DiscordBot Gateway Latency = 234 ms 04.23-18:21:30 DiscordBot Gateway Received Dispatch (PRESENCE_UPDATE) 04.23-18:21:51 DiscordBot Gateway Received Dispatch (PRESENCE_UPDATE) 04.23-18:21:57 DiscordBot Gateway Sent Heartbeat 04.23-18:21:57 DiscordBot Gateway Received HeartbeatAck 04.23-18:21:57 DiscordBot Gateway Latency = 250 ms 04.23-18:22:38 DiscordBot Gateway Sent Heartbeat 04.23-18:22:39 DiscordBot Gateway Heartbeat Stopped 04.23-18:22:39 DiscordBot Gateway

at Discord.ConnectionManager.<>c__DisplayClass29_0.<b__0>d.MoveNext()

04.23-18:22:39 DiscordBot Gateway Disconnecting 04.23-18:22:39 DiscordBot Gateway Disconnecting ApiClient 04.23-18:22:39 DiscordBot Gateway Waiting for heartbeater 04.23-18:22:39 DiscordBot Gateway Waiting for guild downloader 04.23-18:22:39 DiscordBot Gateway Clearing large guild queue 04.23-18:22:39 DiscordBot Gateway Raising virtual GuildUnavailables 04.23-18:22:39 DiscordBot Gateway Disconnected from Bitsum Test 04.23-18:22:39 DiscordBot Gateway Disconnected from Bitsum 04.23-18:22:39 DiscordBot WebSocket connection was closed 04.23-18:22:39 DiscordBot at Discord.ConnectionManager.<>c__DisplayClass29_0.<b__0>d.MoveNext() 04.23-18:22:40 DiscordBot Rest GET voice/regions: 712.04 ms 04.23-18:22:42 DiscordBot Gateway A Disconnected handler is blocking the gateway task.

NovusTheory commented 5 years ago

@4egod Seems to me like your case is exactly as your last message. You're blocking the gateway which doesn't relate to the problems here where nothing is blocking the gateway.

4egod commented 5 years ago

@NovusTheory Ok. Thanks. Can you give me an advice? What should I do to figure out?

nopjmp commented 4 years ago

My current theory that may also fix #1052 is the ConnectionManager is not Cancelling the cancellation token that the heartbeat task is using when reconnecting. I may be wrong but I'm showing the same issue on my bot.

[21:59:20 WRN] [null]
System.Exception: Server missed last heartbeat
   at Discord.ConnectionManager.<>c__DisplayClass29_0.<<StartAsync>b__0>d.MoveNext()
[21:59:20 INF] Disconnecting
[21:59:20 INF] Disconnected
[21:59:21 INF] Connecting
[21:59:31 WRN] Failed to resume previous session
[21:59:31 INF] Connected
[21:59:33 INF] Ready
[22:02:15 WRN] [null]
System.Exception: Server missed last heartbeat
   at Discord.ConnectionManager.<>c__DisplayClass29_0.<<StartAsync>b__0>d.MoveNext()
[22:02:15 INF] Disconnecting
[22:02:15 INF] Disconnected
[22:02:16 INF] Connecting
[22:02:17 INF] Connected
[22:02:17 INF] Resumed previous session
[22:09:09 WRN] [null]
System.Exception: Server missed last heartbeat
   at Discord.ConnectionManager.<>c__DisplayClass29_0.<<StartAsync>b__0>d.MoveNext()

https://github.com/discord-net/Discord.Net/blob/a4846516fbce6cc3f80c04be43667f34048e23c4/src/Discord.Net.WebSocket/DiscordSocketClient.cs#L476

https://github.com/discord-net/Discord.Net/blob/a4846516fbce6cc3f80c04be43667f34048e23c4/src/Discord.Net.WebSocket/DiscordSocketClient.cs#L1678-L1719

I may be missing something in the ConnectionManager where it's cancelling the token that is being used. I've only spent a few minutes on this and can try to help dig deeper into the error connections.

darkalfx commented 4 years ago

Handling all the edge cases in the reconnect logic is hard - especially when discord outages are concerned, so I doubt we'll ever get all of them. This is definitely something we'll look into though.

I don't disagree with that assertion, so assuming that you guys can't or won't be able to handle all possible edge-cases anytime soon, can we at least have an option to stop the API from reconnecting at all?

Let the API report Disconnected and let me manually reconnect it, should be way easier for everyone involved than to try to hunt down all of the re-connection edge cases and it will be stable as hell.

RheaAyase commented 4 years ago

Indeed, at the moment the only way to reliably reconnect is to kill your whole application. The only other option is memory leaking the Discord client and creating a new one - because you can't safely dispose of it without deadlock ( #1448 )

I have a small bot that I don't mess with very often and that one I restart only once a month and memoryleaking it in the meantime. Effectively that means going from 20GBs of RAM usage to about 45GBs in that one month on a bot that's on 3 Discord servers of < 5000 users total. Can't imagine leaking the big one. That gets booted on every Disconnect.

FiniteReality commented 4 years ago

One potential idea I had to fix this would be to go the route of Wumpus.Net. Wumpus.Net drops the StartAsync/StopAsync system and instead uses a RunAsync method to run the client. This RunAsync method would run the client until an unhandled exception occurs, including reconnecting and resuming where possible, accepting a cancellation token. Then, the StopAsync method would cancel this RunAsync call, or you could pass in your own CancellationToken.

To retain backwards compatibility, StartAsync would likely be rewritten into:

async Task StartAsync()
{
    while (true)
    {
        try
        {
            await RunAsync(CancellationToken.None);
        }
        catch (OperationCanceledException)
        {
            break;
        }
        catch (Exception)
        {
            // potentially log here instead of throw
            throw;
        }
    }
}

Going this route gives us more flexibility, and also lets us drop that annoying ConnectionManager class which is frankly impossible to debug IMHO.

etharion commented 4 years ago

For me it happens alot nowadays, a month ago it only did it every other day. Dont know if this is already stated, but somehow when it does this theres multiple instances listening to commands and all of them run the command. Can we alternatively catch the expection and reboot the program? Not sure if this is possible. image image

RheaAyase commented 4 years ago

Well, the amount of Disconnects is on Discord's own infrastructure and you can't do anything about it. Other than cute graphs ~ image

(And yes you can subscribe to the Disconnected event inside of which you can Environment.Exit or something.)

etharion commented 4 years ago

About the Environment.Exit, this that something you plan to make a hotfix of? Else I will try and look into it myself.

RheaAyase commented 4 years ago

Yeah this should really be bumped up and prioritized with these stupid changes. image

etharion commented 4 years ago

@RheaAyase Thanks for the heads up! I might still see if I can make something myself in the meanwhile tho.

AraHaan commented 4 years ago

Currently my only thought on this is on the example replace the await Task.Delay(Timeout.Infinite) line in the example using DI for commands and a command service with something like this instead might fix it for me. As I currently get this after a few hours starting my webapi where I cannot afford to reboot the whole process every time it fails so I am trying this now:

var connTime = 0;
while (true)
{
    if (client.ConnectionState == ConnectionState.Disconnected)
    {
        connTime++;
        if (connTime.Equals(30))
        {
            // if only we could call the log event method ourselves without compile error.
            await client.Log.Invoke(new LogMessage(LogSeverity.Info, "Gateway", "Has not been connected for 30 seconds. Assuming deadlock in connector.")).ConfigureAwait(false);
            await client.StartAsync().ConfigureAwait(false);
        }
    }
    else if (client.ConnectionState == ConnectionState.Connected && !connTime.Equals(0))
    {
        connTime = 0;
    }

    await Task.Delay(1000).ConfigureAwait(false);
}

Also another idea, perhaps I could have my bot use a branch of the code, find the logic behind reconnections then use the logger of d.net to debug the code (using line numbers as well pasted in to the thing to find where the deadlock is). Since I now write all the contents ever logged to a file named Discord.log. However here is another issue though, making the discord.net log level to be detailed on the event method so I can see all logs that the library can make on the file.

AraHaan commented 4 years ago

Alright so I am adding a logger to ConnectionManager logging all lines until the deadlock happens so that can identify where the deadlock really is (with the help of the _logger member of that class)

Also on first look I personally think this needs changed though (because I think that for normal "errors" that are deamed not "critical" could be causing this too and possibly not a deadlock because the cancelation token is not canceled?

        public void Error(Exception ex)
        {
            _readyPromise.TrySetException(ex);
            _connectionPromise.TrySetException(ex);
            _connectionCancelToken?.Cancel();
+            _reconnectCancelToken?.Cancel();
        }
        public void CriticalError(Exception ex)
        {
-            _reconnectCancelToken?.Cancel();
            Error(ex);
        }

While this might not have anything to do with it I think it might be causing the task to identify as completed or something. And that is if it is the real issue that would be the simplest fix, if not then that means the issue might be somewhere in the ApiClient class that is then used for opcodes and stuff in the regular SocketClient.

However problems like this I think deserves it's own log event that is used merely for debugging purposes only.

AraHaan commented 4 years ago

I also now notice what happens when it does this, it reconnects, but that it fails to send the ready (because GuildAvailible blocks the gateway) and then disconnects from an timeout error. Lets see what the log shows:

Discord_debug.log

Even though the bot is in only a single guild it seems that the A GuildAvailable handler is blocking the gateway task. could be where the deadlocks like that breaks the connector and spams trying to reconnect but then immediately disconnects when that handler causes the connector to time out.

I have found this by setting the log level in Discord.Net to Debug.

Thanks to the pull request for the ratelimits it seems that now it does try to recover and eventually becomes to the state where everything is usable again and works.

However I think the GuildAvailable Handlers needs a looking into further, if it was to get the blocking issue fixed maybe the TimeoutExceptions could disappear and the whole issue with holes of the bot being an "ghost" and showing as offline (often times 10 minutes to a few hours) after just being online for a total of 2 hours might fix this issue and the other issue that looks to be a duplicate of this one as well #1364.

AraHaan commented 4 years ago

After futher testing from swallowing the timeout exception from here https://github.com/discord-net/Discord.Net/blob/dev/src/Discord.Net.WebSocket/ConnectionManager.cs#L139 on https://github.com/discord-net/Discord.Net/blob/dev/src/Discord.Net.WebSocket/ConnectionManager.cs#L142 (instead of what L142 currently is on the github ofc) that all is working properly now. I suspect that the timeout values is not properly calculated and seems to be too short. However I think commenting out L139 or removing is also safe to do as well.

When I done this I have not experienced this issue in now over 2 hours. But I will report back when I find anything different. If I do not report back assume this change fixed this issue (and #1364).

AraHaan commented 4 years ago

Alright so I have discovered that the last line that it logs before it deadlocks in ConnectionManager.cs is this one https://github.com/discord-net/Discord.Net/blob/dev/src/Discord.Net.WebSocket/ConnectionManager.cs#L190

In a way I do not see what would actually cause a deadlock like this as such I might have to do the same logging on every line of code that can be hit (even if and else blocks) to see if the deadlock is in DiscordSocketClient or in DiscordSocketApiClient. It could be hiding in 1 of those classes and we not even know it.

@SubZero0 think I should add logging on every lines on those 2 files then pr your branch on your fork that goes into your pr that it builds off of? Maybe having this kind of logging can help prevent future issues with the gateway code that is in the library.

SubZero0 commented 4 years ago

@AraHaan Not sure if I understood but all builds are made from the master branch in this repo, my fork is what I use to write my changes (or fixes or whatever) to then PR them here. The only custom branch there is the one I use for my own bot that I haven't had issues with the connection deadlock in years (and no, there's no changes to the ConnectionManager), so it shouldn't really help finding it.