Azure / azure-signalr

Azure SignalR Service SDK for .NET
https://aka.ms/signalr-service
MIT License

Hub going offline after about 3 days #1058

Open Iguanadad opened 4 years ago

Iguanadad commented 4 years ago

We have a project using azure-signalr that works fine for about 3 days. After that, although the server API hosting the hub is still online, no signals are broadcast from it to any of the clients.

Restarting the SignalR service in Azure does nothing, and the only way to re-establish the connections is to restart the hub server.

We call GlobalHost.ConnectionManager.GetHubContext() to get the IHubContext each time we attempt to broadcast (when we tried keeping a static copy of the IHubContext, no messages were broadcast at all), so we cannot see how we could be getting an out-of-date context.

Any suggestions on how we can keep the service running for more than 3 days would be appreciated.

After turning on detailed logging, we captured the following after one such failure (service_name and service_cid redacted):

Microsoft.Azure.SignalR Information: 0 : [Microsoft.Azure.SignalR.Connections.Client.Internal.WebSocketsTransport]Starting transport. Transfer mode: Binary. Url: 'wss://<service_name>.service.signalr.net/aspnetserver/?hub=priorityinspection_priorityinspectionhub&cid=<service_cid>'.
Microsoft.Azure.SignalR Information: 0 : [Microsoft.Azure.SignalR.Connections.Client.Internal.WebSocketsTransport]WebSocket closed by the server. Close status NormalClosure.
Microsoft.Azure.SignalR Warning: 0 : [Microsoft.Azure.SignalR.AspNet.ServiceConnection]Connection 0115170b-2b71-416d-8b17-06e875c4efac received error message from service: Connection ping timeout.
Microsoft.Azure.SignalR Information: 0 : [Microsoft.Azure.SignalR.Connections.Client.Internal.WebSocketsTransport]Transport is stopping.
Microsoft.Azure.SignalR Information: 0 : [Microsoft.Azure.SignalR.Connections.Client.Internal.WebSocketsTransport]WebSocket closed by the server. Close status NormalClosure.
Microsoft.Azure.SignalR Warning: 0 : [Microsoft.Azure.SignalR.AspNet.ServiceConnection]Connection 09621190-9aa1-4858-803d-ef0794f8d8f7 received error message from service: Connection ping timeout.
Microsoft.Azure.SignalR Information: 0 : [Microsoft.Azure.SignalR.Connections.Client.Internal.WebSocketsTransport]Transport is stopping.
Microsoft.Azure.SignalR Information: 0 : [Microsoft.Azure.SignalR.Connections.Client.Internal.WebSocketsTransport]Starting transport. Transfer mode: Binary. Url: 'wss://<service_name>.service.signalr.net/aspnetserver/?hub=priorityinspection_priorityinspectionhub&cid=<service_cid>'.
Microsoft.Azure.SignalR Warning: 0 : [Microsoft.Azure.SignalR.AspNet.ServiceConnection]Connection 43bed23f-0b5e-45a1-82fa-8c071012548d received error message from service: Connection ping timeout.
Microsoft.Azure.SignalR Information: 0 : [Microsoft.Azure.SignalR.Connections.Client.Internal.WebSocketsTransport]WebSocket closed by the server. Close status NormalClosure.
Microsoft.Azure.SignalR Warning: 0 : [Microsoft.Azure.SignalR.MultiEndpointServiceConnectionContainer]BroadcastDataMessage message (null) is not sent to endpoint (Primary)https://<service_name>.service.signalr.net because all connections to this endpoint are offline.
Microsoft.Azure.SignalR Warning: 0 : [Microsoft.Azure.SignalR.MultiEndpointServiceConnectionContainer]BroadcastDataMessage message (null) is not sent to endpoint (Primary)https://<service_name>.service.signalr.net because all connections to this endpoint are offline.
Microsoft.Azure.SignalR Warning: 0 : [Microsoft.Azure.SignalR.MultiEndpointServiceConnectionContainer]BroadcastDataMessage message (null) is not sent to endpoint (Primary)https://<service_name>.service.signalr.net because all connections to this endpoint are offline.

The last error message is then repeated several hundred times.
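For anyone triaging a similar trace, here is a minimal sketch of a script that counts the events seen in the excerpt above and checks whether any transport start shows up after the first dropped broadcast. It only matches the message texts quoted in this issue. The function names, the `SAMPLE` lines (abbreviated from the excerpt), and the "no restart after the first drop" heuristic are assumptions for illustration, not part of the SDK.

```python
import re
from collections import Counter

# Message fragments copied from the trace excerpt in this issue.
PATTERNS = {
    "transport_started": re.compile(r"Starting transport"),
    "closed_by_server": re.compile(r"WebSocket closed by the server"),
    "ping_timeout": re.compile(r"Connection ping timeout"),
    "broadcast_dropped": re.compile(r"all connections to this endpoint are offline"),
}


def summarize(lines):
    """Count how often each known event appears in an iterable of log lines."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def reconnect_after_offline(lines):
    """Hypothetical heuristic: was a transport started after the first dropped broadcast?

    Returns None if no broadcast was dropped, otherwise True or False.
    """
    seen_drop = False
    for line in lines:
        if PATTERNS["broadcast_dropped"].search(line):
            seen_drop = True
        elif seen_drop and PATTERNS["transport_started"].search(line):
            return True
    return False if seen_drop else None


# Abbreviated lines from the excerpt (logger prefixes shortened for readability).
SAMPLE = [
    "[WebSocketsTransport]Starting transport. Transfer mode: Binary.",
    "[WebSocketsTransport]WebSocket closed by the server. Close status NormalClosure.",
    "[ServiceConnection]Connection 0115170b-2b71-416d-8b17-06e875c4efac "
    "received error message from service: Connection ping timeout.",
    "[MultiEndpointServiceConnectionContainer]BroadcastDataMessage message (null) "
    "is not sent to endpoint (Primary)https://<service_name>.service.signalr.net "
    "because all connections to this endpoint are offline.",
]

if __name__ == "__main__":
    for name, n in summarize(SAMPLE).items():
        print(f"{name}: {n}")
    print("reconnect seen after offline:", reconnect_after_offline(SAMPLE))
```

To run it against a real trace file, pass an open file object to both functions instead of `SAMPLE`. A result of `False` from `reconnect_after_offline` matches the pattern in this report: broadcasts are dropped and no new transport start is logged afterwards.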

KKhurin commented 4 years ago

@Iguanadad, thanks for the report. This should not happen as the SDK code is designed to automatically reconnect to the service. If you'd like us to take a look at the backend please send your ResourceID (it looks like /subscriptions/ xx /resourceGroups/ yy /providers/Microsoft.SignalRService/SignalR/ zz) to [my github alias]@microsoft.com

vicancy commented 4 years ago

What is the SDK version used? (Noted that you are using the latest, 1.5.1.)

Could you check the memory and CPU usage of the app server? Were they abnormal before you restarted the app server? "Connection ping timeout" usually happens when the app server is too busy to handle the ping messages from the Azure SignalR Service, so the service closes the connection.

Iguanadad commented 4 years ago

CPU and memory usage are not at all abnormal and not close to the limits of the service plan. Also, the app server is very responsive to incoming requests from another app service, so it doesn't appear to be a load issue.

I would be happy to implement something to detect the offline state and reconnect to the service, but I can't find any way to do either from our code (as @KKhurin mentioned, the SDK is supposed to handle the reconnection automatically).

vicancy commented 4 years ago

Yes, it should always reconnect. Could you share the log files from when the incident happens with @KKhurin and me (lianwei(at)microsoft.com)? Please include timestamps, or provide the cid, so we can narrow things down through the logs. Also, when the issue takes place again, could you take a dump file for us?

vicancy commented 4 years ago

On a second look, this appears to be the same issue that https://github.com/Azure/azure-signalr/pull/763/files tries to fix. However, it seems it can still happen in some circumstances: the transport layer of a server connection is stopped, but the server connection is either not restarted or is restarted as an on-demand connection. Broadcast messages only go through fixed connections. We still need more info to find the root cause.
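To illustrate the suspected failure mode, here is a toy model, not the SDK's code. It assumes only what is stated above: broadcasts are routed through fixed connections, and a dropped connection may come back as an on-demand one. All class and function names (`Kind`, `ServerConnection`, `EndpointContainer`, `simulate`) are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Kind(Enum):
    FIXED = "fixed"
    ON_DEMAND = "on-demand"


@dataclass
class ServerConnection:
    kind: Kind
    online: bool = True


@dataclass
class EndpointContainer:
    connections: List[ServerConnection] = field(default_factory=list)

    def broadcast(self, message: str) -> bool:
        """Return True if at least one connection can carry the broadcast."""
        # Assumption taken from the comment above: only fixed connections carry broadcasts.
        return any(c.kind is Kind.FIXED and c.online for c in self.connections)

    def drop_and_restart(self, index: int, restart_as: Kind) -> None:
        """Simulate a ping timeout, then a restart as the given connection kind."""
        self.connections[index] = ServerConnection(kind=restart_as)


def simulate(restart_as: Kind, fixed_count: int = 3) -> bool:
    """Drop every fixed connection once, restart it, then try to broadcast."""
    container = EndpointContainer(
        [ServerConnection(Kind.FIXED) for _ in range(fixed_count)]
    )
    for i in range(fixed_count):
        container.drop_and_restart(i, restart_as)
    return container.broadcast("hello")


if __name__ == "__main__":
    print("restart as fixed, broadcast delivered:", simulate(Kind.FIXED))
    print("restart as on-demand, broadcast delivered:", simulate(Kind.ON_DEMAND))
```

In this model the connections are all online again after the on-demand restarts, yet the broadcast still finds no fixed connection to use. That would be consistent with the repeated "all connections to this endpoint are offline" warnings, though the actual root cause is not confirmed at this point in the thread.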

vicancy commented 4 years ago

This should be fixed in the latest release, 1.6.0. Please give it a try.

Iguanadad commented 4 years ago

Latest release deployed without an issue - many thanks