Azure / azure-signalr

Azure SignalR Service SDK for .NET
https://aka.ms/signalr-service
MIT License
426 stars 100 forks source link

Huge Connection Drop every day without apparent code reason #1853

Closed kauezatarinpecege closed 1 year ago

kauezatarinpecege commented 1 year ago

Describe the bug

Hello!

I’m involved into an investigation about signalR users being disconnected every day, generally at the same timespan. I really can’t see queer the problem is, and everything points to a SignalR hub issue (as if the hub has been reopening some connections randomly).

Our SignalR hubs are hosted on an Azure App Service, and the backend SDK version is 1.21.7 (7.0.5 on frontend).

As you can see in the below charts, we have a huge connectionCount drop at many subsequent days. I’ve checked on the server side and nothing really happened at this timespan. Virtually there is no reason to so many connections to be interrupted at once.

When the problem first started the last application release (front and back) had been made a month ago.

image image

We have a reconnection logic on the clients, so they will reconnect right after. The major problem here is that all the clients try to reconnect at once, and it makes the backend reach 100% CPU usage very quickly.

This is really weird, also. Can somebody explain why there are so many ServiceReloads at this time?

image

Did anybody experienced something like this before? Am I missing something? Any help is appreciated!

To Reproduce

We are not sure on how to reproduce this error as it seems to be related to the events mentioned above.

We will close this issue if:

Further technical details

vicancy commented 1 year ago

If the issue happened everyday around 00:55 UTC time since Sep 27th, it is the below issue and should've fixed on Oct. 18th, details as below:

Issue:

Customer see connection drops around 1:00 AM UTC time everyday from Sep 27th.

Cause:

The cert used by service ingress component has an old Subordinate CA that expires within one year, the service has a security policy that it always tries to renew the cert if it expires within one year. That leads to ingress component starting new instance with renewed cert while killing the old instance within 1 hour. As a result, the persistent connections connected to the old ingress instances were dropped.

Fix:

We updated the cert to use the new Public issuer that provides certs from a new set of Subordinate CAs.

Improvements:

Azure SignalR team will add alerts on such cert issue to early detect the issue Azure SignalR team plans to add graceful shutdown logic into the ingress components so that such connection drops could be categorized as "ServiceReload" instead of "NoError"

kauezatarinpecege commented 1 year ago

Thank you @vicancy, it seems that this issue affected our environment; the dates and times match our occurrences. We haven't seen any more problems until now.