Azure / azure-signalr

Azure SignalR Service SDK for .NET
https://aka.ms/signalr-service
MIT License

Can I ignore `ServiceConnectionNotActiveException` and/or `AzureSignalRUnauthorizedException` logged directly by the SDK? #1486

Closed sanderaernouts closed 1 year ago

sanderaernouts commented 3 years ago

We recently deployed Azure SignalR Service in West Europe, and we mostly see errors at night or over the weekend. I have inspected the source code here on GitHub, and it seems that the SDK logs these error messages directly and also throws an exception that our code can handle. We have retry mechanisms in place, but because the SDK logs these errors directly, it is hard to tell whether we have a problem or whether I am looking at normal operation and regular logging from the Azure SignalR Service SDK.

We see a lot of ServiceConnectionNotActiveException errors with the message `Error while sending message to the service, the connection carrying the traffic is dropped. Error detail: Service reloading, please reconnect.` logged to our Application Insights, with the following stack trace:

0 {"method":"Microsoft.Azure.SignalR.ServiceConnectionBase+<WriteAsync>d__49.MoveNext","level":0,"assembly":"Microsoft.Azure.SignalR.Common, Version=1.11.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60","line":0}
1 {"method":"System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw","level":1,"assembly":"System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","line":0}
2 {"method":"System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess","level":2,"assembly":"System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","line":0}
3 {"method":"Microsoft.Azure.SignalR.ServiceConnection+<ProcessOutgoingMessagesAsync>d__25.MoveNext","level":3,"assembly":"Microsoft.Azure.SignalR, Version=1.11.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60","line":0}

And a few AzureSignalRUnauthorizedException errors with the message `Authorization failed. If you were using AccessKey, please check connection string and see if the AccessKey is correct. If you were using Azure Active Directory, please note that the role assignments will take up to 30 minutes to take effect if it was added recently.` and no further stack trace.

ServiceConnectionNotActiveException

ServiceConnectionBase.WriteAsync

In the case of ServiceConnectionBase.WriteAsync the error is logged by SafeWriteAsync and then rethrown by WriteAsync.

If I understand this correctly, in this case I could "ignore" the error logged by the SDK because a ServiceConnectionNotActiveException is thrown as well which I can catch and retry in my own code.

Microsoft.Azure.SignalR.ServiceConnection.ProcessOutgoingMessagesAsync

In the case of Microsoft.Azure.SignalR.ServiceConnection.ProcessOutgoingMessagesAsync, the exception is only logged and not rethrown.

It looks like this is some kind of message loop, so if a ServiceConnectionNotActiveException is thrown here, does this mean the connection itself was ended? Is the message now lost, or is it sent again when the client reconnects?

In other words, can I "ignore" this error as well, as long as the client code reconnects to SignalR when it is disconnected for some reason?

AzureSignalRUnauthorizedException

This might be related to the lifetime of the JWT token as mentioned here. Can this be "ignored" as well as long as the client attempts to reconnect upon receiving a 401?

vicancy commented 3 years ago

> Is the message now lost or is it sent again when the client reconnects?

The message is lost in this case. The error generally means that the message was lost because the connection was disconnected.

vicancy commented 3 years ago

For AzureSignalRUnauthorizedException, are you using AAD to connect?

sanderaernouts commented 3 years ago

> Is the message now lost or is it sent again when the client reconnects?
>
> Message is lost in this case, so the error generally means the "message" was lost due to the connection being disconnected.

We do retry these messages as long as an exception is thrown by the SignalR client. So if a transient fault caused the disconnect, the message might still arrive on a second or third try.

> For AzureSignalRUnauthorizedException, are you using AAD to connect?

We use AAD to authenticate the user and we use AAD (a managed user identity with RBAC roles) to connect to SignalR service.

vicancy commented 3 years ago

Yes, ServiceConnectionNotActiveException should be transient, so if you have retry logic, I think it is safe to ignore this exception.
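
The retry approach discussed above can be sketched as follows. This is a minimal illustration in TypeScript rather than the SDK's own .NET code, and it assumes the failure surfaces as an exception to the caller (as the thread describes for the WriteAsync path). `ConnectionNotActiveError`, `isTransient`, `backoffDelayMs`, and `sendWithRetry` are hypothetical names, not part of any SignalR SDK.

```typescript
// Hypothetical stand-in for a transient "connection dropped" failure.
// This is NOT a type from any SignalR SDK.
class ConnectionNotActiveError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ConnectionNotActiveError";
  }
}

// Classify by name so the check does not depend on prototype-chain details.
function isTransient(err: unknown): boolean {
  return err instanceof Error && err.name === "ConnectionNotActiveError";
}

// Exponential backoff without jitter: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Calls `send` up to `maxAttempts` times, waiting between attempts.
// Non-transient errors are rethrown immediately.
async function sendWithRetry(
  send: () => Promise<void>,
  maxAttempts = 3,
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await send();
      return;
    } catch (err) {
      // Give up on non-transient errors or once the attempt budget is spent.
      if (!isTransient(err) || attempt + 1 >= maxAttempts) throw err;
      await new Promise<void>((resolve) =>
        setTimeout(resolve, backoffDelayMs(attempt)),
      );
    }
  }
}
```

A caller would wrap its own send call, for example `await sendWithRetry(() => hub.send(message))`, where `hub.send` is a placeholder. A message dropped in a path that only logs the error and never throws cannot be recovered this way.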

@terencefan could you confirm whether seeing AzureSignalRUnauthorizedException when using AAD is expected, and under which conditions it could happen?

terencefan commented 3 years ago

@sanderaernouts Could you please share with us how you configured your AAD Auth?

  1. The TokenCredential you are currently using in your code.
    • In this case, DefaultAzureCredential, ManagedIdentityCredential, or EnvironmentCredential could be used.
  2. The Role you are currently using on Azure Portal.

Could you also email your ResourceId or ResourceName to me at tefa(at)microsoft.com (replace (at) with @) so we can check our logs to see whether the exception was expected?

sanderaernouts commented 3 years ago

@terencefan I have answered your questions below:

> 1. The TokenCredential you are currently using in your code.

We don't use a specific implementation of TokenCredential in our code. Instead, we use a connection string with the following format: `Endpoint=https://<signalR-service>.service.signalr.net;AuthType=aad;ClientId=<client-id>;version=1.0;`, where `<client-id>` is the client ID of our user managed identity.

> 2. The Role you are currently using on Azure Portal.

We assign the SignalR App Server (Preview) role to the user managed identity.

> Could you also share with me tefa(at)microsoft.com ((at) to @) your ResourceId or ResourceName so we could check our logs to see if the exception was expected or not?

✔️

vedion commented 2 years ago

Hi,

Any updates on this? We are getting the same two exceptions when using SignalR.

emulic commented 2 years ago

Hi, we are also experiencing the same issue. Any updates?

vicancy commented 2 years ago

AzureSignalRUnauthorizedException should be fixed by the latest release (1.8.3). For ServiceConnectionNotActiveException, the latest release ignores some unnecessary occurrences, but some of these exceptions are expected. Please provide the details of the exception (call stacks, when it happened) so I can check further.

datwelk commented 1 year ago

We are experiencing the same ServiceConnectionNotActiveException.

Just now at 12.44 pm CEST it started and lasted for about 15 minutes.

The past few occurrences of this issue are:

Could this be correlated with deploys that occur on Microsoft's side?

vicancy commented 1 year ago

Please feel free to reopen the issue if there are any updates.

Hi @datwelk, sorry for the late response. When such an exception happens and there are no abnormal logs, events, or network issues on your server side, please open an issue or a support ticket so the support team can handle it promptly. You could also email the resource name to me at lianwei(at)microsoft.com so I can check further if the issue happens again.