Azure / azure-signalr

Azure SignalR Service SDK for .NET
https://aka.ms/signalr-service
MIT License

Error: Failed to invoke 'JoinGroup' due to an error on the server #2045

Open wbuck opened 2 months ago

wbuck commented 2 months ago

Describe the bug

We're attempting to use Azure SignalR (with managed identities) with an Angular frontend and an ASP.NET Core backend. For one particular method call I see the following error about 50% of the time:

Error: Failed to invoke 'JoinGroup' due to an error on the server.

We're supplying an object with 3 properties to the JoinGroup function. From the Angular application this looks like the following:

const url = 'some url';

const options: IHttpConnectionOptions = {
   accessTokenFactory: async () =>  await service.getToken(),
   transport: HttpTransportType.WebSockets,
   withCredentials: false
};

const hub = new HubConnectionBuilder()
      .withUrl(url, options)
      .configureLogging(LogLevel.Debug)
      .withAutomaticReconnect()
      .build();

await hub.start();

// This is the call that fails about 50% of the time.
const msg = await hub.invoke<string>('JoinGroup', {
    organizationId: '1bf751e8-67cd-4fb0-89cc-206c31371ac1',
    environmentId: '02a03e43-839b-4f71-88b1-d809d26a0499',
    recoveryPlanId: '9c18d9b7-a58f-4f8f-8c46-aedc996dc304'
});

In the ASP.NET Core application, our hub method looks like the following:

public async Task<string> JoinGroup(RecoveryGroupName group)
{
    await Groups.AddToGroupAsync(Context.ConnectionId, group.ToString());
    return "You've joined a group!";
}

And the RecoveryGroupName looks like the following:

public record RecoveryGroupName
{
    public required Guid OrganizationId { get; init; }
    public required Guid EnvironmentId { get; init; }
    public required Guid RecoveryPlanId { get; init; }

    // Right now I'm just creating a new string here each time ToString is called.
    // Once I get this bloody thing working I'll fix it.
    public override string ToString() 
        => $"{OrganizationId}_{EnvironmentId}_{RecoveryPlanId}";
}

I can't figure out why this is happening!! I don't see any errors from the server (nothing in the logs or console when running locally).

I've set EnableDetailedErrors to true but that made no difference.

builder.Services
    .AddSignalR(opt => opt.EnableDetailedErrors = true)
    .AddAzureSignalR(option => /* Setting up connection string etc */);

All I see in the browser is the following: image

Am I doing something wrong here? How can I troubleshoot this? There are no error messages!

One thing I've noticed is if I set ServerStickyMode to Required I don't see the issue:

builder.Services
    .AddSignalR(opt =>
    {
        opt.EnableDetailedErrors = true;

    })
    .AddAzureSignalR(option =>
    {
        option.ServerStickyMode = ServerStickyMode.Required;
        option.Endpoints =
        [
          new ServiceEndpoint(new Uri("ourUrl"), tokenCredential)
        ];
    })

Why does this fix my issue? I didn't think we needed sticky sessions since we're using Azure SignalR.

Any insight from you folks would be really appreciated!

I've enabled additional server side logging via:

"Microsoft.AspNetCore.SignalR": "Debug",
"Microsoft.AspNetCore.Http.Connections": "Debug"

But this doesn't show any errors even when I recreate the error. All I see is the following additional logging:

[10:36:52 DBG] OnConnectedAsync started.
[10:36:52 DBG] Found protocol implementation for requested protocol: json.
[10:36:52 DBG] Completed connection handshake. Using HubProtocol 'json'.
[10:36:52 INF] SignalR client XFxkllJJ3TX7P9bhhN1eBQfLK5oAY02 connected
[10:36:53 DBG] Received hub invocation: InvocationMessage { InvocationId: "0", Target: "JoinGroup", Arguments: [ 5ec7ee49-6153-0862-189d-4ee6b5713fd8_294a4a76-f9f8-4f91-3522-08dc9ad378bc_df61c54d-c80c-4ab0-c990-08dcd1aa483c ], StreamIds: [ ] }.

I've also enabled the trace tool, but again it doesn't show any errors!

image

Can anyone help me with this?


wbuck commented 2 months ago

Could someone at least point me to a Microsoft rep I can talk to about this?

vicancy commented 2 months ago

  1. Enable EnableDetailedErrors on your app server so that the client can see the details of the exception that was thrown:
    services.AddSignalR(o => o.EnableDetailedErrors = true).AddAzureSignalR();
  2. If you don't see the issue when ServerStickyMode is set to Required, there are several possibilities:
    1. Different versions of your app server are connected to the same Azure SignalR instance using the same hub name, so the client is sometimes routed to an old version of the app server whose JoinGroup method has different logic and therefore throws. In this case, you could try specifying a different ApplicationName option to see whether the issue persists:
      services.AddSignalR()
          .AddAzureSignalR(options => options.ApplicationName = "A");
    2. Your JoinGroup method has state dependencies that are created during the client negotiation step. For example, an auth middleware stores some data in memory when the client negotiates with your app server, and you expect to use that data inside the JoinGroup method.

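
The state-dependency possibility can be illustrated with a toy model. This is only a sketch and not Azure SignalR's actual routing logic: `AppServer`, `negotiate`, `joinGroup`, and `countFailures` are hypothetical names, and it assumes two app server instances that each keep negotiation state in their own process memory, with non-sticky routing modeled as simple alternation between them.

```typescript
// Toy model: each app server keeps per-user state in its own memory.
class AppServer {
  private readonly negotiated = new Set<string>();

  constructor(public readonly name: string) {}

  // Called when the client negotiates with this particular server.
  negotiate(userId: string): void {
    this.negotiated.add(userId);
  }

  // Hub method that relies on state created during negotiation.
  joinGroup(userId: string): string {
    if (!this.negotiated.has(userId)) {
      throw new Error(`${this.name}: no negotiation state for ${userId}`);
    }
    return "You've joined a group!";
  }
}

// Counts how many of `attempts` invocations fail.
// sticky = true: the invocation lands on the server that handled negotiation.
// sticky = false: the invocation alternates between the two servers (an assumption).
function countFailures(attempts: number, sticky: boolean): number {
  let failures = 0;
  for (let i = 0; i < attempts; i++) {
    const a = new AppServer("A");
    const b = new AppServer("B");
    const userId = `user-${i}`;
    a.negotiate(userId); // negotiation always happens on server A in this model
    const routedTo = (sticky || i % 2 === 0) ? a : b;
    try {
      routedTo.joinGroup(userId);
    } catch {
      failures++;
    }
  }
  return failures;
}
```

In this model, `countFailures(10, false)` returns 5 and `countFailures(10, true)` returns 0, which mirrors the "fails about half the time unless sticky mode is Required" symptom. Real routing is not a strict alternation, so the exact ratio in practice may differ.
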
wbuck commented 2 months ago

Hi @vicancy, so I have already enabled the detailed errors in Program.cs (I mentioned in my original post) but it still never shows me any exceptions.

The JoinGroup method is very simple, it just adds the incoming connection ID to a group. The logic of the JoinGroup function is the same for all instances of our API service.

> You have different versions of app server connecting to the same Azure SignalR using the same hub name

So let's say we have our production API service deployed in Kubernetes connected to Azure SignalR. Now let's say I'm also running the API service locally on my dev machine which is also connected to the same Azure SignalR.

Would this potentially cause this issue?

If the answer is yes, how should I handle this in a production scenario when the API service running in k8s has multiple instances? Should I create a custom (random) application name for each running instance of the API service?

Also, if I set a random application name for each running instance of the API service does this mean I can stop using sticky sessions?

vicancy commented 2 months ago

> So let's say we have our production API service deployed in Kubernetes connected to Azure SignalR. Now let's say I'm also running the API service locally on my dev machine which is also connected to the same Azure SignalR.

When you run the API service locally on your dev machine and it connects to the same Azure SignalR instance, your local dev machine and your production API services are all considered running instances of the same app, so your production connections might be routed to your local API. Usually we use a separate Azure SignalR instance for dev/test purposes. You could also use a different ApplicationName for each stage.
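
The isolation effect of ApplicationName can be sketched as a toy model. The names here (`ServerInstance`, `candidateServers`, the hub name "recovery", and the instance labels) are hypothetical, and the sketch assumes only what is described above: instances that share a hub name and an application name form one pool that client connections may be routed into.

```typescript
interface ServerInstance {
  id: string;              // hypothetical label, e.g. "k8s-pod-1" or "dev-laptop"
  hub: string;             // hub name the instance serves
  applicationName: string; // empty string models "ApplicationName not set"
}

// Returns the ids of the instances a client connection could be routed to
// in this model: same hub name and same application name.
function candidateServers(
  all: ServerInstance[],
  hub: string,
  applicationName: string
): string[] {
  return all
    .filter(s => s.hub === hub && s.applicationName === applicationName)
    .map(s => s.id);
}

// Everything connects with the default (unset) application name:
// the dev machine is part of the production pool.
const shared: ServerInstance[] = [
  { id: "k8s-pod-1", hub: "recovery", applicationName: "" },
  { id: "k8s-pod-2", hub: "recovery", applicationName: "" },
  { id: "dev-laptop", hub: "recovery", applicationName: "" },
];

// The dev machine uses its own application name and drops out of that pool.
const isolated: ServerInstance[] = shared.map(s =>
  s.id === "dev-laptop" ? { ...s, applicationName: "dev" } : s
);

const productionPoolShared = candidateServers(shared, "recovery", "");
const productionPoolIsolated = candidateServers(isolated, "recovery", "");
```

Here `productionPoolShared` contains all three instances, including `dev-laptop`, while `productionPoolIsolated` contains only the two pods. All pods that run the same hub logic keep the same application name, so a random name per instance would not be appropriate in this model.
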

> how should I handle this in a production scenario when the API service running in k8 has multiple instances running

This depends on your scenario. Do these k8s instances share the same Hub logic or not? If they are multiple instances of the same Hub logic, they should use the same ApplicationName. If they serve different purposes and have different Hub logic, they should use different ApplicationNames to isolate them from one another.

You could also try logging the exceptions thrown in the method in your app server side to view the detailed error:

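using Microsoft.AspNetCore.SignalR;
using Microsoft.Extensions.Logging;

public class YourHub : Hub
{
    private readonly ILogger<YourHub> _logger;

    public YourHub(ILogger<YourHub> logger)
    {
        _logger = logger;
    }

    public void JoinGroup()
    {
        try
        {
            // ... (your existing JoinGroup logic)
        }
        catch (Exception e)
        {
            // Log the full exception on the server, then rethrow so the client still sees the failure.
            _logger.LogError(e, "JoinGroup failed");
            throw;
        }
    }
}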
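
On the client side, a similar wrapper can record each failed invocation with a timestamp, which may help line up browser errors with server-side log entries. This is a minimal sketch: `Invoker`, `InvokeFailure`, and `invokeLogged` are hypothetical names, and `Invoker` is a local stand-in declaring only an `invoke` method shaped like the one used on the hub connection earlier in this thread.

```typescript
// Minimal stand-in for the part of the hub connection used here.
interface Invoker {
  invoke<T>(methodName: string, ...args: unknown[]): Promise<T>;
}

interface InvokeFailure {
  methodName: string;
  message: string;
  at: string; // ISO timestamp, for correlating with server logs
}

// Invokes a hub method; on failure, records what failed and when,
// then rethrows so the caller's own error handling still runs.
async function invokeLogged<T>(
  hub: Invoker,
  failures: InvokeFailure[],
  methodName: string,
  ...args: unknown[]
): Promise<T> {
  try {
    return await hub.invoke<T>(methodName, ...args);
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    failures.push({ methodName, message, at: new Date().toISOString() });
    throw err;
  }
}
```

With the connection from the original post, a call might look like `await invokeLogged<string>(hub, failures, 'JoinGroup', groupName)`, where `failures` is an array you inspect or forward to your own logging.
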