dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Azure App Service fails to start with never ending Orleans connection exceptions #8891

Open justinm0 opened 9 months ago

justinm0 commented 9 months ago

We have an Azure App Service running Orleans 7.2.4 with ASP.NET Core on .NET 8. It had been running well for a few weeks. This morning, after a scheduled scale-up, the site failed to start and logged exceptions like these:

Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.0.2.254:20016:68181414, will retry after 454.0547ms
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226 

Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.2.254:20016. Error: AccessDenied
Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.2.254:20024. Error: AccessDenied

Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.0.2.254:20024:68182019. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.2.254:20024. Error: AccessDenied
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 54
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 193
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 221
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|29_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 226 

These happened continuously for hours. We had to remove Orleans before the site would start up.

Private ports are configured to provide 2 ports. Orleans is configured like this:

var endpointAddress = IPAddress.Parse(context.Configuration["WEBSITE_PRIVATE_IP"]);
var strPorts = context.Configuration["WEBSITE_PRIVATE_PORTS"].Split(',');
var siloPort = int.Parse(strPorts[0]);
var gatewayPort = int.Parse(strPorts[1]);

siloBuilder
    .UseAzureStorageClustering(options =>
    {
        options.TableName = orleansSettings.TableStorageConfig.TableName;
        options.ConfigureTableServiceClient(orleansSettings.TableStorageConfig.ConnectionString);
    })
    .ConfigureEndpoints(endpointAddress, siloPort, gatewayPort)
    .Configure<ClusterOptions>(options =>
    {
        options.ClusterId = orleansSettings.ClusterConfig.ClusterId;
        options.ServiceId = orleansSettings.ClusterConfig.ServiceId;
    })
    .AddAzureBlobGrainStorage(DefaultGrainStorageProvider, options =>
    {
        options.ConfigureBlobServiceClient(orleansSettings.BlobStorageConfig.ConnectionString);
        options.ContainerName = orleansSettings.BlobStorageConfig.ContainerName;
    });

We also use Orleans Dashboard. We do not use reminders or streams.

The App Service has only 1 instance and is the only member of the cluster. The clustering table has about 100 rows in it. The Status of some rows is Dead but many more are Joining or Active.

Changing to a different clustering table didn't resolve the issue.

ReubenBond commented 9 months ago

Which App Service plan are you using? If you are only using a single host (and you have no external Orleans clients), you could try with UseLocalhostClustering()
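
For readers following along, a minimal sketch of what that suggestion could look like. It assumes a single-instance deployment with no external Orleans clients, as described above. The class and method names (SingleHostSiloSetup, UseSingleHostOrleans) are hypothetical and exist only for illustration. UseLocalhostClustering() is called with its defaults here; whether those default ports suit App Service is not established in this thread.

```csharp
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

// Hypothetical helper, not part of Orleans or of the original poster's code.
public static class SingleHostSiloSetup
{
    public static IHostBuilder UseSingleHostOrleans(this IHostBuilder hostBuilder) =>
        hostBuilder.UseOrleans(siloBuilder =>
        {
            // In-memory development clustering: membership is not read from
            // or written to the Azure Table, so stale rows there are never consulted.
            siloBuilder.UseLocalhostClustering();

            // Grain storage (for example the AddAzureBlobGrainStorage call from
            // the original configuration) would still be registered here.
        });
}
```

The trade-off is that membership lives only in process memory, so this approach only fits while the app stays on one instance.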

justinm0 commented 9 months ago

We are using a Windows app service plan on sizes S1 or S2 in South Africa North.

I discovered that the app was being automatically restarted because it wasn't starting up in time, which is not fully Orleans's fault, although Orleans does add to the startup time. I increased the startup timeout through web.config and the site now starts.

However, when the app uses the original table with its many records, it still keeps trying to connect to other hosts indefinitely and throws exceptions. If I switch it to a clean table, it runs without exceptions.

Is there any way to resolve this without clearing the table?

justinm0 commented 8 months ago

Any insight on my question above? This issue still happens sometimes. Is using localhost clustering the only option here?

ReubenBond commented 1 month ago

This should work. One change which might be needed is to set listenOnAllHostAddresses: true. I have opened a PR to update docs with full instructions: https://github.com/dotnet/docs/pull/42781/files#diff-b66d9de75a966a6e75cbc3d3608c153dac5340e7d12a2ef6ba6758929e2875b9
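
A minimal sketch of the endpoint change suggested above, based on the configuration from the original post. The flag is the optional fourth argument of ConfigureEndpoints; it is passed positionally here so the sketch does not depend on the exact parameter name. The helper names (AppServiceEndpoints, ConfigureAppServiceEndpoints) and the validation checks are assumptions added for illustration; the linked docs PR remains the authoritative reference.

```csharp
using System;
using System.Net;
using Microsoft.Extensions.Configuration;
using Orleans.Hosting;

// Hypothetical helper, not part of Orleans or of the original poster's code.
public static class AppServiceEndpoints
{
    public static ISiloBuilder ConfigureAppServiceEndpoints(
        this ISiloBuilder siloBuilder, IConfiguration configuration)
    {
        // App Service exposes the private IP and ports as app settings.
        var privateIp = configuration["WEBSITE_PRIVATE_IP"]
            ?? throw new InvalidOperationException("WEBSITE_PRIVATE_IP is not set.");
        var privatePorts = configuration["WEBSITE_PRIVATE_PORTS"]
            ?? throw new InvalidOperationException("WEBSITE_PRIVATE_PORTS is not set.");

        var ports = privatePorts.Split(',');
        if (ports.Length < 2)
        {
            throw new InvalidOperationException("At least two private ports are required.");
        }

        var endpointAddress = IPAddress.Parse(privateIp);
        var siloPort = int.Parse(ports[0]);
        var gatewayPort = int.Parse(ports[1]);

        // Fourth argument: listen on any host address while still advertising
        // the private IP to other cluster members.
        return siloBuilder.ConfigureEndpoints(endpointAddress, siloPort, gatewayPort, true);
    }
}
```

In the original configuration this would replace the existing .ConfigureEndpoints(endpointAddress, siloPort, gatewayPort) call, for example as siloBuilder.ConfigureAppServiceEndpoints(context.Configuration).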