dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Silos Experience RegisterActivationInGrainDirectory Exceptions then crash #6889

Closed. SebFerraro closed this issue 2 years ago.

SebFerraro commented 3 years ago

We have seen our UAT cluster of 6 silos experience many RegisterActivationInGrainDirectory exceptions for around 30 seconds or so, and then crash.

It is seemingly random:


https://gist.github.com/SebFerraro/e4220dcb35fd680dee2d1b2438f4a318

ReubenBond commented 3 years ago

Could you please include your Orleans version as well as details about how your cluster is configured? Redacted config code is good for the latter.

SebFerraro commented 3 years ago

Oh of course, apologies:

Orleans v3.3.0 on .NET Framework 4.8

I assume this is what you mean by cluster configuration:

builder.Configure<ClusterOptions>(
        options =>
        {
            options.ClusterId = Constants.DeploymentName;
            options.ServiceId = "BifrostService";
        })
    .UseDashboard(options =>
    {
        options.Host = "*";
        options.Port = 8080;
        options.HostSelf = true;
        options.CounterUpdateIntervalMs = 1000;
    });

const string invariant = "System.Data.SqlClient";
var connectionString = Constants.GetConnectionString();

builder
    .ConfigureEndpoints(22222, 40000)
    .UseAdoNetClustering(options =>
    {
        options.Invariant = invariant;
        options.ConnectionString = connectionString;
    })
    .UseAdoNetReminderService(options =>
    {
        options.Invariant = invariant;
        options.ConnectionString = connectionString;
    })
    .UseSiloUnobservedExceptionsHandler();

builder.AddSimpleMessageStreamProvider(
        "SMSProvider",
        options =>
        {
            options.FireAndForgetDelivery = true;
            options.OptimizeForImmutableData = true;
        })
    .Configure<ClientMessagingOptions>(options =>
    {
        options.ResponseTimeout = TimeSpan.FromSeconds(10);
        options.ResponseTimeoutWithDebugger = TimeSpan.FromSeconds(10);
    })
    .AddMemoryGrainStorageAsDefault()
    .AddMemoryGrainStorage("PubSubStore")
    .UseServiceProviderFactory(new AutofacServiceProviderFactory())
    .ConfigureContainer<ContainerBuilder>(this.ConfigureContainer)
    .ConfigureLogging(this.ConfigureLogging)
    .ConfigureApplicationParts(parts =>
        parts.AddApplicationPart(typeof(UnitGrain).Assembly).AddApplicationPart(typeof(IUnitGrain).Assembly));

this.siloHost = builder.Build();

ReubenBond commented 3 years ago

Are there any other components being configured, such as grain call filters? Are you able to provide more logs? Some of the errors you're seeing appear to be the sort of thing you'd see if you were experiencing severe performance degradation. If you upgrade to Orleans v3.4.0, you should see extra warning logs if you do have severe perf issues (such as those caused by thread pool starvation).
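
For anyone reading along, a grain call filter is the kind of silo-wide interceptor being asked about here. A minimal, purely illustrative registration is sketched below; the pass-through lambda is an assumption for illustration, not the reporter's code:

builder.AddIncomingGrainCallFilter(async context =>
{
    // Every grain call on this silo flows through here, so anything slow or
    // blocking inside a filter degrades the whole silo.
    await context.Invoke();
});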

SebFerraro commented 3 years ago

I'll update to that right away. I only pushed the update to 3.3.0 a week or so ago and didn't realise 3.4.0 was so close. I'll report back with errors once I've done that. I can provide the entire logfile (~300 MB) for the latest crash if that helps at all; it's roughly 24 hours' worth of logs, I think.

ReubenBond commented 3 years ago

When the silo crashes, are there any last log messages printed? Do you have a crash dump that I could analyze?

SebFerraro commented 3 years ago

Unfortunately the last logs are the above:

[2021-01-13 13:08:20.503 GMT 319 Information 100147 Orleans.Runtime.Dispatcher] Intermediate NonExistentActivation for message OneWay S172.30.88.21:22222:348148000*cli/30873788@06548493->S172.30.88.21:22222:348148000*grn/FFE8F7B3/00000000@cc474c95 #59082119[ForwardCount=1], with Exception Orleans.Runtime.Catalog+NonExistentActivationException: Non-existent activation: [ActivationAddress: S172.30.88.21:22222:348148000*grn/FFE8F7B3/00000000@cc474c95, Full GrainId: [GrainId: *grn/FFE8F7B3/0000000000000000000000000000000003ffffffffe8f7b3-0x43F93686, IdCategory: Grain, BaseTypeCode: -1509453 (xFFE8F7B3), PrimaryKey: 0 (x0), UniformHashCode: 1140405894 (0x43F93686)], Full ActivationId: @4f8f45041cf6b69ba6841b68cc474c950000000000000000], grain type: OrleansDashboard.DashboardGrain.
 at Orleans.Runtime.Catalog.GetOrCreateActivation(ActivationAddress address, Boolean newPlacement, String grainType, String genericArguments, Dictionary`2 requestContextData, Task& activatedPromise)
 at Orleans.Runtime.Dispatcher.ReceiveMessage(Message message) 
[2021-01-13 13:08:20.503 GMT 267 Information Orleans.Messaging.Dispatcher.Forwarding Orleans.Messaging]

It basically just cuts off. The last crash dump indicates it is being told to shut down:

Message: FATAL EXCEPTION from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! I should be Dead according to membership table (in CleanupTableEntries): entry = [SiloAddress=S172.30.88.21:22222:348148000 SiloName=Silo_725e5 Status=Dead HostName=VerUatBifrost01 ProxyPort=40000 RoleName= UpdateZone=0 FaultZone=0 StartTime = 2021-01-12 11:46:41.677 GMT IAmAliveTime = 2021-01-13 13:06:51.897 GMT Suspecters = [S172.30.88.23:22222:348142265, S172.30.88.25:22222:348092452] SuspectTimes = [2021-01-13 13:07:48.037 GMT, 2021-01-13 13:07:49.449 GMT]].. Exception: null.
ReubenBond commented 3 years ago

The last message is what I was looking for. Either there is a total communication failure, or the process is indeed experiencing severe performance degradation.

SebFerraro commented 3 years ago

I had to scroll around 20 seconds up in the logfile itself to find the matching log:

[2021-01-13 13:07:59.145 GMT 338    Information Orleans.Messaging.Dispatcher.Forwarding Orleans.Messaging]  Trying to forward Request S172.30.88.21:22222:348148000*cli/14b226a7@94956590->S172.30.88.21:22222:348148000*grn/7EE4A879/00000000+8041314@2cd18fa7 #17826652[ForwardCount=3] from S172.30.88.21:22222:348148000*grn/7EE4A879/00000000+8041314@2cd18fa7 to (null) after Non-existent activation. Attempt 3  
[2021-01-13 13:07:59.145 GMT 338    Warning Orleans.Messaging.Dispatcher.ForwardingFailed   Orleans.Messaging]  Failed to forward message Request S172.30.88.21:22222:348148000*cli/14b226a7@94956590->S172.30.88.21:22222:348148000*grn/7EE4A879/00000000+8041314@2cd18fa7 #17826652[ForwardCount=3] from S172.30.88.21:22222:348148000*grn/7EE4A879/00000000+8041314@2cd18fa7 to (null) after Non-existent activation. Attempt 3  
[2021-01-13 13:07:59.225 GMT 581    Information 100645  Orleans.Runtime.MembershipService.MembershipTableManager]   ProcessTableUpdate (called from Refresh) membership table: 11 silos, 5 are Active, 6 are Dead, Version=<785, 785>. All silos: [SiloAddress=S172.30.88.23:22222:348142265 SiloName=Silo_35d63 Status=Active, SiloAddress=S172.30.88.22:22222:348226231 SiloName=Silo_3e440 Status=Active, SiloAddress=S172.30.88.24:22222:348136609 SiloName=Silo_53ce2 Status=Active, SiloAddress=S172.30.88.26:22222:348226337 SiloName=Silo_789d6 Status=Active, SiloAddress=S172.30.88.25:22222:348092452 SiloName=Silo_9b13c Status=Active, SiloAddress=S172.30.88.24:22222:348092451 SiloName=Silo_dc9e2 Status=Dead, SiloAddress=S172.30.88.25:22222:346788081 SiloName=Silo_70395 Status=Dead, SiloAddress=S172.30.88.23:22222:348092453 SiloName=Silo_71f08 Status=Dead, SiloAddress=S172.30.88.26:22222:348145832 SiloName=Silo_e41f9 Status=Dead, SiloAddress=S172.30.88.21:22222:348148000 SiloName=Silo_725e5 Status=Dead, SiloAddress=S172.30.88.22:22222:348144386 SiloName=Silo_af25c Status=Dead]   
[2021-01-13 13:07:59.225 GMT 581    Warning 100618  Orleans.Runtime.MembershipService.MembershipTableManager]   I should be Dead according to membership table (in CleanupTableEntries): entry = [SiloAddress=S172.30.88.21:22222:348148000 SiloName=Silo_725e5 Status=Dead HostName=VerUatBifrost01 ProxyPort=40000 RoleName= UpdateZone=0 FaultZone=0 StartTime = 2021-01-12 11:46:41.677 GMT IAmAliveTime = 2021-01-13 13:06:51.897 GMT Suspecters = [S172.30.88.23:22222:348142265, S172.30.88.25:22222:348092452] SuspectTimes = [2021-01-13 13:07:48.037 GMT, 2021-01-13 13:07:49.449 GMT]].  
[2021-01-13 13:07:59.225 GMT 581    Error   100627  Orleans.Runtime.MembershipService.MembershipTableManager]   !!!!!!!!!! I have been told I am dead, so this silo will stop! I should be Dead according to membership table (in CleanupTableEntries): entry = [SiloAddress=S172.30.88.21:22222:348148000 SiloName=Silo_725e5 Status=Dead HostName=VerUatBifrost01 ProxyPort=40000 RoleName= UpdateZone=0 FaultZone=0 StartTime = 2021-01-12 11:46:41.677 GMT IAmAliveTime = 2021-01-13 13:06:51.897 GMT Suspecters = [S172.30.88.23:22222:348142265, S172.30.88.25:22222:348092452] SuspectTimes = [2021-01-13 13:07:48.037 GMT, 2021-01-13 13:07:49.449 GMT]].   
[2021-01-13 13:07:59.225 GMT 581    Error   100002  Orleans.Runtime.FatalErrorHandler]  !!!!!!!!!! Fatal error from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! I should be Dead according to membership table (in CleanupTableEntries): entry = [SiloAddress=S172.30.88.21:22222:348148000 SiloName=Silo_725e5 Status=Dead HostName=VerUatBifrost01 ProxyPort=40000 RoleName= UpdateZone=0 FaultZone=0 StartTime = 2021-01-12 11:46:41.677 GMT IAmAliveTime = 2021-01-13 13:06:51.897 GMT Suspecters = [S172.30.88.23:22222:348142265, S172.30.88.25:22222:348092452] SuspectTimes = [2021-01-13 13:07:48.037 GMT, 2021-01-13 13:07:49.449 GMT]].   
SebFerraro commented 3 years ago

So it seems other silos have voted it out for some reason. I will try to investigate whether there's any other reason this could happen on our side. Checking Azure, the CPU/memory levels of the VM look fine.
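
For context, the eviction described here comes from cluster membership voting, which is configured through ClusterMembershipOptions. A minimal sketch of the relevant knobs follows; the values are illustrative only, not a recommendation, and the property names assume the Orleans 3.x API:

// requires: using Orleans.Configuration;
builder.Configure<ClusterMembershipOptions>(options =>
{
    // How many missed liveness probes before other silos start suspecting this one.
    options.NumMissedProbesLimit = 5;
    // How long a probe may take before it counts as missed.
    options.ProbeTimeout = TimeSpan.FromSeconds(10);
    // How many suspecting silos are needed to declare this silo dead.
    options.NumVotesForDeathDeclaration = 3;
});

Relaxing these makes the cluster slower to evict a stalled silo, but it does not address the underlying stall that made it unresponsive in the first place.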

ReubenBond commented 3 years ago

CPU usage % won't tell you much - thread pool starvation often causes apparently low CPU % (e.g., due to blocking). It might also be something else. A memory dump is likely the easiest way to verify. You could capture one and take a quick look in VS at the Parallel Stacks and Parallel Tasks windows at the time of the crash. You may also be able to see what interceptors are there. For example, that Dashboard package adds interceptors, so there are at least those.
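
One cheap way to watch for thread pool starvation without waiting for a dump (not something suggested in the thread, just a sketch for .NET Framework 4.8, where ThreadPool.ThreadCount is not available) is to periodically log the pool's headroom:

using System;
using System.Threading;

// Hypothetical helper, illustrative only: logs how many thread pool threads are in use.
// A persistently exhausted worker count is a strong hint of starvation.
public static class ThreadPoolMonitor
{
    public static Timer Start(Action<string> log, TimeSpan period)
    {
        return new Timer(_ =>
        {
            ThreadPool.GetAvailableThreads(out int workerFree, out int ioFree);
            ThreadPool.GetMaxThreads(out int workerMax, out int ioMax);
            log($"ThreadPool busy workers: {workerMax - workerFree}/{workerMax}, busy IO: {ioMax - ioFree}/{ioMax}");
        }, null, TimeSpan.Zero, period);
    }
}

It would be started once at silo startup, e.g. ThreadPoolMonitor.Start(msg => Console.WriteLine(msg), TimeSpan.FromSeconds(5)), keeping the returned Timer alive for the life of the process.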

SebFerraro commented 3 years ago

I'm really struggling with this. Each silo crashes maybe once in a 24-hour period, and it's becoming apparent that it's almost impossible to get anything from it, so I'm sort of stumped as to how to progress this any further right now.

ghost commented 2 years ago

We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.

ghost commented 2 years ago

This issue has been marked stale for the past 30 days and is being closed due to lack of activity.