SebFerraro closed this issue 2 years ago
Could you please include your Orleans version as well as details about how your cluster is configured? Redacted config code is good for the latter.
Oh of course, apologies:
Orleans v3.3.0 on .NET Framework 4.8
I assume this is what you mean by cluster configuration:
builder.Configure&lt;ClusterOptions&gt;(
        options =>
        {
            options.ClusterId = Constants.DeploymentName;
            options.ServiceId = "BifrostService";
        })
    .UseDashboard(options =>
    {
        options.Host = "*";
        options.Port = 8080;
        options.HostSelf = true;
        options.CounterUpdateIntervalMs = 1000;
    });

const string invariant = "System.Data.SqlClient";
var connectionString = Constants.GetConnectionString();

builder
    .ConfigureEndpoints(22222, 40000)
    .UseAdoNetClustering(options =>
    {
        options.Invariant = invariant;
        options.ConnectionString = connectionString;
    })
    .UseAdoNetReminderService(options =>
    {
        options.Invariant = invariant;
        options.ConnectionString = connectionString;
    })
    .UseSiloUnobservedExceptionsHandler();

builder.AddSimpleMessageStreamProvider(
        "SMSProvider",
        options =>
        {
            options.FireAndForgetDelivery = true;
            options.OptimizeForImmutableData = true;
        })
    .Configure&lt;ClientMessagingOptions&gt;(options =>
    {
        options.ResponseTimeout = TimeSpan.FromSeconds(10);
        options.ResponseTimeoutWithDebugger = TimeSpan.FromSeconds(10);
    })
    .AddMemoryGrainStorageAsDefault()
    .AddMemoryGrainStorage("PubSubStore")
    .UseServiceProviderFactory(new AutofacServiceProviderFactory())
    .ConfigureContainer&lt;ContainerBuilder&gt;(this.ConfigureContainer)
    .ConfigureLogging(this.ConfigureLogging)
    .ConfigureApplicationParts(parts =>
        parts.AddApplicationPart(typeof(UnitGrain).Assembly)
             .AddApplicationPart(typeof(IUnitGrain).Assembly));

this.siloHost = builder.Build();
Are there any other components being configured, such as grain call filters? Are you able to provide more logs? Some of the errors you're seeing appear to be the sort of thing you'd see if you were having severe performance degradation. If you upgrade to Orleans v3.4.0, then you should see extra warning logs if you do have severe perf issues (such as caused by thread pool starvation).
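For reference, a grain call filter is just an implementation of Orleans's IIncomingGrainCallFilter. A minimal sketch of one that surfaces slow grain calls (the filter name, threshold, and logging choices are illustrative, not from this issue):

```csharp
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Orleans;

// Hypothetical filter: times every incoming grain call and warns when
// a call takes longer than an (arbitrary) 500 ms threshold.
public class SlowCallLoggingFilter : IIncomingGrainCallFilter
{
    private readonly ILogger<SlowCallLoggingFilter> log;

    public SlowCallLoggingFilter(ILogger<SlowCallLoggingFilter> log) => this.log = log;

    public async Task Invoke(IIncomingGrainCallContext context)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await context.Invoke(); // continue the call chain
        }
        finally
        {
            if (stopwatch.ElapsedMilliseconds > 500)
            {
                this.log.LogWarning(
                    "Slow grain call: {Grain}.{Method} took {Elapsed} ms",
                    context.Grain?.GetType().Name,
                    context.ImplementationMethod?.Name,
                    stopwatch.ElapsedMilliseconds);
            }
        }
    }
}

// Registered on the silo builder:
// builder.AddIncomingGrainCallFilter<SlowCallLoggingFilter>();
```

A filter like this can help confirm whether calls are genuinely slow before a crash, independent of whatever the dashboard's own interceptors report.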
I'll update to that right away. I only pushed the update to 3.3.0 a week or so back and didn't realise 3.4.0 was so close. I'll report back with errors once I've done that. I can provide the entire logfile (~300 MB) for the latest crash if that helps at all; it's roughly 24 hours' worth of logs, I think.
When the silo crashes, are there any last log messages printed? Do you have a crash dump that I could analyze?
Unfortunately the last logs are the following:
[2021-01-13 13:08:20.503 GMT 319 Information 100147 Orleans.Runtime.Dispatcher] Intermediate NonExistentActivation for message OneWay S172.30.88.21:22222:348148000*cli/30873788@06548493->S172.30.88.21:22222:348148000*grn/FFE8F7B3/00000000@cc474c95 #59082119[ForwardCount=1], with Exception Orleans.Runtime.Catalog+NonExistentActivationException: Non-existent activation: [ActivationAddress: S172.30.88.21:22222:348148000*grn/FFE8F7B3/00000000@cc474c95, Full GrainId: [GrainId: *grn/FFE8F7B3/0000000000000000000000000000000003ffffffffe8f7b3-0x43F93686, IdCategory: Grain, BaseTypeCode: -1509453 (xFFE8F7B3), PrimaryKey: 0 (x0), UniformHashCode: 1140405894 (0x43F93686)], Full ActivationId: @4f8f45041cf6b69ba6841b68cc474c950000000000000000], grain type: OrleansDashboard.DashboardGrain.
at Orleans.Runtime.Catalog.GetOrCreateActivation(ActivationAddress address, Boolean newPlacement, String grainType, String genericArguments, Dictionary`2 requestContextData, Task& activatedPromise)
at Orleans.Runtime.Dispatcher.ReceiveMessage(Message message)
[2021-01-13 13:08:20.503 GMT 267 Information Orleans.Messaging.Dispatcher.Forwarding Orleans.Messaging]
Basically just cuts off. Last crash dump indicates it is being told to shut down:
Message: FATAL EXCEPTION from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! I should be Dead according to membership table (in CleanupTableEntries): entry = [SiloAddress=S172.30.88.21:22222:348148000 SiloName=Silo_725e5 Status=Dead HostName=VerUatBifrost01 ProxyPort=40000 RoleName= UpdateZone=0 FaultZone=0 StartTime = 2021-01-12 11:46:41.677 GMT IAmAliveTime = 2021-01-13 13:06:51.897 GMT Suspecters = [S172.30.88.23:22222:348142265, S172.30.88.25:22222:348092452] SuspectTimes = [2021-01-13 13:07:48.037 GMT, 2021-01-13 13:07:49.449 GMT]].. Exception: null.
The last message is what I was looking for. Either there is a total communication failure, or the process is indeed experiencing severe performance degradation.
I had to scroll back around 20 seconds in the logfile itself to find the matching entries:
[2021-01-13 13:07:59.145 GMT 338 Information Orleans.Messaging.Dispatcher.Forwarding Orleans.Messaging] Trying to forward Request S172.30.88.21:22222:348148000*cli/14b226a7@94956590->S172.30.88.21:22222:348148000*grn/7EE4A879/00000000+8041314@2cd18fa7 #17826652[ForwardCount=3] from S172.30.88.21:22222:348148000*grn/7EE4A879/00000000+8041314@2cd18fa7 to (null) after Non-existent activation. Attempt 3
[2021-01-13 13:07:59.145 GMT 338 Warning Orleans.Messaging.Dispatcher.ForwardingFailed Orleans.Messaging] Failed to forward message Request S172.30.88.21:22222:348148000*cli/14b226a7@94956590->S172.30.88.21:22222:348148000*grn/7EE4A879/00000000+8041314@2cd18fa7 #17826652[ForwardCount=3] from S172.30.88.21:22222:348148000*grn/7EE4A879/00000000+8041314@2cd18fa7 to (null) after Non-existent activation. Attempt 3
[2021-01-13 13:07:59.225 GMT 581 Information 100645 Orleans.Runtime.MembershipService.MembershipTableManager] ProcessTableUpdate (called from Refresh) membership table: 11 silos, 5 are Active, 6 are Dead, Version=<785, 785>. All silos: [SiloAddress=S172.30.88.23:22222:348142265 SiloName=Silo_35d63 Status=Active, SiloAddress=S172.30.88.22:22222:348226231 SiloName=Silo_3e440 Status=Active, SiloAddress=S172.30.88.24:22222:348136609 SiloName=Silo_53ce2 Status=Active, SiloAddress=S172.30.88.26:22222:348226337 SiloName=Silo_789d6 Status=Active, SiloAddress=S172.30.88.25:22222:348092452 SiloName=Silo_9b13c Status=Active, SiloAddress=S172.30.88.24:22222:348092451 SiloName=Silo_dc9e2 Status=Dead, SiloAddress=S172.30.88.25:22222:346788081 SiloName=Silo_70395 Status=Dead, SiloAddress=S172.30.88.23:22222:348092453 SiloName=Silo_71f08 Status=Dead, SiloAddress=S172.30.88.26:22222:348145832 SiloName=Silo_e41f9 Status=Dead, SiloAddress=S172.30.88.21:22222:348148000 SiloName=Silo_725e5 Status=Dead, SiloAddress=S172.30.88.22:22222:348144386 SiloName=Silo_af25c Status=Dead]
[2021-01-13 13:07:59.225 GMT 581 Warning 100618 Orleans.Runtime.MembershipService.MembershipTableManager] I should be Dead according to membership table (in CleanupTableEntries): entry = [SiloAddress=S172.30.88.21:22222:348148000 SiloName=Silo_725e5 Status=Dead HostName=VerUatBifrost01 ProxyPort=40000 RoleName= UpdateZone=0 FaultZone=0 StartTime = 2021-01-12 11:46:41.677 GMT IAmAliveTime = 2021-01-13 13:06:51.897 GMT Suspecters = [S172.30.88.23:22222:348142265, S172.30.88.25:22222:348092452] SuspectTimes = [2021-01-13 13:07:48.037 GMT, 2021-01-13 13:07:49.449 GMT]].
[2021-01-13 13:07:59.225 GMT 581 Error 100627 Orleans.Runtime.MembershipService.MembershipTableManager] !!!!!!!!!! I have been told I am dead, so this silo will stop! I should be Dead according to membership table (in CleanupTableEntries): entry = [SiloAddress=S172.30.88.21:22222:348148000 SiloName=Silo_725e5 Status=Dead HostName=VerUatBifrost01 ProxyPort=40000 RoleName= UpdateZone=0 FaultZone=0 StartTime = 2021-01-12 11:46:41.677 GMT IAmAliveTime = 2021-01-13 13:06:51.897 GMT Suspecters = [S172.30.88.23:22222:348142265, S172.30.88.25:22222:348092452] SuspectTimes = [2021-01-13 13:07:48.037 GMT, 2021-01-13 13:07:49.449 GMT]].
[2021-01-13 13:07:59.225 GMT 581 Error 100002 Orleans.Runtime.FatalErrorHandler] !!!!!!!!!! Fatal error from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! I should be Dead according to membership table (in CleanupTableEntries): entry = [SiloAddress=S172.30.88.21:22222:348148000 SiloName=Silo_725e5 Status=Dead HostName=VerUatBifrost01 ProxyPort=40000 RoleName= UpdateZone=0 FaultZone=0 StartTime = 2021-01-12 11:46:41.677 GMT IAmAliveTime = 2021-01-13 13:06:51.897 GMT Suspecters = [S172.30.88.23:22222:348142265, S172.30.88.25:22222:348092452] SuspectTimes = [2021-01-13 13:07:48.037 GMT, 2021-01-13 13:07:49.449 GMT]].
So it seems other silos have voted it out for some reason. I will investigate whether there's anything on our side that could cause this. Checking Azure, the CPU/memory levels of the VM look fine.
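For reference, this vote-out behaviour is governed by Orleans's cluster membership settings. A hedged sketch of making the failure detector more tolerant while diagnosing (option names from Orleans 3.x ClusterMembershipOptions as I understand them; the values are illustrative and worth verifying against your version before relying on them):

```csharp
using System;
using Orleans.Configuration;

// Loosen failure detection so transient stalls are less likely to get
// a silo declared dead while the root cause is being investigated.
builder.Configure<ClusterMembershipOptions>(options =>
{
    // Allow more missed probes before other silos suspect this one.
    options.NumMissedProbesLimit = 5;

    // Require more suspecting silos before a death declaration.
    options.NumVotesForDeathDeclaration = 3;

    // Give each probe longer to complete on a heavily loaded host.
    options.ProbeTimeout = TimeSpan.FromSeconds(10);
});
```

Note this only masks the symptom: if the silo is genuinely unresponsive (e.g. thread pool starvation), relaxing these values delays eviction rather than fixing it.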
CPU Usage % won't tell you much - thread pool starvation often causes apparently low CPU % (e.g., due to blocking). It might also be something else. A memory dump is likely the easiest way to verify. You could capture one and take a quick look in Visual Studio at the Parallel Threads and Parallel Tasks views at the time of the crash. You may also be able to see which interceptors are present. For example, the Dashboard package adds interceptors, so there are at least those.
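If it helps, a minimal sketch (standard .NET ThreadPool APIs, usable on .NET Framework 4.8) that periodically logs thread pool availability; a worker count that stays near zero for long stretches is a strong hint of starvation:

```csharp
using System;
using System.Threading;

// Illustrative monitor: logs available vs. maximum thread pool threads
// on a timer so starvation shows up in the logs before a crash.
public static class ThreadPoolMonitor
{
    public static Timer Start(TimeSpan interval)
    {
        return new Timer(_ =>
        {
            ThreadPool.GetAvailableThreads(out int workers, out int io);
            ThreadPool.GetMaxThreads(out int maxWorkers, out int maxIo);
            Console.WriteLine(
                $"[{DateTime.UtcNow:O}] ThreadPool available: " +
                $"{workers}/{maxWorkers} workers, {io}/{maxIo} IO");
        }, null, TimeSpan.Zero, interval);
    }
}

// Usage (e.g. at silo startup): ThreadPoolMonitor.Start(TimeSpan.FromSeconds(5));
```

Correlating these lines with the membership warnings above would show whether the pool was exhausted when the other silos started suspecting this one.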
I'm really struggling with this. Each silo crashes maybe once in a 24-hour period, and it's proving almost impossible to get anything useful from it, so I'm stumped as to how to progress this any further right now.
We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.
This issue has been marked stale for the past 30 days and is being closed due to lack of activity.
We have seen our UAT cluster of 6 silos experience many RegisterActivationInGrainDirectory exceptions for around 30 seconds or so, and then crash.
It is seemingly random:
https://gist.github.com/SebFerraro/e4220dcb35fd680dee2d1b2438f4a318