dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

LocalSiloHealthMonitor doesn't terminate node #8186

Open christallire opened 1 year ago

christallire commented 1 year ago

Hello, I've upgraded to Orleans 7 and am seeing some odd behavior. One issue is that LocalSiloHealthMonitor does not kill the silo.

I was trying to narrow down why the node fails to respond to probes after the upgrade, but this is worse: the node is not restarted automatically, so the service stays down until I restart the node manually. :|

{"@timestamp":"2022-11-24T04:59:59.6208442+00:00","log.level":"Warning","message":"This silo has not received a probe request since 11/23/2022 00:06:06","metadata":{"message_template":"This silo has not received a probe request since {LastProbeRequest}","last_probe_request":"2022-11-23T00:06:06.6389545Z"},"ecs":{"version":"1.5.0"},"event":{"severity":3,"timezone":"Coordinated Universal Time","created":"2022-11-24T04:59:59.6208442+00:00"},"log":{"logger":"Orleans.Runtime.MembershipService.LocalSiloHealthMonitor","original":null},"process":{"thread":{"id":28},"pid":1,"name":"dotnet","executable":"dotnet"}}

Note the times: the log entry was written at 2022-11-24T04:59:59.6208442 and the last probe was at 2022-11-23T00:06:06.6389545Z. The node has not been killed for almost 29 hours and just logs the same message over and over.

This is the silo's row in OrleansMembershipTable from SQL Server:

DeploymentId:  grey
Address:       10.0.11.180
Port:          11111
Generation:    28113443
SiloName:      product-service-65bf5f9847-285bj
HostName:      product-service-65bf5f9847-285bj
Status:        6
ProxyPort:     30000
SuspectTimes:  10.0.16.24:11111@28113661,2022-11-23 00:05:53.019 GMT
StartTime:     2022-11-22 09:17:24.060
IAmAliveTime:  2022-11-23 00:02:27.830

From the doc:

https://learn.microsoft.com/en-us/dotnet/orleans/deployment/kubernetes

Orleans uses a cluster membership protocol to promptly detect and recover from a process or network failures. Each node monitors a subset of other nodes, sending periodic probes. If a node fails to respond to multiple successive probes from multiple other nodes, then it will be forcibly removed from the cluster. Once a failed node learns that it has been removed, it terminates immediately. Kubernetes will restart the terminated process and it will attempt to rejoin the cluster.

Is there something I'm missing to get the node terminated and restarted in Orleans 7?

christallire commented 1 year ago

@ReubenBond could you please take a look at this?

ReubenBond commented 1 year ago

LocalSiloHealthMonitor is not supposed to kill the silo. It's used to warn you about issues (connectivity, thread pool, etc) and prevent unhealthy silos from voting healthy silos out of the cluster.

Silos are only kicked out of the cluster by other silos, after a number of consecutively failed probes.
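To make the eviction rule concrete, here is a minimal sketch of "consecutive failed probes from multiple monitors" as a toy model. It is not the Orleans implementation: the `EvictionTracker` type, its parameters, and the silo names are hypothetical. In Orleans the corresponding thresholds are configured through `ClusterMembershipOptions` (for example `NumMissedProbesLimit` and `NumVotesForDeathDeclaration`), and votes also expire over time, which this model ignores.

```csharp
using System;
using System.Collections.Generic;

// Demo: two monitoring silos each miss three consecutive probes to the same target.
var tracker = new EvictionTracker(missedProbesLimit: 3, votesForDeath: 2);
foreach (var monitor in new[] { "silo-A", "silo-B" })
{
    for (var i = 0; i < 3; i++)
    {
        tracker.RecordProbe(monitor, succeeded: false);
    }
}

Console.WriteLine(tracker.IsDeclaredDead); // prints "True"

// Toy model of the rule described above (hypothetical type, not an Orleans API).
public sealed class EvictionTracker
{
    private readonly int _missedProbesLimit;
    private readonly int _votesForDeath;
    private readonly Dictionary<string, int> _consecutiveMisses = new();
    private readonly HashSet<string> _suspecters = new();

    public EvictionTracker(int missedProbesLimit, int votesForDeath)
    {
        _missedProbesLimit = missedProbesLimit;
        _votesForDeath = votesForDeath;
    }

    // The target is considered dead once enough distinct monitors suspect it.
    public bool IsDeclaredDead => _suspecters.Count >= _votesForDeath;

    public void RecordProbe(string monitor, bool succeeded)
    {
        if (succeeded)
        {
            // A successful probe resets this monitor's consecutive-miss counter.
            _consecutiveMisses[monitor] = 0;
            return;
        }

        _consecutiveMisses.TryGetValue(monitor, out var misses);
        misses++;
        _consecutiveMisses[monitor] = misses;

        // Simplification: a vote, once cast, is never withdrawn or expired.
        if (misses >= _missedProbesLimit)
        {
            _suspecters.Add(monitor);
        }
    }
}
```

The point of the model is that the decision is made by the monitoring silos, not by the probed silo itself, which is why a local health warning alone does not lead to termination.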

In your case, are you trying a rolling upgrade from 3.x to 7.0? How many silos are in the cluster?

christallire commented 1 year ago

I have approx. 40 silos.

Silos are only kicked out of the cluster by other silos, after a number of consecutively failed probes.

Oh really? I thought the silo crashed itself in 3.0, because I never had this issue before: the pod just silently restarted.

So, according to your message, the silo probably stopped receiving pings because it stalled for a period of time for some reason (GC, high utilization, bugs) and got stuck there. Hmm.

ReubenBond commented 1 year ago

The logs will provide more insight into what's happening. We can help you to diagnose the issue using logs. Generally, silos learn they have been evicted by reading the membership table. Upon seeing that they have been evicted, the silo process will crash itself via Environment.FailFast.

In this case, your silo has been marked dead (Status = 6 and you see the silo which evicted it listed there) but possibly has not yet refreshed its membership to learn of that fact. Perhaps the entire process has locked up for some reason. Logs and possibly a memory dump would help to identify what's actually happening. If the host process has completely frozen (which may not be the case here), then no code running in the process will be able to terminate it. In that case, the Kubernetes hosting package can allow other silos to delete the silo's pod from Kubernetes once it's been evicted from the cluster and/or you can have a local Kubernetes liveness probe return a simple 200 OK to ensure the process is actually alive.
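As an illustration of the "simple 200 OK" liveness endpoint mentioned above, here is a minimal sketch using ASP.NET Core minimal APIs. It assumes the silo process also hosts an ASP.NET Core app (project using the Microsoft.NET.Sdk.Web SDK); the `/healthz` path is an arbitrary choice, and the Kubernetes `livenessProbe` would have to be pointed at it separately.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var builder = WebApplication.CreateBuilder(args);

// The Orleans silo would be registered on builder.Host here (omitted in this sketch).

var app = builder.Build();

// Deliberately trivial: the endpoint only proves that the process can still
// accept a request and run code. It does not inspect Orleans membership state.
app.MapGet("/healthz", () => Results.Ok());

app.Run();
```

If the process is completely frozen, the probe stops being answered and Kubernetes restarts the container. A process that is still able to serve requests will keep passing this check, so it does not replace the eviction-driven shutdown.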

Logs are the first avenue to investigate.

christallire commented 1 year ago

Okay, I've managed to narrow it down and this is very interesting.

It seems Environment.FailFast doesn't work.

Here's what I did:

1) Found that Environment.FailFast in FatalErrorHandler.cs doesn't work: the "FATAL EXCEPTION from ..." message is printed to the console, but other threads keep running (in particular, LocalSiloHealthMonitor keeps spamming the log after Environment.FailFast).

2) Implemented my own FatalErrorHandler to see what is exactly wrong, like below:

        // Allow some time for loggers to flush.
        Console.Error.WriteLine("FATAL EXCEPTION: BEFORE SLEEP");
        Thread.Sleep(2000);
        Console.Error.WriteLine("FATAL EXCEPTION: AFTER SLEEP");

        if (Debugger.IsAttached) Debugger.Break();

        Console.Error.WriteLine("FATAL EXCEPTION: BEFORE FAIL FAST");
        Environment.FailFast(msg, exception);
        Console.Error.WriteLine("FATAL EXCEPTION: AM I STILL ALIVE?");

The result was the same: I got FATAL EXCEPTION: BEFORE FAIL FAST but not FATAL EXCEPTION: AM I STILL ALIVE?.

$ kubectl logs -f account-service-5b8fcc48c9-cqzvk app | grep FATAL
FATAL ERROR HANDLER INITIATED.
FATAL EXCEPTION from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! Reason: I should be Dead according to membership table (in CleanupTableEntries): entry = [SiloAddress=S10.0.15.87:11111:32263975 SiloName=account-service-5b8fcc48c9-cqzvk Status=Dead HostName=account-service-5b8fcc48c9-cqzvk ProxyPort=30000 RoleName= UpdateZone=0 FaultZone=0 StartTime=2023-01-09 10:12:57.437 GMT IAmAliveTime=2023-01-09 10:13:07.723 GMT Suspecters=[S10.0.9.100:11111:32264074] SuspectTimes=[2023-01-09 10:14:36.461 GMT]].. Exception: null.\nCurrent stack:    at System.Environment.get_StackTrace()
FATAL EXCEPTION: BEFORE SLEEP
FATAL EXCEPTION: AFTER SLEEP
FATAL EXCEPTION: BEFORE FAIL FAST
Process terminated. FATAL EXCEPTION from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! Reason: I should be Dead according to membership table (in CleanupTableEntries): entry = [SiloAddress=S10.0.15.87:11111:32263975 SiloName=account-service-5b8fcc48c9-cqzvk Status=Dead HostName=account-service-5b8fcc48c9-cqzvk ProxyPort=30000 RoleName= UpdateZone=0 FaultZone=0 StartTime=2023-01-09 10:12:57.437 GMT IAmAliveTime=2023-01-09 10:13:07.723 GMT Suspecters=[S10.0.9.100:11111:32264074] SuspectTimes=[2023-01-09 10:14:36.461 GMT]].. Exception: null.\nCurrent stack:    at System.Environment.get_StackTrace()
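One way to check whether this behavior depends on Orleans at all would be a standalone console program that calls Environment.FailFast while a background thread keeps logging, run in the same container image. The sketch below is only a suggested diagnostic, not something from the Orleans codebase; the messages and timings are arbitrary.

```csharp
using System;
using System.Threading;

// Background thread standing in for the runtime work that keeps logging.
var heartbeat = new Thread(() =>
{
    while (true)
    {
        Console.Error.WriteLine($"{DateTime.UtcNow:O} heartbeat: process is still running");
        Thread.Sleep(500);
    }
})
{
    IsBackground = true
};
heartbeat.Start();

// Let a few heartbeats through so the "before" state is visible in the log.
Thread.Sleep(2000);

Console.Error.WriteLine("REPRO: BEFORE FAIL FAST");
Environment.FailFast("REPRO: deliberate FailFast", null);

// Not expected to be reached if FailFast terminates the process.
Console.Error.WriteLine("REPRO: AM I STILL ALIVE?");
```

If heartbeat lines keep appearing after "REPRO: BEFORE FAIL FAST", the problem reproduces without Orleans and points at the runtime or container environment. If the process exits promptly, the cause is more likely specific to the service.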

(I get the CleanupTableEntries message and also two other kinds of "dead" messages.)

But the process still spams the log:

[10:17:27 ERR] Could not deliver reminder tick for [optimizationReminder, productoptionoptimization/157, 00:30:00, 2023-01-09 06:17:27.783 GMT, 1780, 3695, Ticking], next 01/09/2023 10:47:27.
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.0.9.100:11111:32264074, will retry after 198.1379ms
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 90
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 75
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 76
   at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 739
   at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 30
   at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 357
   at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync|40_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 448
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 90
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 75
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync(GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 83
   at Orleans.Runtime.ReminderService.LocalReminderService.LocalReminderData.OnTimerTick() in /_/src/Orleans.Reminders/ReminderService/LocalReminderService.cs:line 714

or

[10:18:58 WRN] This silo is not active (Status: Dead) and is therefore not healthy.
[10:18:58 WRN] Self-monitoring determined that local health is degraded. Degradation score is 8/8 (lower is better). Complaints: This silo is not active (Status: Dead and is therefore not healthy.
[10:19:01 INF] Establishing connection to endpoint S10.0.9.100:11111:32264074
[10:19:01 INF] Establishing connection to endpoint S10.0.9.48:11111:32264076

or

[10:40:17 WRN] Error retrieving silo manifest for silo S10.0.9.48:11111:32264076
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.0.9.48:11111:32264076. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.48:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 90
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 75
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 76
   at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 163
[10:40:19 WRN] This silo is not active (Status: Dead) and is therefore not healthy.
[10:40:19 WRN] Self-monitoring determined that local health is degraded. Degradation score is 8/8 (lower is better). Complaints: This silo is not active (Status: Dead and is therefore not healthy.
[10:40:21 INF] Application is shutting down...
[10:40:21 INF] Stopping Orleans Silo
[10:40:21 INF] Stopping Orleans.Runtime.ReminderService.LocalReminderService grain service
[10:40:22 INF] Establishing connection to endpoint S10.0.9.100:11111:32264074
[10:40:22 INF] Establishing connection to endpoint S10.0.9.48:11111:32264076
[10:40:26 WRN] Connection attempt to endpoint S10.0.9.48:11111:32264076 failed
Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.48:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
[10:40:26 WRN] Connection attempt to endpoint S10.0.9.100:11111:32264074 failed
Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.100:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
[10:40:26 WRN] Error retrieving silo manifest for silo S10.0.9.100:11111:32264074
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.0.9.100:11111:32264074. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.100:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 90
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 75
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 76
   at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 163
[10:40:26 WRN] Error retrieving silo manifest for silo S10.0.9.48:11111:32264076
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.0.9.48:11111:32264076. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.48:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 90
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 75
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 76
   at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 163
[10:40:26 WRN] I should be Dead according to membership table (in TryUpdateMyStatusGlobalOnce): Entry = [SiloAddress=S10.0.15.87:11111:32263975 SiloName=account-service-5b8fcc48c9-cqzvk Status=Dead HostName=account-service-5b8fcc48c9-cqzvk ProxyPort=30000 RoleName= UpdateZone=0 FaultZone=0 StartTime=2023-01-09 10:12:57.437 GMT IAmAliveTime=2023-01-09 10:13:07.723 GMT Suspecters=[S10.0.9.100:11111:32264074] SuspectTimes=[2023-01-09 10:14:36.461 GMT]].
[10:40:26 ERR] I have been told I am dead, so this silo will stop! Reason: I should be Dead according to membership table (in TryUpdateMyStatusGlobalOnce): Entry = [SiloAddress=S10.0.15.87:11111:32263975 SiloName=account-service-5b8fcc48c9-cqzvk Status=Dead HostName=account-service-5b8fcc48c9-cqzvk ProxyPort=30000 RoleName= UpdateZone=0 FaultZone=0 StartTime=2023-01-09 10:12:57.437 GMT IAmAliveTime=2023-01-09 10:13:07.723 GMT Suspecters=[S10.0.9.100:11111:32264074] SuspectTimes=[2023-01-09 10:14:36.461 GMT]].
[10:40:26 ERR] Fatal error from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! Reason: I should be Dead according to membership table (in TryUpdateMyStatusGlobalOnce): Entry = [SiloAddress=S10.0.15.87:11111:32263975 SiloName=account-service-5b8fcc48c9-cqzvk Status=Dead HostName=account-service-5b8fcc48c9-cqzvk ProxyPort=30000 RoleName= UpdateZone=0 FaultZone=0 StartTime=2023-01-09 10:12:57.437 GMT IAmAliveTime=2023-01-09 10:13:07.723 GMT Suspecters=[S10.0.9.100:11111:32264074] SuspectTimes=[2023-01-09 10:14:36.461 GMT]].
FATAL EXCEPTION from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! Reason: I should be Dead according to membership table (in TryUpdateMyStatusGlobalOnce): Entry = [SiloAddress=S10.0.15.87:11111:32263975 SiloName=account-service-5b8fcc48c9-cqzvk Status=Dead HostName=account-service-5b8fcc48c9-cqzvk ProxyPort=30000 RoleName= UpdateZone=0 FaultZone=0 StartTime=2023-01-09 10:12:57.437 GMT IAmAliveTime=2023-01-09 10:13:07.723 GMT Suspecters=[S10.0.9.100:11111:32264074] SuspectTimes=[2023-01-09 10:14:36.461 GMT]].. Exception: null.\nCurrent stack:    at System.Environment.get_StackTrace()
   at Grey.MicroserviceFramework.ErrorHandler.FatalErrorHandler.OnFatalException(Object sender, String context, Exception exception) in /src/Grey.MicroserviceFramework/ErrorHandler/FatalErrorHandler.cs:line 45
   at Orleans.Runtime.MembershipService.MembershipTableManager.KillMyselfLocally(String reason) in /_/src/Orleans.Runtime/MembershipService/MembershipTableManager.cs:line 618
   at Orleans.Runtime.MembershipService.MembershipTableManager.TryUpdateMyStatusGlobalOnce(SiloStatus newStatus) in /_/src/Orleans.Runtime/MembershipService/MembershipTableManager.cs:line 420
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.Tasks.TaskSchedulerAwaitTaskContinuation.<>c.<Run>b__2_0(Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
   at System.Threading.Tasks.Task.ExecuteEntry()
   at Orleans.Runtime.Scheduler.ActivationTaskScheduler.TryExecuteTaskInline(Task task, Boolean taskWasPreviouslyQueued) in /_/src/Orleans.Runtime/Scheduler/ActivationTaskScheduler.cs:line 117
   at System.Threading.Tasks.TaskScheduler.TryRunInline(Task task, Boolean taskWasPreviouslyQueued)
   at System.Threading.Tasks.TaskContinuation.InlineIfPossibleOrElseQueue(Task task, Boolean needsProtection)
   at System.Threading.Tasks.TaskSchedulerAwaitTaskContinuation.Run(Task ignored, Boolean canInlineContinuationTask)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetExistingTaskResult(Task`1 task, TResult result)
   at Orleans.Runtime.MembershipService.AdoNetClusteringTable.ReadAll() in /_/src/AdoNet/Orleans.Clustering.AdoNet/Messaging/AdoNetClusteringTable.cs:line 83
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.Tasks.TaskSchedulerAwaitTaskContinuation.<>c.<Run>b__2_0(Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
   at System.Threading.Tasks.Task.ExecuteEntry()
   at Orleans.Runtime.Scheduler.ActivationTaskScheduler.TryExecuteTaskInline(Task task, Boolean taskWasPreviouslyQueued) in /_/src/Orleans.Runtime/Scheduler/ActivationTaskScheduler.cs:line 117
   at System.Threading.Tasks.TaskScheduler.TryRunInline(Task task, Boolean taskWasPreviouslyQueued)
   at System.Threading.Tasks.TaskContinuation.InlineIfPossibleOrElseQueue(Task task, Boolean needsProtection)
   at System.Threading.Tasks.TaskSchedulerAwaitTaskContinuation.Run(Task ignored, Boolean canInlineContinuationTask)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetExistingTaskResult(Task`1 task, TResult result)
   at Orleans.Clustering.AdoNet.Storage.RelationalOrleansQueries.ReadAsync[TResult,TAggregate](String query, Func`2 selector, Func`2 parameterProvider, Func`2 aggregator) in /_/src/AdoNet/Shared/Storage/RelationalOrleansQueries.cs:line 86
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
   at System.Threading.Tasks.TaskSchedulerAwaitTaskContinuation.<>c.<Run>b__2_0(Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
   at System.Threading.Tasks.Task.ExecuteEntry()
   at Orleans.Runtime.Scheduler.ActivationTaskScheduler.RunTask(Task task) in /_/src/Orleans.Runtime/Scheduler/ActivationTaskScheduler.cs:line 42
   at Orleans.Runtime.Scheduler.WorkItemGroup.Execute() in /_/src/Orleans.Runtime/Scheduler/WorkItemGroup.cs:line 207
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
FATAL EXCEPTION: BEFORE SLEEP
FATAL EXCEPTION: AFTER SLEEP
FATAL EXCEPTION: BEFORE FAIL FAST
[10:40:31 INF] Establishing connection to endpoint S10.0.9.100:11111:32264074
[10:40:31 INF] Establishing connection to endpoint S10.0.9.48:11111:32264076
[10:40:32 WRN] Connection attempt to endpoint S10.0.9.48:11111:32264076 failed
Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.48:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
[10:40:32 WRN] Connection attempt to endpoint S10.0.9.100:11111:32264074 failed
Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.100:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
[10:40:32 WRN] Error retrieving silo manifest for silo S10.0.9.100:11111:32264074
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.0.9.100:11111:32264074. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.100:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 90
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 75
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 76
   at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 163
[10:40:32 WRN] Error retrieving silo manifest for silo S10.0.9.48:11111:32264076
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.0.9.48:11111:32264076. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.48:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 90
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 75
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 76
   at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 163
[10:40:37 INF] Establishing connection to endpoint S10.0.9.100:11111:32264074
[10:40:37 INF] Establishing connection to endpoint S10.0.9.48:11111:32264076
[10:40:37 WRN] Connection attempt to endpoint S10.0.9.48:11111:32264076 failed
Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.48:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
[10:40:37 WRN] Connection attempt to endpoint S10.0.9.100:11111:32264074 failed
Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.100:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
[10:40:37 WRN] Error retrieving silo manifest for silo S10.0.9.100:11111:32264074
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.0.9.100:11111:32264074. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.100:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 90
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 75
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 76
   at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 163
[10:40:37 WRN] Error retrieving silo manifest for silo S10.0.9.48:11111:32264076
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.0.9.48:11111:32264076. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.48:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 90
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 75
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 76
   at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 163
[10:40:42 INF] Establishing connection to endpoint S10.0.9.100:11111:32264074
[10:40:42 INF] Establishing connection to endpoint S10.0.9.48:11111:32264076
[10:40:45 WRN] Connection attempt to endpoint S10.0.9.48:11111:32264076 failed
Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.48:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
[10:40:45 WRN] Connection attempt to endpoint S10.0.9.100:11111:32264074 failed
Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.100:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
[10:40:45 WRN] Error retrieving silo manifest for silo S10.0.9.100:11111:32264074
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.0.9.100:11111:32264074. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.100:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 90
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 75
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 76
   at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 163
[10:40:45 WRN] Error retrieving silo manifest for silo S10.0.9.48:11111:32264076
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.0.9.48:11111:32264076. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 10.0.9.48:11111. Error: HostUnreachable
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 64
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 90
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 75
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 129
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 76
   at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 163
[10:40:50 INF] Establishing connection to endpoint S10.0.9.100:11111:32264074
[10:40:50 INF] Establishing connection to endpoint S10.0.9.48:11111:32264076
  1. So I went into the container to capture a dump with dotnet-dump and see which threads were still alive.
    
    root@account-service-5b8fcc48c9-cqzvk:/tmp# dotnet --info

    Host:
      Version:      7.0.1
      Architecture: arm64
      Commit:       97203d38ba

    .NET SDKs installed:
      No SDKs were found.

    .NET runtimes installed:
      Microsoft.AspNetCore.App 7.0.1 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
      Microsoft.NETCore.App 7.0.1 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

    Other architectures found:
      None

    Environment variables:
      Not set

    global.json file:
      Not found

    Learn more:
      https://aka.ms/dotnet/info

    Download .NET:
      https://aka.ms/dotnet/download

    root@account-service-5b8fcc48c9-cqzvk:/tmp# dotnet tool install --global dotnet-dump
    The command could not be loaded, possibly because:

    Download a .NET SDK:
    https://aka.ms/dotnet/download

    Learn about SDK resolution:
    https://aka.ms/dotnet/sdk-not-found


Unfortunately, it seems that installing dotnet-dump this way requires the SDK.

4. Before rebuilding the image with the SDK, I wrote a simple program that calls `Environment.FailFast` while one foreground thread is running, to see whether something is wrong with the runtime image:

    root@account-service-5b8fcc48c9-cqzvk:/tmp# ls -al
    total 328
    drwxrwxrwt 1 root root 161 Jan 9 10:23 .
    drwxr-xr-x 1 root root 39 Jan 9 10:12 ..
    -rwxrwxrwx 1 root root 151064 Jan 9 10:23 ConsoleApp3
    -rwxrwxrwx 1 root root 403 Jan 9 10:23 ConsoleApp3.deps.json
    -rwxrwxrwx 1 root root 5120 Jan 9 10:23 ConsoleApp3.dll
    -rwxrwxrwx 1 root root 153600 Jan 9 10:23 ConsoleApp3.exe
    -rwxrwxrwx 1 root root 10552 Jan 9 10:23 ConsoleApp3.pdb
    -rwxrwxrwx 1 root root 139 Jan 9 10:23 ConsoleApp3.runtimeconfig.json
    root@account-service-5b8fcc48c9-cqzvk:/tmp# ./ConsoleApp3.exe
    bash: ./ConsoleApp3.exe: cannot execute binary file: Exec format error
    root@account-service-5b8fcc48c9-cqzvk:/tmp# dotnet ConsoleApp3.dll
    Hello, World!
    Process terminated. hello?
       at System.Environment.FailFast(System.String)
       at Program.<Main>$(System.String[])
    Aborted (core dumped)



It crashed.
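The test program itself is not shown in the thread. A minimal sketch of what such a check might look like follows, assuming a plain console app. The names `FailFastProbe` and `StartForegroundThread` are hypothetical, and the original apparently used top-level statements rather than an explicit class.

```csharp
using System;
using System.Threading;

public static class FailFastProbe
{
    // Starts a foreground thread that sleeps forever. With such a thread
    // running, simply returning from Main would not end the process.
    public static Thread StartForegroundThread()
    {
        var thread = new Thread(() => Thread.Sleep(Timeout.Infinite))
        {
            IsBackground = false,
            Name = "keep-alive" // hypothetical name, for illustration only
        };
        thread.Start();
        return thread;
    }

    public static void Main()
    {
        Console.WriteLine("Hello, World!");
        StartForegroundThread();

        // FailFast is expected to terminate the process immediately,
        // even though a foreground thread is still alive.
        Environment.FailFast("hello?");
    }
}
```

If the process aborts right after printing the greeting, as in the output above, `Environment.FailFast` works in the runtime image and the cause most likely lies elsewhere.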

5. Unfortunately, dotnet-dump does not work even with the SDK installed, so I stopped investigating here.
ReubenBond commented 1 year ago

Interesting. Thanks for investigating. I wonder whether injecting your own IHostApplicationLifetime into IFatalExceptionHandler and terminating the application that way would work.

What base image/distro are you using? Is your process running under a debugger?

In the past, when we've needed to diagnose issues with processes running in containers using diagnostics tools, we've installed the SDK into the container on the fly, in the base image, or configured a dotnet-monitor sidecar container.

christallire commented 1 year ago

> What base image/distro are you using?

mcr.microsoft.com/dotnet/aspnet:7.0-jammy-arm64v8 (ubuntu) and Amazon Linux

/app# uname -a
Linux account-service-569c9ccb97-9qkgh 5.4.226-129.415.amzn2.aarch64 #1 SMP Fri Dec 9 12:54:10 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

> Is your process running under a debugger?

nope

> In the past, when we've needed to diagnose issues with processes running in containers using diagnostics tools, we've installed the SDK into the container on the fly, in the base image, or configured a dotnet-monitor sidecar container.

Thanks for the advice. I've tried that too, but `dotnet-dump ps` doesn't detect any dotnet processes in the environment, even with root privileges. Weird.
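One thing that may be worth checking when `dotnet-dump ps` finds nothing: the diagnostic tools discover processes through a Unix domain socket that the runtime creates in its temp directory. The sketch below assumes the usual `dotnet-diagnostic-<pid>-<key>-socket` naming and that the tool and the target process share the same `/tmp` (or `TMPDIR`). The helper name `list_diag_sockets` is hypothetical.

```shell
#!/bin/sh
# List .NET diagnostic IPC sockets visible from this shell.
# Takes an optional directory; defaults to $TMPDIR, then /tmp.
list_diag_sockets() {
    dir="${1:-${TMPDIR:-/tmp}}"
    # The glob stays literal when nothing matches, so test each entry.
    for s in "$dir"/dotnet-diagnostic-*-socket; do
        [ -S "$s" ] && echo "$s"
    done
    return 0
}

list_diag_sockets

# Diagnostics can be switched off through this variable, in which case
# no socket is created. Note that this only shows the value in the
# current shell, not in the target process.
echo "DOTNET_EnableDiagnostics=${DOTNET_EnableDiagnostics:-<unset>}"
```

If no socket is listed, the target process may use a different temp directory, may run in a different container or mount namespace, or may have diagnostics disabled. Any of these would leave `dotnet-dump ps` with nothing to report.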