dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Ungraceful shutdown leading to indefinite startup failures #8353

Open jsteinich opened 1 year ago

jsteinich commented 1 year ago

Observed environment:

Scenario:

Specific issues:

Specific exception trace:

Unhandled exception. Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S127.0.0.1:22254:417104037. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 127.0.0.1:22254. Error: ConnectionRefused
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken)
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint)
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|9_0(ValueTask`1 c, Message m)
   at Orleans.Runtime.OutgoingCallInvoker.Invoke()
   at Orleans.Runtime.OutgoingCallInvoker.Invoke()
   at Orleans.Runtime.GrainReferenceRuntime.InvokeWithFilters(GrainReference reference, InvokeMethodRequest request, String debugContext, InvokeMethodOptions options)
   at Orleans.Internal.OrleansTaskExtentions.<ToTypedTask>g__ConvertAsync|4_0[T](Task`1 asyncTask)
   at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount)
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem`1.Execute()
   at Orleans.Runtime.Placement.RandomPlacementDirector.OnSelectActivation(PlacementStrategy strategy, GrainId target, IPlacementRuntime context)
   at Orleans.Runtime.Placement.PlacementDirectorsManager.SelectOrAddActivation(ActivationAddress sendingAddress, PlacementTarget targetGrain, IPlacementRuntime context, PlacementStrategy strategy)
   at Orleans.Runtime.Dispatcher.AddressMessageAsync(Message message, PlacementTarget target, PlacementStrategy strategy, ActivationAddress targetAddress)
   at Orleans.Runtime.Dispatcher.<>c__DisplayClass36_0.<<AsyncSendMessage>g__TransportMessageAferSending|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Orleans.Runtime.OutgoingCallInvoker.Invoke()
   at Orleans.Runtime.OutgoingCallInvoker.Invoke()
   at Orleans.Runtime.GrainReferenceRuntime.InvokeWithFilters(GrainReference reference, InvokeMethodRequest request, String debugContext, InvokeMethodOptions options)
   at Orleans.Internal.OrleansTaskExtentions.<ToTypedTask>g__ConvertAsync|4_0[T](Task`1 asyncTask)
   at PerBlue.Common.GameServer.Grains.Configuration.RuntimeConfigurationService.RequestInitialConfiguration()
   at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
   at Orleans.LifecycleSubject.<OnStart>g__CallOnStart|7_0(Int32 stage, OrderedObserver observer, CancellationToken cancellationToken)
   at Orleans.LifecycleSubject.OnStart(CancellationToken ct)
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
   at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloWrapper.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
   at ...

jsteinich commented 1 year ago

I tried performing a graceful shutdown on any startup task failure. This does skip the 10-minute wait cycle, but it does not actually resolve the issue.

Looking into it a bit further, I can see that the RandomPlacementDirector calls into the SiloStatusOracle to get the active silos, and that call simply looks at the membership status: https://github.com/dotnet/orleans/blob/ec31259418fcc574d575bbb70427719d18cc522d/src/Orleans.Runtime/MembershipService/SiloStatusOracle.cs#L74

The bad silo also cannot be voted dead, because the new silo shuts down before it gets a chance to run silo probes.

ReubenBond commented 1 year ago

There were quite a few changes between 3.0.2 and 3.6.5, one of which may have rectified this issue. Is there something preventing an upgrade? I recommend trying that before diving too deeply into this.

jsteinich commented 1 year ago

> There were quite a few changes between 3.0.2 and 3.6.5, one of which may have rectified this issue. Is there something preventing an upgrade? I recommend that before diving too deeply into this

I attempted a quick upgrade, but ran into some dependency conflicts. I also see that there are some breaking changes.

I'm hoping that we'll be able to start an upgrade to 7.x in the near future, but if that fails to materialize, I'll revisit the 3.6.5 update.

jsteinich commented 1 year ago

@ReubenBond I tested this again after upgrading to Orleans 7 and the behavior is the same.

I currently have a workaround: I wrap the startup tasks and keep track of their failure status. If any of them failed, I use IClusterMembershipService.TryKill to clean up the bad entries so that the next startup is successful.
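
The workaround described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the application: `StartupTaskTracker`, `Wrap`, and `CleanupIfFailedAsync` are hypothetical names, and the cleanup delegate stands in for whatever logic identifies the stale membership entries and passes them to `IClusterMembershipService.TryKill`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of Orleans): wraps startup work and records
// failures instead of letting the exception escape and abort silo startup.
public sealed class StartupTaskTracker
{
    private readonly ConcurrentQueue<(string Name, Exception Error)> _failures =
        new ConcurrentQueue<(string Name, Exception Error)>();

    // True once at least one wrapped task has thrown.
    public bool AnyFailed => !_failures.IsEmpty;

    // Snapshot of the recorded failures, in the order they occurred.
    public IReadOnlyCollection<(string Name, Exception Error)> Failures => _failures.ToArray();

    // Returns a delegate with the same shape as the wrapped task.
    // The returned delegate never throws; it records the exception instead.
    public Func<CancellationToken, Task> Wrap(string name, Func<CancellationToken, Task> startupTask)
    {
        return async cancellationToken =>
        {
            try
            {
                await startupTask(cancellationToken);
            }
            catch (Exception ex)
            {
                _failures.Enqueue((name, ex));
            }
        };
    }

    // Runs the supplied cleanup only if some wrapped task failed.
    // Assumption: in the workaround above, the cleanup would call
    // IClusterMembershipService.TryKill for each stale entry.
    public async Task<bool> CleanupIfFailedAsync(Func<Task> cleanup)
    {
        if (!AnyFailed)
        {
            return false;
        }

        await cleanup();
        return true;
    }
}
```

How the wrapped delegates are registered with the silo, and how the stale entries are found, is deliberately left out; the sketch only shows the failure-tracking shape of the workaround.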

ReubenBond commented 1 year ago

> Startup task failure leading to ungraceful shutdown. I can wrap the startup tasks, but perhaps this scenario could be handled by a more graceful shutdown.

I think we should implement this. We shouldn't be ungracefully shutting down the silo just because application code failed.

jsteinich commented 1 year ago

> I think we should implement this. We shouldn't be ungracefully shutting down the silo just because application code failed.

There also appears to be an issue with the grain directory not respecting the "I am alive" timeout that the membership system uses (not explicitly application code failure). Here's an updated trace of that:

Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S127.0.0.1:22253:40146827. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 127.0.0.1:22253. Error: ConnectionRefused
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 54
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 193
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 221
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 74
   at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 727
   at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 30
   at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 338
   at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync|40_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 439
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 74

JorgeCandeias commented 1 year ago

This looks very similar to what we're experiencing with an app on v3.7.1. The difference is that we're using SQL Server for everything. I also see the grain directory message above in some stack traces, though that is only one of several variations.

hankovich commented 10 months ago

Any progress on this? I'm also seeing these exceptions. I use k8s clustering with Orleans 8.

ReubenBond commented 10 months ago

@hankovich does your application ever start? Are you able to provide more detail, potentially including logs?