dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License
10.09k stars 2.03k forks source link

ArgumentNullException thrown when stopping the silo #4965

Closed pashchuk closed 6 years ago

pashchuk commented 6 years ago

When the silo is starting to shut down gracefully, at some point strange exceptions has been thrown. The most important ones:

Runtime.Messaging.IncomingMessageAgent/Ping[100026]
      Asynch agent Runtime.Messaging.IncomingMessageAgent/Ping encountered unexpected exception The Stage will be restarted.
System.ArgumentNullException: Value cannot be null.
   at System.Threading.Monitor.Exit(Object obj)
   at System.Threading.SemaphoreSlim.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Collections.Concurrent.BlockingCollection`1.TryTakeWithNoTimeValidation(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken, CancellationTokenSource combinedTokenSource)
   at System.Collections.Concurrent.BlockingCollection`1.TryTake(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Collections.Concurrent.BlockingCollection`1.Take()
   at Orleans.Runtime.Messaging.InboundMessageQueue.WaitMessage(Categories type)
   at Orleans.Runtime.Messaging.MessageCenter.WaitMessage(Categories type, CancellationToken ct)
   at Orleans.Runtime.Messaging.IncomingMessageAgent.Run()
   at Orleans.Threading.FiltersApplicant`1.Apply(T action)

and

Runtime.Messaging.IncomingMessageAgent/Application[100026]
      Asynch agent Runtime.Messaging.IncomingMessageAgent/Application encountered unexpected exception The Stage will be restarted.
System.ArgumentNullException: Value cannot be null.
   at System.Threading.Monitor.Exit(Object obj)
   at System.Threading.SemaphoreSlim.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Collections.Concurrent.BlockingCollection`1.TryTakeWithNoTimeValidation(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken, CancellationTokenSource combinedTokenSource)
   at System.Collections.Concurrent.BlockingCollection`1.TryTake(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Collections.Concurrent.BlockingCollection`1.Take()
   at Orleans.Runtime.Messaging.InboundMessageQueue.WaitMessage(Categories type)
   at Orleans.Runtime.Messaging.MessageCenter.WaitMessage(Categories type, CancellationToken ct)
   at Orleans.Runtime.Messaging.IncomingMessageAgent.Run()
   at Orleans.Threading.FiltersApplicant`1.Apply(T action)

and

Runtime.Messaging.IncomingMessageAgent/System[100026]
      Asynch agent Runtime.Messaging.IncomingMessageAgent/System encountered unexpected exception The Stage will be restarted.
System.ArgumentNullException: Value cannot be null.
   at System.Threading.Monitor.Exit(Object obj)
   at System.Threading.SemaphoreSlim.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Collections.Concurrent.BlockingCollection`1.TryTakeWithNoTimeValidation(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken, CancellationTokenSource combinedTokenSource)
   at System.Collections.Concurrent.BlockingCollection`1.TryTake(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Collections.Concurrent.BlockingCollection`1.Take()
   at Orleans.Runtime.Messaging.InboundMessageQueue.WaitMessage(Categories type)
   at Orleans.Runtime.Messaging.MessageCenter.WaitMessage(Categories type, CancellationToken ct)
   at Orleans.Runtime.Messaging.IncomingMessageAgent.Run()
   at Orleans.Threading.FiltersApplicant`1.Apply(T action)

Less important

Orleans.Runtime.Silo[100325]
      Ignoring System.ObjectDisposedException exception thrown from an action called by Silo.Stop.
System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'IServiceProvider'.
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.ThrowHelper.ThrowObjectDisposedException()
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope.GetService(Type serviceType)
   at Microsoft.Extensions.DependencyInjection.ActivatorUtilities.ConstructorMatcher.CreateInstance(IServiceProvider provider)
   at Microsoft.Extensions.DependencyInjection.ActivatorUtilities.CreateInstance[T](IServiceProvider provider, Object[] parameters)
   at Orleans.Runtime.AsynchAgent.EnsureExecutorInitialized()
   at Orleans.Runtime.AsynchAgent.Start()
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<>c__DisplayClass9_1.<.ctor>b__0()
   at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
   at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
   at System.Lazy`1.CreateValue()
   at Orleans.Runtime.Messaging.OutboundMessageQueue.Dispose()
   at Orleans.Runtime.Messaging.MessageCenter.Dispose()
   at Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope.Dispose()
   at Orleans.Runtime.Utils.SafeExecute(Action action, ILogger logger, Func`1 callerGetter)

the silo is deployed to the Kubernetus v1.10.6-gke.2 with 3 replicas as a StatefulSet using AdoNet clustering provider Orleans v2.1.0-beta1 dotnet sdk v2.1.401 dotnet runtime v2.1.3

Looks like there is something bad happened in the corefx blocking collection. Regarding ObjectDisposedException it really minor but a bit annoying because why the silo is trying to dispose the service provider again? Does anyone know anything about this?

xiazen commented 6 years ago

hmm I wonder how much of the blocking collection issue is related to https://github.com/dotnet/corefx/issues/30781 . They say they fixed that issue in 2.1.5 release. But they never released 2.1.5 based on their website.. hmm

I say the issue looks related because, based on our InboundMessageQueue.WaitMessage code, it would catch InvalidOperationException and return a null if this.messageQueues[(int)type].Take(); throws a InvalidOperationException , which sounds like the path your exception log is hitting. And this is very similar to the issue described in https://github.com/dotnet/corefx/issues/30781

Maybe upgrade to corefx 2.2 would fix this issue.

xiazen commented 6 years ago

Regarding ObjectDisposedException it really minor but a bit annoying because why the silo is trying to dispose the service provider again?

I never see this before in our tests. Maybe it has something to do with running inside Kubernetus . Do you have a repro of this?

ReubenBond commented 6 years ago

The ObjectDisposedException should be much less common on 2.1.0. It occurs because worker threads are waiting for work, but the queue they wait on is disposed from under them. In 2.1.0 we have a new scheduler which doesn't operate in the same way and so the exceptions should not occur.

I believe we had some code to avoid those exceptions by pumping the queue once after cancellation, but I cannot see it anymore.

patricksuo commented 6 years ago

FYI: https://github.com/dotnet/corefx/issues/14320

patricksuo commented 6 years ago

I can reproduce ArgumentNullException on my local machine.

I propose either 1) suppressing this kind of exception or 2) canceling the blocking thread before destroying in-used objects (e.g. message center ->inbound/outbound queue -> BlockingCollection -> SemaphoreSlim ).

These strange exceptions scary newbie (like me).