dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License
10.07k stars 2.03k forks source link

Missing Grain Activations #5180

Closed Sebbs128 closed 5 years ago

Sebbs128 commented 5 years ago

I'm experiencing an issue where a grain is sometimes not activating, frustratingly consistently always skipping the activation that is critical to the application. I've had this issue in both 2.0.4 and 2.1.2. Clients and silos are both in the same process, and, yes, this is occurring when there's only one process running in the environment (which is a Windows Service on Windows Server 2012 R2, .NET Framework 4.6.1 with .NET Core 2.1.302, and SQL Server storage).

I'm getting and calling the method on the grain on what is basically an inherited event handler from a base class. While debugging with breakpoints, I can see the particular event come in to the handler, and go over the lines to get and call the grain. The actual type of baseEvent is some subtype deriving from BaseEventType.

protected override void ProcessEvent(BaseEventType baseEvent)
{
  // need to check if OrleansClient is initialized, as this method can be hit before the client is ready
  if (OrleansClient?.IsInitialized == true)
  {
    var grain = OrleansClient.GetGrain<IBaseEventProcessor>(baseEvent.ID); // ID is a GUID string
    grain.ProcessBaseEvent(baseEvent);
  }
}

The ProcessEvent in the grain is an implementation of the command pattern, with deduplication of the event's ID.

public interface IBaseEventProcessor : IGrainWithStringKey
{
  Task ProcessBaseEvent(BaseEventType baseEvent);
}

public class BaseEventProcessorGrain : Grain, IBaseEventProcessor
{
  private readonly IMemoryCache _memoryCache;
  private readonly IEnumerable<IBaseEventHandler> _eventHandlers;
  private readonly ILogger<BaseEventProcessorGrain> _logger;

  public BaseEventProcessorGrain(IEnumerable<IBaseEventHandler> eventHandlers, IMemoryCache memoryCache, ILogger<BaseEventProcessorGrain> logger)
  {
    _eventHandlers = eventHandlers ?? throw new ArgumentNullException(nameof(eventHandlers));
    _memoryCache = memoryCache ?? throw new ArgumentNullException(nameof(memoryCache));
    _logger = logger ?? throw new ArgumentNullException(nameof(logger));
  }

  public async Task ProcessBaseEvent(BaseEventType baseEvent)
  {
    if (_memoryCache.TryGetValue(baseEvent.ID, out object _))
    {
      _logger.LogDebug("Event {ID} has already been processed.", baseEvent.ID);
      return;
    }
    using (var cacheEntry = _memoryCache.CreateEntry(baseEvent.ID))
    {
      cacheEntry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5);
    }

    foreach (var handler in _eventHandlers.Where(h => h.CanProcess(baseEvent)))
    {
      try
      {
        await handler.HandleEvent(baseEvent);
      }
      catch (Exception ex)
      {
        _logger.LogError(ex, "Processing {HandlerName} failed.", handler.GetType().Name);
      }
    }
  }
}

I've tried marking the class as reentrant, the interface method for interleaving. I've overridden OnActivateAsync() in the grain to see if there was some error occurring during activation, but I'm consistently getting nothing in a particular situation. There will be 6 events (always the same 6, in the same order) raised via the overridden ProcessEvent(), but only the last 3 actually end up as grain activations.

To add to the intrigue/frustration, those events are all actually internally grouped into a parent container event type...

public class EventContainer
{
  public BaseEventType[] BaseEvents { get; set; }
  public DateTime Timestamp { get; set; }
  // other irrelevant properties
}

In the situation that's giving me issues, the first 3 events (including the one that is critical to the application) are in the same EventContainer, and the last 3 in their own (so, 4 EventContainers total). I tried extending my grain and interface to have a method that takes the EventContainer and passes off inside that method to the method that takes each BaseEventType, but of course I end up with no activated grain for that first EventContainer. There are other situations where I easily get more than 6 unique activations at once.

I've checked logs (I have both Microsoft.Extensions.Logging.Debug and Microsoft.Extensions.Logging.EventLog ILogger implementations added) and I'm not seeing any errors. orleans_log.txt

sergeybykov commented 5 years ago

When those events that you are saying do not cause activation of the target grain(s) are sent by the client via calling

await grain.ProcessBaseEvent(some event);

is the client getting an exception after awaiting the call?

Sebbs128 commented 5 years ago

I'm not getting the timeout activations with this. I've seen the timeouts on other (unrelated) activations, but never on this one.

sergeybykov commented 5 years ago

Does the await return successfully? If it does, the call was successfully process by the grain, which would have been successfully activated for that. If it throws any exception, not necessarily a timeout one, that should point to an issue.

Sebbs128 commented 5 years ago

I wasn't in the office yesterday, so wasn't able to retest the code until today.

First off, I wasn't awaiting the call (oops... though that wasn't affecting the majority of activations). Now that I've awaited the call, I'm seeing a SerializationException "Reference with no referred object" for these ones where an activation isn't occurring (the majority of other activations don't encounter this exception).

Only place in Orleans I can see this being thrown with that message is in DeserializationContext.FetchReferencedObject(). I'm not sure what in this particular subtype of BaseEventType might be causing that, but I'll continue looking.

Sebbs128 commented 5 years ago

I believe I've narrowed it down to references to DBNull.

Short version of why those references are there...BaseEventType has a few properties, one of which is to EventContainer (described in the OP). EventContainer has a subtype EventDataContainer which contains a list of changes in the database (represented by an object[]). When one of the elements in this array is a DBNull, I end up with the SerializationException mentioned before.

I'll try and put together a small, testable demo if you like, in case any changes want to be done to cope with DBNull. I believe I've found a way forward in my own code as well (the data changes don't matter to the grain, so I'll attempt to clone the BaseEventType instance without the EventContainer property).

sergeybykov commented 5 years ago

@Sebbs128 What do you think we should do with this issue?

Sebbs128 commented 5 years ago

Ideally, from my perspective, this would become a bug or enhancement to ensure DBNull is supported in the serialization contexts. I understand that this is likely a niche scenario though, and if it is progressed at all it may be something I would need to contribute myself.

If there isn't enough value in supporting DBNull in serialization to justify leaving this open, then we can close it.

sergeybykov commented 5 years ago

Hmm, I wonder what @ReubenBond thinks. Looks like it gets serialized fine, but is failing to deserialized. I wonder why.

ReubenBond commented 5 years ago

Does the client have the assembly containing DBNull loaded? If so, ensure that you have [assembly: KnownType(typeof(DBNull))] in the grain interfaces assembly

Sebbs128 commented 5 years ago

Both the grain interface and grain implementation assemblies know the System.DBNull type. I just tried adding [assembly: KnownType(typeof(DBNull))] on both the grain interface and implementation (ensuring it was using the attribute from Orleans.CodeGeneration and not System.Runtime.Serialization), with the same result as before.

I also tried out [assembly: GenerateSerializer(typeof(DBNull))] which causes the build to fail because DBNull is apparently not accessible.

When I have a bit more time I'll make sure all the types between BaseEventType and EventDataContainer can all be serialized by Orleans.CodeGeneration, though they're all marked Serializable already anyway.

sergeybykov commented 5 years ago

@Sebbs128 Has this been resolved?