Grains with an identity of String.Empty cannot await calls to grains in other Silos

mcallim90 commented 3 months ago

I'm using Orleans 8.1.0 and I'm encountering an error which is only present in our Production stage which contains two silos in its cluster. Currently I make a request to a generic catalog like grain (currently implemented as a stateless worker) which then activates and calls specific data grain of the same generic type. From logs it appears that the processing of the request proceeds as normal through both grains until its time to return the result from the second grain.

If the second grain was activated in same silo as the first everything succeeds and the whole chain of requests takes about half a second but if the second grain was activated in a different silo then the request times out seemingly as if there is some kind of deadlock.

Received status update for pending request, Request: "Request [S10.0.236.220:11111:80076501 sys.client/hosted-10.0.236.220:11111@80076501]->[S10.0.236.220:11111:80076501 entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/] Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>[(Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>)Insights.Grains.EntityIndexGrain`1[Insights.TestInsights.Model.TestProject]].GetEntityAsync(TestProject1) #4579". Status: "[Activation: S10.0.236.220:11111:80076501/entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/@de41ced420ad4d92a03e2943f5ba7154 #GrainType=Insights.Grains.EntityIndexGrain`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]],Insights.Grains Placement=StatelessWorkerPlacement State=Valid NonReentrancyQueueSize=0 NumRunning=1 IdlenessTimeSpan=00:00:00 CollectionAgeLimit=00:15:00]
TaskScheduler status: WorkItemGroup:Name=[Activation: S10.0.236.220:11111:80076501/entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/@de41ced420ad4d92a03e2943f5ba7154#GrainType=Insights.Grains.EntityIndexGrain`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]],Insights.Grains Placement=StatelessWorkerPlacement State=Valid],WorkGroupStatus=Waiting. Currently QueuedWorkItems=0; Total Enqueued=4; Total processed=4; TaskRunner=ActivationTaskScheduler-56:Queued=0; Detailed context=<[Activation: S10.0.236.220:11111:80076501/entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/@de41ced420ad4d92a03e2943f5ba7154 #GrainType=Insights.Grains.EntityIndexGrain`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]],Insights.Grains Placement=StatelessWorkerPlacement State=Valid NonReentrancyQueueSize=0 NumRunning=1 IdlenessTimeSpan=00:00:00 CollectionAgeLimit=00:15:00 CurrentlyExecuting=Request [S10.0.236.220:11111:80076501 sys.client/hosted-10.0.236.220:11111@80076501]->[S10.0.236.220:11111:80076501 entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/] Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>[(Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>)Insights.Grains.EntityIndexGrain`1[Insights.TestInsights.Model.TestProject]].GetEntityAsync(TestProject1) #4579]>
Message Request [S10.0.236.220:11111:80076501 sys.client/hosted-10.0.236.220:11111@80076501]->[S10.0.236.220:11111:80076501 entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/] Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>[(Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>)Insights.Grains.EntityIndexGrain`1[Insights.TestInsights.Model.TestProject]].GetEntityAsync(TestProject1) #4579 was enqueued 00:00:12.6000000 ago and has now been executing for 00:00:12.6000000."

Unhandled exception, trace 9fe8b805-c85e-43a2-9b3c-1ec4a8500fc8
System.TimeoutException: Response did not arrive on time in 00:00:30 for message: Request [S10.0.236.220:11111:80076501 sys.client/hosted-10.0.236.220:11111@80076501]->[S10.0.236.220:11111:80076501 entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/] Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>[(Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>)Insights.Grains.EntityIndexGrain`1[Insights.TestInsights.Model.TestProject]].GetEntityAsync(TestProject1) #4579. Last known status is IsExecuting: True, IsWaiting: False, Diagnostics: [[Activation: S10.0.236.220:11111:80076501/entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/@de41ced420ad4d92a03e2943f5ba7154 #GrainType=Insights.Grains.EntityIndexGrain`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]],Insights.Grains Placement=StatelessWorkerPlacement State=Valid NonReentrancyQueueSize=0 NumRunning=1 IdlenessTimeSpan=00:00:00 CollectionAgeLimit=00:15:00], TaskScheduler status: WorkItemGroup:Name=[Activation: S10.0.236.220:11111:80076501/entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/@de41ced420ad4d92a03e2943f5ba7154#GrainType=Insights.Grains.EntityIndexGrain`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]],Insights.Grains Placement=StatelessWorkerPlacement State=Valid],WorkGroupStatus=Waiting. Currently QueuedWorkItems=0; Total Enqueued=4; Total processed=4; TaskRunner=ActivationTaskScheduler-56:Queued=0; Detailed context=<[Activation: S10.0.236.220:11111:80076501/entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/@de41ced420ad4d92a03e2943f5ba7154 #GrainType=Insights.Grains.EntityIndexGrain`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]],Insights.Grains Placement=StatelessWorkerPlacement State=Valid NonReentrancyQueueSize=0 NumRunning=1 IdlenessTimeSpan=00:00:00 CollectionAgeLimit=00:15:00 CurrentlyExecuting=Request [S10.0.236.220:11111:80076501 sys.client/hosted-10.0.236.220:11111@80076501]->[S10.0.236.220:11111:80076501 entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/] Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>[(Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>)Insights.Grains.EntityIndexGrain`1[Insights.TestInsights.Model.TestProject]].GetEntityAsync(TestProject1) #4579]>, Message Request [S10.0.236.220:11111:80076501 sys.client/hosted-10.0.236.220:11111@80076501]->[S10.0.236.220:11111:80076501 entityindex`1[[Insights.TestInsights.Model.TestProject,Insights.TestInsights.Model]]/] Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>[(Insights.Grains.Interfaces.IEntityIndexGrain<TEntity>)Insights.Grains.EntityIndexGrain`1[Insights.TestInsights.Model.TestProject]].GetEntityAsync(TestProject1) #4579 was enqueued 00:00:27.6000000 ago and has now been executing for 00:00:27.6000000.]. 
   at Orleans.Serialization.Invocation.ResponseCompletionSource`1.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 230
   at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location ---
   at Insights.TestInsights.SiloExtensions.TestInsightsSiloExtensions.GetEntityAsync[TEntity](String entityIdentifier, IGrainFactory grainFactory, ILogger`1 logger) in /app/src/TestInsights/Insights.TestInsights.SiloExtensions/TestInsightsSiloExtensions.cs:line 446
   at Microsoft.AspNetCore.Http.RequestDelegateFactory.ExecuteTaskResult[T](Task`1 task, HttpContext httpContext)
   at Microsoft.AspNetCore.Routing.EndpointMiddleware.<Invoke>g__AwaitRequestTask|7_0(Endpoint endpoint, Task requestTask, ILogger logger)

I would assume that if there is no deadlock in within a single silo there would not be in multiple silos. Could this be a silent serialization failure? Not sure how to proceed.

mcallim90 commented 3 months ago

I think I can safely rule out a silent serialization error as the issue. I changed the interface of all my grains to return strings rather than records with the [GenerateSerializer] attribute. I serialized and deserialized the results using NewtonSoft instead and the issue persists. The grain behavior is correct locally or in a single silo cluster but broken in a multi silo cluster.

Another interesting note is that the first grain A on silo A trying to await a response from grain B on silo B is able to get things like exceptions just fine so I don't believe it is a networking issue and rather I think it must be related to the grain registry.

mcallim90 commented 3 months ago

My team managed to track down the key elements of this failure.

Given the following test code,

using Orleans.Runtime;

namespace Insights.Grains;

public sealed class TestGrain(
  IGrainContext grainContext,
  ILogger<TestGrain> logger)
  : ITestGrain
{

  private string? data;

  [Alias("PutTestDataAsync")]
  public Task PutTestDataAsync(string data)
  {
      logger.LogError("{grainType} {typeName} putting data {data}",
          this.GetType().Name,
          nameof(String),
          data);
      this.data = data;
      return Task.CompletedTask;
  }

  [Alias("PutTestDataAsync")]
  public Task<string?> GetTestDataAsync()
  {
      logger.LogError("{grainType} {typeName} returning data {data}",
          this.GetType().Name,
          nameof(String),
          this.data);
      return Task.FromResult(this.data);
  }
}

 using Microsoft.Extensions.DependencyInjection;
 using Microsoft.Extensions.Logging;

 using Orleans.Concurrency;
 using Orleans.Runtime;

 namespace Insights.Grains;

 [Reentrant]
 [StatelessWorker]
 public sealed class TestIndexGrain(IGrainContext grainContext,
     ILogger<TestIndexGrain> logger)
     : ITestIndexGrain
 {
     public async Task<List<string>> ListTestDataAsync()
     {
         logger.LogError("{grainType} {typeName} listing data.", this.GetType().Name, nameof(String));
         List<string> result = [];
         for (int id = 0; id <= 10; id++)
         {
             logger.LogError("Getting grain with id {id}", id);
             IGrainFactory grainFactory = grainContext.ActivationServices.GetRequiredService<IGrainFactory>();
             string? data = await grainFactory.GetGrain<ITestGrain>(id.ToString()).GetTestDataAsync();
             if (data != null)
             {
                 logger.LogError("Found data {data}", data);
                 result.Add(data);
             }
         }
         logger.LogError("Returning list {result}", result);
         return await Task.FromResult(result);
     }}

 namespace Insights.Grains.Interfaces;

[Alias("ITestIndexGrain")]
public interface ITestIndexGrain: IGrainWithStringKey
{
  [Alias("ListTestDataAsync")]
  Task<List<string>> ListTestDataAsync();
}

namespace Insights.Grains.Interfaces;

[Alias("ITestGrain")]
public interface ITestGrain: IGrainWithStringKey
{
  [Alias("PutTestDataAsync")]
  Task PutTestDataAsync(string data);

  [Alias("GetTestDataAsync")]
  Task<string?> GetTestDataAsync();
}

Whenever we call the list operation in a multi-silo cluster the operation fails clearly at the respond phase when the index grain is awaiting a response from another silo.

builder.MapPut("/api/testgrain/{id}", [Authorize(Policy = Authorization.Policy.Admin)] async (IGrainFactory grainFactory, [FromRoute] string id, [FromBody] string data) =>
{
    ITestGrain testGrain = grainFactory.GetGrain<ITestGrain>(id);
    await testGrain.PutTestDataAsync(data);
});

builder.MapGet("/api/testgrain/{id}", [Authorize(Policy = Authorization.Policy.Admin)] async (IGrainFactory grainFactory, [FromRoute] string id) =>
{
    ITestGrain testGrain = grainFactory.GetGrain<ITestGrain>(id);
    return await testGrain.GetTestDataAsync();
});

builder.MapGet("/api/testgrains", [Authorize(Policy = Authorization.Policy.Admin)] async (IGrainFactory grainFactory) =>
{
    ITestIndexGrain testGrain = grainFactory.GetGrain<ITestIndexGrain>(string.Empty);
    return await testGrain.ListTestDataAsync();
});

This failure is avoided when instead of string.empty we declare the index as a GrainWithGuidKey and use guid.empty. I believe this a legitimate bug or undocumented limitation of the Orleans framework

ReubenBond commented 3 months ago

Orleans should throw in this case Thanks for tracking it down. EDIT: if you're willing, a PR would be welcome.

ledjon-behluli commented 2 months ago

this has bit me in the past too

dotnet / orleans

Grains with an identity of String.Empty cannot await calls to grains in other Silos #9070