Closed by mcallim90 2 months ago
I think I can safely rule out a silent serialization error as the issue. I changed the interface of all my grains to return strings rather than records with the [GenerateSerializer] attribute, and serialized and deserialized the results using Newtonsoft.Json instead. The issue persists: the grain behavior is correct locally or in a single-silo cluster, but broken in a multi-silo cluster.
Another interesting note: grain A on silo A, awaiting a response from grain B on silo B, still receives things like exceptions just fine. So I don't believe it is a networking issue; rather, I think it must be related to the grain registry.
My team managed to track down the key elements of this failure.
Given the following test code,
using Microsoft.Extensions.Logging;
using Orleans.Runtime;

namespace Insights.Grains;

// Holds a single string in memory, one activation per string key.
public sealed class TestGrain(
    IGrainContext grainContext,
    ILogger<TestGrain> logger)
    : ITestGrain
{
    private string? data;

    [Alias("PutTestDataAsync")]
    public Task PutTestDataAsync(string data)
    {
        logger.LogError("{grainType} {typeName} putting data {data}",
            this.GetType().Name,
            nameof(String),
            data);
        this.data = data;
        return Task.CompletedTask;
    }

    [Alias("GetTestDataAsync")]
    public Task<string?> GetTestDataAsync()
    {
        logger.LogError("{grainType} {typeName} returning data {data}",
            this.GetType().Name,
            nameof(String),
            this.data);
        return Task.FromResult(this.data);
    }
}
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Orleans.Concurrency;
using Orleans.Runtime;

namespace Insights.Grains;

// Stateless worker that collects the data held by TestGrain ids 0 through 10.
[Reentrant]
[StatelessWorker]
public sealed class TestIndexGrain(IGrainContext grainContext,
    ILogger<TestIndexGrain> logger)
    : ITestIndexGrain
{
    public async Task<List<string>> ListTestDataAsync()
    {
        logger.LogError("{grainType} {typeName} listing data.", this.GetType().Name, nameof(String));
        List<string> result = [];
        // Resolve the factory once rather than on every iteration.
        IGrainFactory grainFactory = grainContext.ActivationServices.GetRequiredService<IGrainFactory>();
        for (int id = 0; id <= 10; id++)
        {
            logger.LogError("Getting grain with id {id}", id);
            // In a multi-silo cluster this call may cross silos.
            string? data = await grainFactory.GetGrain<ITestGrain>(id.ToString()).GetTestDataAsync();
            if (data != null)
            {
                logger.LogError("Found data {data}", data);
                result.Add(data);
            }
        }
        logger.LogError("Returning list {result}", result);
        return result;
    }
}
namespace Insights.Grains.Interfaces;

[Alias("ITestIndexGrain")]
public interface ITestIndexGrain : IGrainWithStringKey
{
    [Alias("ListTestDataAsync")]
    Task<List<string>> ListTestDataAsync();
}

namespace Insights.Grains.Interfaces;

[Alias("ITestGrain")]
public interface ITestGrain : IGrainWithStringKey
{
    [Alias("PutTestDataAsync")]
    Task PutTestDataAsync(string data);

    [Alias("GetTestDataAsync")]
    Task<string?> GetTestDataAsync();
}
Whenever we call the list operation in a multi-silo cluster, the operation fails at the response phase, when the index grain is awaiting a response from a grain on another silo.
builder.MapPut("/api/testgrain/{id}", [Authorize(Policy = Authorization.Policy.Admin)] async (
    IGrainFactory grainFactory,
    [FromRoute] string id,
    [FromBody] string data) =>
{
    ITestGrain testGrain = grainFactory.GetGrain<ITestGrain>(id);
    await testGrain.PutTestDataAsync(data);
});

builder.MapGet("/api/testgrain/{id}", [Authorize(Policy = Authorization.Policy.Admin)] async (
    IGrainFactory grainFactory,
    [FromRoute] string id) =>
{
    ITestGrain testGrain = grainFactory.GetGrain<ITestGrain>(id);
    return await testGrain.GetTestDataAsync();
});

builder.MapGet("/api/testgrains", [Authorize(Policy = Authorization.Policy.Admin)] async (
    IGrainFactory grainFactory) =>
{
    ITestIndexGrain testGrain = grainFactory.GetGrain<ITestIndexGrain>(string.Empty);
    return await testGrain.ListTestDataAsync();
});
This failure is avoided when, instead of string.Empty, we declare the index grain as IGrainWithGuidKey and use Guid.Empty as its key. I believe this is a legitimate bug or an undocumented limitation of the Orleans framework.
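The workaround described above might look like the following sketch. It redeclares ITestIndexGrain with a Guid key and passes Guid.Empty at the call site. TestIndexClient is a hypothetical helper name used only for illustration and is not part of the original repro. The grain class itself would not need to change, since it never reads its own key.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;

namespace Insights.Grains.Interfaces;

// Same interface as in the repro, but keyed by Guid instead of string.
[Alias("ITestIndexGrain")]
public interface ITestIndexGrain : IGrainWithGuidKey
{
    [Alias("ListTestDataAsync")]
    Task<List<string>> ListTestDataAsync();
}

// Hypothetical helper (not in the original post) showing the call site.
public static class TestIndexClient
{
    public static Task<List<string>> ListAsync(IGrainFactory grainFactory)
    {
        // Guid.Empty replaces string.Empty as the fixed key for the index grain.
        ITestIndexGrain index = grainFactory.GetGrain<ITestIndexGrain>(Guid.Empty);
        return index.ListTestDataAsync();
    }
}
```

The "/api/testgrains" endpoint would then call TestIndexClient.ListAsync(grainFactory), or inline the same GetGrain call, instead of passing string.Empty.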
Orleans should throw in this case. Thanks for tracking it down. EDIT: if you're willing, a PR would be welcome.
This has bitten me in the past too.
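Until the framework rejects empty keys itself, one option is to fail fast in application code. The following is a minimal sketch, assuming a hypothetical GrainKeyGuard class with RequireNonEmpty and GetGrainChecked helpers. None of these names are part of the Orleans API; only IGrainFactory.GetGrain and IGrainWithStringKey are.

```csharp
using System;
using Orleans;

// Hypothetical application-level guard; not part of the Orleans API.
public static class GrainKeyGuard
{
    // Returns the key unchanged, or throws if it is null or empty.
    public static string RequireNonEmpty(string? key)
    {
        if (string.IsNullOrEmpty(key))
        {
            throw new ArgumentException("Grain key must not be null or empty.", nameof(key));
        }
        return key;
    }

    // Wraps IGrainFactory.GetGrain so an empty string key fails at the call site
    // instead of surfacing later as a cross-silo timeout.
    public static TGrain GetGrainChecked<TGrain>(this IGrainFactory grainFactory, string? key)
        where TGrain : IGrainWithStringKey
    {
        return grainFactory.GetGrain<TGrain>(RequireNonEmpty(key));
    }
}
```

With this in place, grainFactory.GetGrainChecked<ITestIndexGrain>(string.Empty) would throw an ArgumentException immediately rather than returning a grain reference.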
I'm using Orleans 8.1.0 and I'm encountering an error that only occurs in our Production stage, which has two silos in its cluster. I make a request to a generic catalog-like grain (currently implemented as a stateless worker), which then activates and calls a specific data grain of the same generic type. From the logs, it appears that processing proceeds as normal through both grains until it's time to return the result from the second grain.
If the second grain was activated in the same silo as the first, everything succeeds and the whole chain of requests takes about half a second. If the second grain was activated in a different silo, the request times out, seemingly as if there is some kind of deadlock.
I would assume that if there is no deadlock within a single silo, there would not be one across multiple silos. Could this be a silent serialization failure? I'm not sure how to proceed.