dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

I have a weird issue with Orleans, that I didn't have before, using Consul as membership provider #8323

Open jan-johansson-mr opened 1 year ago

jan-johansson-mr commented 1 year ago

Hi,

I'm using a fairly simple configuration, with Docker images and Consul as the membership provider. The setup is a test bench with 2 silos and one grain per silo. The issue, which I have not observed in earlier versions of Orleans, is that there is always a startup problem when I run the setup in Visual Studio as a debug session.

One or the other grain can't be resolved by the grain factory, giving the following message

Could not find an implementation for interface

E.g., in one startup we have the message

Could not find an implementation for interface ITestGrain001

And then in another startup

Could not find an implementation for interface ITestGrain002

It's random.

I've put the factory call (resolving the grain) into a try-catch loop and timed how long it takes before the factory can resolve the grain: one minute. Then the error goes away and mostly everything works fine.
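A minimal sketch of such a timed retry loop, for anyone who wants to reproduce the measurement. It is not the code from my test bench: `RetryProbe` and `ResolveWithRetryAsync` are hypothetical names, and the only assumption about Orleans is that the failed lookup throws `System.ArgumentException`, as in the stack trace further down.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper, not part of Orleans.
public static class RetryProbe
{
    // Calls `resolve` until it stops throwing ArgumentException or `timeout`
    // has elapsed, and reports how long the successful resolution took.
    public static async Task<(T Result, TimeSpan Elapsed)> ResolveWithRetryAsync<T>(
        Func<T> resolve, TimeSpan timeout, TimeSpan delay)
    {
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                return (resolve(), stopwatch.Elapsed);
            }
            // Once the timeout has passed the filter is false, so the
            // exception propagates to the caller instead of being retried.
            catch (ArgumentException) when (stopwatch.Elapsed < timeout)
            {
                await Task.Delay(delay);
            }
        }
    }
}
```

It would be called with something like `() => client.GetGrain<ITestGrain001>(id)` as `resolve`, where `client` is the cluster client and `id` is a `Guid` key.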

But sometimes it gets even worse.

We see another issue once in a while, when the factory successfully resolves the grain (after the one-minute retry):

System.IO.FileNotFoundException: Could not load file or assembly

This exception targets the grain implementation, for the grain interface.

I've also put the invocation of the grain into a try-catch loop, and it turns out that after the first attempt fails (loading the assembly), the second has always succeeded.

After these retries, once everything seems loaded, the setup runs fine. That lasts until I bring everything down and up again, and then we're back to this behavior.

Here is the stack trace when the interface can't be resolved (in this case for ITestGrain001):

  System.ArgumentException: Could not find an implementation for interface TestBench.ITestGrain001
     at Orleans.GrainInterfaceTypeToGrainTypeResolver.GetGrainType(GrainInterfaceType interfaceType) in /_/src/Orleans.Core/Core/GrainInterfaceTypeToGrainTypeResolver.cs:line 102
     at Orleans.GrainFactory.GetGrain[TGrainInterface](Guid primaryKey, String grainClassNamePrefix) in /_/src/Orleans.Core/Core/GrainFactory.cs:line 45

Here is the stack trace when the assembly can't load:

     at System.Reflection.RuntimeAssembly.<InternalLoad>g____PInvoke|47_0(NativeAssemblyNameParts* pAssemblyNameParts, ObjectHandleOnStack requestingAssembly, StackCrawlMarkHandle stackMark, Int32 throwOnFileNotFound, ObjectHandleOnStack assemblyLoadContext, ObjectHandleOnStack retAssembly)
     at System.Reflection.RuntimeAssembly.InternalLoad(AssemblyName assemblyName, StackCrawlMark& stackMark, AssemblyLoadContext assemblyLoadContext, RuntimeAssembly requestingAssembly, Boolean throwOnFileNotFound)
     at System.Reflection.Assembly.Load(AssemblyName assemblyRef)
     at Orleans.Serialization.TypeSystem.CachedTypeResolver.<TryPerformUncachedTypeResolution>g__ResolveAssembly|9_0(AssemblyName assemblyName) in /_/src/Orleans.Serialization/TypeSystem/CachedTypeResolver.cs:line 141
     at System.TypeNameParser.ResolveAssembly(String asmName, Func`2 assemblyResolver, Boolean throwOnError, StackCrawlMark& stackMark)
     at System.TypeNameParser.ConstructType(Func`2 assemblyResolver, Func`4 typeResolver, Boolean throwOnError, Boolean ignoreCase, StackCrawlMark& stackMark)
     at System.TypeNameParser.GetType(String typeName, Func`2 assemblyResolver, Func`4 typeResolver, Boolean throwOnError, Boolean ignoreCase, StackCrawlMark& stackMark)
     at System.Type.GetType(String typeName, Func`2 assemblyResolver, Func`4 typeResolver, Boolean throwOnError)
     at Orleans.Serialization.TypeSystem.CachedTypeResolver.TryPerformUncachedTypeResolution(String fullName, Type& type, Assembly[] assemblies) in /_/src/Orleans.Serialization/TypeSystem/CachedTypeResolver.cs:line 113
     at Orleans.Serialization.TypeSystem.CachedTypeResolver.TryPerformUncachedTypeResolution(String name, Type& type) in /_/src/Orleans.Serialization/TypeSystem/CachedTypeResolver.cs:line 57
     at Orleans.Serialization.TypeSystem.CachedTypeResolver.TryResolveType(String name, Type& type) in /_/src/Orleans.Serialization/TypeSystem/CachedTypeResolver.cs:line 45
     at Orleans.Serialization.TypeSystem.TypeConverter.ParseInternal(TypeSpec parsed, Type& type) in /_/src/Orleans.Serialization/TypeSystem/TypeConverter.cs:line 332
     at Orleans.Serialization.TypeSystem.TypeConverter.ResolveCompoundAliasType[TState](TupleTypeSpec input, TState& state) in /_/src/Orleans.Serialization/TypeSystem/TypeConverter.cs:line 555
     at Orleans.Serialization.TypeSystem.RuntimeTypeNameRewriter.TypeRewriter`1.HandleCompoundType(TupleTypeSpec type, String assemblyName) in /_/src/Orleans.Serialization/TypeSystem/RuntimeTypeNameRewriter.cs:line 233
     at Orleans.Serialization.TypeSystem.RuntimeTypeNameRewriter.TypeRewriter`1.ApplyInner(TypeSpec input, String assemblyName) in /_/src/Orleans.Serialization/TypeSystem/RuntimeTypeNameRewriter.cs:line 77
     at Orleans.Serialization.TypeSystem.RuntimeTypeNameRewriter.TypeRewriter`1.Rewrite(TypeSpec input) in /_/src/Orleans.Serialization/TypeSystem/RuntimeTypeNameRewriter.cs:line 72
     at Orleans.Serialization.TypeSystem.RuntimeTypeNameRewriter.Rewrite[TState](TypeSpec input, Rewriter`1 rewriter, CompoundAliasResolver`1 compoundAliasRewriter, TState& state) in /_/src/Orleans.Serialization/TypeSystem/RuntimeTypeNameRewriter.cs:line 45
     at Orleans.Serialization.TypeSystem.TypeConverter.ParseInternal(TypeSpec parsed, Type& type) in /_/src/Orleans.Serialization/TypeSystem/TypeConverter.cs:line 316
     at Orleans.Serialization.TypeSystem.TypeCodec.TryRead[TInput](Reader`1& reader) in /_/src/Orleans.Serialization/TypeSystem/TypeCodec.cs:line 72
     at Orleans.Serialization.Codecs.FieldHeaderCodec.ReadType[TInput](Reader`1& reader, SchemaType schemaType) in /_/src/Orleans.Serialization/Codecs/FieldHeaderCodec.cs:line 187
     at Orleans.Serialization.Codecs.FieldHeaderCodec.ReadExtendedFieldHeader[TInput](Reader`1& reader, Field& field) in /_/src/Orleans.Serialization/Codecs/FieldHeaderCodec.cs:line 169
     at Orleans.Runtime.Messaging.MessageSerializer.ReadBodyObject[TInput](Message message, Reader`1& reader) in /_/src/Orleans.Core/Messaging/MessageSerializer.cs:line 130
     at Orleans.Runtime.Messaging.MessageSerializer.TryRead(ReadOnlySequence`1& input, Message& message) in /_/src/Orleans.Core/Messaging/MessageSerializer.cs:line 126
     at Orleans.Runtime.Messaging.Connection.ProcessIncoming() in /_/src/Orleans.Core/Networking/Connection.cs:line 346

Also, the documentation about using Consul as a membership provider is sadly outdated. The way Consul is used in code has changed, but the documentation does not reflect this at all.

Thanks

jan-johansson-mr commented 1 year ago

Further exploring the issue: making the test client wait about 30 seconds before connecting to the cluster (with the silos) makes the lookup issue (resolving the grain with the factory) go away. That's good. But the System.IO.FileNotFoundException: Could not load file or assembly is still there from time to time.

jan-johansson-mr commented 1 year ago

It seems that the grain request hits the wrong Silo...

The factory pulls a reference to ITestGrain002 (on Silo B) and invokes a call, and the call hits the Silo with ITestGrain001 (Silo A) instead, so the Silo (A) can't load the assembly for ITestGrain002. A repeat of the call (the invocation in my test project is within a try-catch-repeat cycle), with a renewed factory GetGrain<ITestGrain002>(...), now hits the correct Silo (B), and the issue goes away.

If I do not renew the grain reference with the factory (i.e., I do not call GetGrain again), then the call will continue to hit the wrong Silo.

So, in my test bench

Silo  Grain
A     001
B     002

Sometimes, in startup phase, a GetGrain<ITestGrain002>(...) gets a reference that hits Silo A. Renewal of the grain reference (e.g., a new GetGrain<ITestGrain002>(...)) produces a reference that properly hits Silo B. The renewal is done in a try-catch-repeat cycle.
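A rough sketch of that try-catch-repeat cycle, where the reference is obtained anew on every attempt. `FreshReferenceRetry` and `InvokeAsync` are hypothetical names, and it assumes the failure reaches the caller as `System.IO.FileNotFoundException`, which is what I observe in my test bench.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper, not part of Orleans.
public static class FreshReferenceRetry
{
    // Invokes `call` on a reference from `getReference`, fetching a new
    // reference for every attempt instead of reusing the failed one.
    public static async Task<TResult> InvokeAsync<TGrain, TResult>(
        Func<TGrain> getReference,
        Func<TGrain, Task<TResult>> call,
        int maxAttempts,
        TimeSpan delay)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (var attempt = 1; ; attempt++)
        {
            var grain = getReference(); // renewed on each attempt
            try
            {
                return await call(grain);
            }
            // On the last attempt the filter is false and the exception
            // propagates to the caller.
            catch (FileNotFoundException) when (attempt < maxAttempts)
            {
                await Task.Delay(delay);
            }
        }
    }
}
```

Here `getReference` would be `() => client.GetGrain<ITestGrain002>(id)`, and `call` would invoke whatever method the grain interface exposes.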

The setup is a heterogeneous cluster. That is, in Silo A we have GrainTest001 and in Silo B we have GrainTest002.

jan-johansson-mr commented 1 year ago

Hi @adityamandaleeka, I've pursued the issue further.

It turns out that there seems to be a problem with heterogeneous clusters (as described above).

I've developed support for multi-cluster clients in my test bench; that is, I can pull one cluster client pointing to one cluster while another client points to another cluster. Creating homogeneous clusters with this strategy gives zero issues, so the lookup is good in the homogeneous cluster (silos) case.

Another observation with the heterogeneous cluster setup is that when the lookup fails (as described above), the failure will be persistent until I change the ID of the grain, or in summarized form:

Thanks

insylogo commented 1 year ago

I've seen this issue recently - has there been any progress on this investigation @ReubenBond ?

ReubenBond commented 1 year ago

@insylogo are you able to provide any more info? The bug doesn't make sense to me so far.