Open jan-johansson-mr opened 1 year ago
Exploring the issue further: delaying the test client's connection to the cluster (with the silos) by about 30 seconds makes the lookup issue (of the grain with the factory) go away. That's good. But the `System.IO.FileNotFoundException: Could not load file or assembly` is still there from time to time.
It seems that the grain request hits the wrong Silo...
The factory pulls a reference to `ITestGrain002` (on Silo B) and invokes a call, and the call hits the Silo with `ITestGrain001` (Silo A) instead, so the Silo (A) can't load the assembly for `ITestGrain002`. A repeat of the call (since the invocation in my test project is within a try-catch-repeat cycle), with a renewed factory `GetGrain<ITestGrain002>(...)`, now hits the correct Silo (B), and the issue goes away.
If I do not renew the grain reference with the factory (i.e., do not call `GetGrain` again), then the call will continue to hit the wrong Silo.
So, in my test bench:
Silo | Grain
---|---
A | 001
B | 002
Sometimes, in the startup phase, a `GetGrain<ITestGrain002>(...)` gets a reference that hits Silo A. Renewal of the grain reference (e.g., a new `GetGrain<ITestGrain002>(...)`) produces a reference that properly hits Silo B. The renewal is done in a try-catch-repeat cycle.
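For anyone trying to reproduce this, the try-catch-repeat cycle with a renewed reference can be sketched roughly as below. This is only an illustration, not the actual test-bench code: `GrainCallRetry`, `InvokeWithFreshReferenceAsync`, and the parameters `maxAttempts` and `delay` are hypothetical names, and the helper is deliberately generic so it does not depend on any Orleans type. The one point it shows is that the reference is requested again on every attempt instead of being reused.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of Orleans.
public static class GrainCallRetry
{
    // getReference stands in for something like factory.GetGrain<ITestGrain002>(...);
    // it is invoked anew on every attempt so a stale reference is never reused.
    // invoke stands in for the call made on the grain.
    public static async Task<TResult> InvokeWithFreshReferenceAsync<TGrain, TResult>(
        Func<TGrain> getReference,
        Func<TGrain, Task<TResult>> invoke,
        int maxAttempts = 3,
        TimeSpan? delay = null)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                var grain = getReference();   // renew the reference
                return await invoke(grain);   // then make the call
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Swallow the failure and wait before the next attempt.
                // On the last attempt the filter is false and the exception propagates.
                await Task.Delay(delay ?? TimeSpan.FromSeconds(1));
            }
        }
    }
}
```

In the test bench the two delegates would be the `GetGrain<ITestGrain002>(...)` call and whatever grain method is being invoked. Catching every exception type is a simplification for the sketch.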
The setup is a heterogeneous cluster. That is, in Silo A we have GrainTest001, and in Silo B we have GrainTest002.
Hi @adityamandaleeka, I've pursued the issue further.
It turns out that there seems to be a problem with heterogeneous clusters (as described above).
I've developed support for multi-cluster clients in my test bench, that is, I can pull a cluster client pointing to one cluster, while another client points to another cluster. Creating homogeneous clusters with this strategy turns out to give zero issues, so the lookup is good in the homogeneous cluster (silos) case.
Another observation with the heterogeneous cluster setup is that when the lookup fails (as described above), the failure will be persistent until I change the ID of the grain, or in summarized form:
Thanks
I've seen this issue recently - has there been any progress on this investigation @ReubenBond ?
@insylogo are you able to provide any more info? The bug doesn't make sense to me so far.
Hi,
I'm using a fairly simple configuration, with docker images and Consul as membership provider for my setup. The configuration has 2 silos, one grain per silo (as a test bench). The issue, which I have not observed in earlier versions of Orleans, is that there is always a startup problem when I run the setup in Visual Studio as a debug session.
One or the other grain can't be resolved by the grain factory, giving the following message
Could not find an implementation for interface
E.g., in one startup we have the message
Could not find an implementation for interface ITestGrain001
And then in another startup
Could not find an implementation for interface ITestGrain002
It's random.
I've put the factory (resolving the grain) into a try-catch loop, and timed in that loop how long it takes for the factory to be able to resolve the grain: it takes one minute. Then the error goes away and mostly everything works fine.
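The timing measurement can be sketched roughly like this. It is a minimal illustration under assumptions, not the code from the test bench: `ResolutionTimer`, `MeasureUntilSuccessAsync`, and `pollInterval` are hypothetical names, and `attempt` stands in for the factory call that throws while the grain cannot be resolved.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of Orleans.
public static class ResolutionTimer
{
    // Repeats attempt until it stops throwing and returns the elapsed time.
    public static async Task<TimeSpan> MeasureUntilSuccessAsync(
        Action attempt,
        TimeSpan pollInterval,
        CancellationToken cancellationToken = default)
    {
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                attempt();                 // e.g. resolve the grain with the factory
                return stopwatch.Elapsed;  // first success: report how long it took
            }
            catch (Exception)
            {
                // Resolution failed; wait and try again.
                await Task.Delay(pollInterval, cancellationToken);
            }
        }
    }
}
```

Passing a cancellation token with a timeout keeps the loop from spinning forever if the grain never resolves.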
But sometimes it gets even worse.
We have another issue, once in a while, when the factory successfully resolves the grain (after the one minute of retrying):
System.IO.FileNotFoundException: Could not load file or assembly
This exception targets the grain implementation, for the grain interface.
I've also put the invocation to the grain into a try-catch loop, and it turns out that after the first try and fail (loading the assembly), the second has always succeeded.
After these retries, once everything seems loaded, the setup runs fine. Until I bring everything down and up again, and then we're back to this behavior.
Here is the stack trace when the interface can't be resolved (in this case for ITestGrain001):
Here is the stack trace when the assembly can't load:
Also, the documentation about using Consul as a membership provider is sadly outdated. There are changes made to how to use Consul in the code, but this is not reflected at all by the documentation.
Thanks