@peter-perot, we have three issues here:
.NET Runtime Platform stalled
- Orleans uses a timer to detect whether the underlying .NET framework is stalled; a delay of 1 second is considered a stall, which sometimes just means that the GC is running. In your case I'm not sure, as the warning says "We are now using total of 17MB memory. gc=273, 56, 4" (@sergeybykov might know more). And as you have noted, you haven't experienced any poor user experience. Are you running the silos on VMs? Sometimes the clock on a VM runs super-fast or super-slow. Try to look at the logs of a silo that gets no traffic. If the .NET Runtime is actually stalled that frequently, some other evil is going on which probably doesn't stem from Orleans.
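For illustration only (this is not Orleans's actual watchdog code), the detection idea boils down to a timer that expects to fire at a fixed period and flags any tick that arrives much later than scheduled:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Illustrative stall detector: a 1-second timer measures the gap between
// consecutive ticks; a gap well beyond the period means the process was
// stalled (GC pause, CPU starvation, or a drifting VM clock).
class StallWatchdog
{
    static readonly TimeSpan Period = TimeSpan.FromSeconds(1);
    static readonly Stopwatch Clock = Stopwatch.StartNew();
    static TimeSpan lastTick = Clock.Elapsed;

    static void Main()
    {
        using (var timer = new Timer(CheckTick, null, Period, Period))
        {
            Thread.Sleep(Timeout.Infinite);
        }
    }

    static void CheckTick(object state)
    {
        TimeSpan now = Clock.Elapsed;
        TimeSpan gap = now - lastTick;
        lastTick = now;
        if (gap > Period + Period) // fired far later than scheduled
            Console.WriteLine($"Platform stalled for {gap.TotalMilliseconds:F0} ms");
    }
}
```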
ConnectionLossException
- An error that occurs when the ZK server fails to respond in a timely manner. In Orleans, I've hardcoded the timeout to 2 seconds. I could have made it configurable, but I've found that it's never the problem, just a symptom of a real one, i.e. network issues or ZK server configuration issues. I suggest setting
maxClientCnxns=0
in the zoo.cfg file of all ZK servers, which disables this limit. Before doing that, you can consult the ZK server logs at the times Orleans got the exception. If it's a server issue, you'd probably find some helpful correlated error there.

The org.apache exception in IIS
- For this one, I suggest setting
ZooKeeper.LogToFile = false;
before starting the silo. Or adding the necessary credentials for the app pool to create the file. Cool bug :smile:

Keep me posted.
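As a minimal sketch of where that line goes, assuming a self-hosted silo on the Orleans 1.x SiloHost API (the silo name is a placeholder; adapt to your own host setup):

```csharp
using System;
using org.apache.zookeeper;   // ZooKeeperNetEx client
using Orleans.Runtime.Host;   // Orleans 1.x silo host

class Program
{
    static void Main()
    {
        // Disable the ZK client's trace file before the silo starts,
        // so an app pool without write permissions never tries to create it.
        ZooKeeper.LogToFile = false;

        var silo = new SiloHost("silo1"); // placeholder silo name
        silo.InitializeOrleansSilo();
        silo.StartOrleansSilo();

        Console.WriteLine("Silo started. Press Enter to stop.");
        Console.ReadLine();
        silo.StopOrleansSilo();
    }
}
```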
Hi @shayhatsor, thanks for the fast reply. Yes, the whole thing is running on a VM, so I will ignore the stall messages for now.
Concerning ZooKeeper, I set maxClientCnxns=0 in zoo.cfg and restarted ZooKeeper, but it didn't help. Over the weekend several connection-loss exceptions have been logged by Orleans. Unfortunately I cannot see what ZooKeeper logged at the same time, since ZooKeeper logs approximately 50 messages per minute (DEBUG, INFO, ...) and the log has been automatically truncated. I'm no ZooKeeper expert, since I only use it for the Orleans membership protocol.
Our setup is one server (VM) running one ZooKeeper instance, 5 silos for our QA/testing environment, and 5 silos for our production environment, i.e. 10 silos talking to ZooKeeper (with 2 different deployment IDs of course: one silo farm for QA, one for production).
So I don't know where the connectivity problem with ZooKeeper comes from, given that it's all on the same VM.
> Concerning ZooKeeper, I set maxClientCnxns=0 in zoo.cfg and restarted ZooKeeper, but it didn't help
Yep, in your setup it wouldn't. You have 10 silos talking to ZK, and the default is 10 concurrent connections per IP, so you can remove that setting.
> Unfortunately I cannot see what ZooKeeper logged at the same time, since ZooKeeper logs approximately 50 messages per minute (DEBUG, INFO, ...) and the log has been automatically truncated
You can set the log to show only warnings and above; it's log4j, see the ZooKeeper Administrator's Guide and the sketch below. But I think I know what the problem is, so read on before making the change.
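A sketch of the relevant log4j.properties settings on the ZK server, assuming the stock rolling-file appender (file path and size are placeholders):

```
# Raise the threshold so only WARN and above are written.
zookeeper.root.logger=WARN, ROLLINGFILE
log4j.rootLogger=${zookeeper.root.logger}

log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.ROLLINGFILE.Threshold=WARN
log4j.appender.ROLLINGFILE.File=/var/log/zookeeper/zookeeper.log
log4j.appender.ROLLINGFILE.MaxFileSize=10MB
log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} - %-5p [%t:%C{1}@%L] - %m%n
```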
> Our setup is one server
That's a high-risk setup: if the machine goes down, production goes down. For production, one silo per machine is advisable; I've had good experience with Nomad for Orleans cluster deployment. For testing/QA the setup is much less critical: you can run a single silo with in-memory membership, a single silo with persistent membership, or use the same setup as production. There's a performance penalty to having 5 silos of the same cluster on the same machine, and almost nothing is gained: even if only one silo process crashes and not the others, a faulty silo that drains resources will still affect the other silos on the machine.
> running one ZooKeeper instance
That's probably the root of this problem. I advise that you move the ZK instance to a different machine; it shouldn't be contending for resources with the silos. Also, the minimum recommended ZK cluster size for production is 5 nodes, on 5 different machines (or VMs with constant resource allocation), e.g. as sketched below. I think the connections between the silos and ZK simply timed out under high load. Also, I've fixed the weird IIS bug you mentioned, just update the ZooKeeperNetEx package.
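A sketch of the replicated zoo.cfg for such a 5-node ensemble (hostnames, ports, and paths are placeholders; each node also needs a matching myid file in dataDir):

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=zk4.example.com:2888:3888
server.5=zk5.example.com:2888:3888
```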
@peter-perot, I almost forgot: a running Orleans cluster doesn't require 100% responsiveness from the persistent membership storage; Orleans has robust mechanisms for retrying membership operations while keeping the service available. Having said that, I believe that if you apply the suggestions from my previous comment, you'll get a much more coherent picture of what's going on.
@shayhatsor, thanks for your advice. I will configure our servers accordingly and think the connection losses will then be gone. As you said, it's no problem for Orleans if responsiveness is not 100%. And thanks for the quick fix of the NullReferenceException.
I think I have understood the membership protocol and its backing data structure by now, but one thing seems a little weird to me: is it right that Orleans never garbage-collects "dead" entries? When I start a silo, stop it gracefully (with CTRL+C), and restart it, I see a "dead" entry for the previous generation of that silo. When I repeat this several times, I see further "dead" entries for the generations that have been shut down.
My question is: when are those dead entries garbage-collected, i.e. removed from the table/database/KV-store? As far as I can see, no storage provider (e.g. Consul or ZooKeeper) garbage-collects on its own, and the interface for membership control does not contain any clean-up methods (except the DeleteMembershipTableEntries method, which seems never to be called, since it would wipe out all data for a specific deployment ID). Maybe an Orleans quirk?
@peter-perot, you're welcome. About your question: it's an issue that has come up several times in the past. I remember that when I implemented the ZK membership provider I also noticed there's no garbage collection. At first it did seem like a quirk, but in time I figured out that the dead entries are a log of cluster stability. Considering that the table grows very slowly, cleaning it up would actually be a micro-optimization.
@shayhatsor: Yes, it gives evidence about cluster stability. If it gets too large, something must be awfully wrong. And in that case one can clean up the tables manually (e.g. as sketched below).
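For the ZooKeeper provider, such a manual cleanup could look roughly like this with the stock ZK CLI; the deployment ID and child node name below are hypothetical, so inspect your own tree before deleting anything:

```
$ bin/zkCli.sh -server zk1.example.com:2181
# List the membership entries under the (hypothetical) deployment ID.
ls /my-deployment-id
# Remove one dead entry by its child node name (hypothetical).
delete /my-deployment-id/192.168.0.10:11111@155234567
```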
Closing this issue now. Thanks a lot!
I have a strange issue with the ZooKeeper membership provider. We have different production deployments on different servers for different clients, but the issue is always the same. At some point in time (here the trouble starts at 19:31 GMT, when exceptions are caught), the silos start logging the following:
On the client side (a WebAPI running with Katana/Owin) the following is logged:
So far I don't know if the "stalled" warnings have anything to do with this issue. I have configured garbage collection according to the manuals:
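A typical server-GC configuration of this kind, as a sketch (the actual file used here may differ):

```xml
<!-- Sketch: server + non-concurrent GC in the host's App.config,
     per the usual Orleans recommendation; not necessarily the
     exact settings used in this deployment. -->
<configuration>
  <runtime>
    <gcServer enabled="true"/>
    <gcConcurrent enabled="false"/>
  </runtime>
</configuration>
```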
Concerning ZooKeeper, it has been deployed out of the box with no further configuration (only the log path has been adjusted).
It seems that these exceptions have no real impact on the stability experienced by the user unless we host our WebAPI in IIS. In that case the AppPool shuts down after some hours, and the event viewer shows a weird exception from the org.apache namespace:
Any idea what is going on here?