Closed by freddyrios 2 years ago
Roman has stated this was already fixed in a release later than the one we used. Since this issue is hard to reproduce, the work here is to upgrade to the latest version (4.4.1) and keep an eye on cluster stability in some of the failure scenarios we have tried in the past.
Our latest internal release is now using 4.4.1.
Although we don't have reproduction steps to validate that this is fully gone, input from Roman plus code review makes it likely to be fixed now.
We'll roll out the release and follow up if we ever run into these symptoms again.
This happened on a test system that is not a proper/full cluster (2 nodes), so any one failed node takes down the cluster (which is ok for this test system).
There are 2 problems:
The exception below is from A.log, captured via AppDomain.CurrentDomain.UnhandledException:
raft.log (timestamps are shifted 1 hr from the above, so this is the same time):
C.log (just another of our logs with events just before the crash):
node 2 C.log (same as above):
node 2 raft.log: