bluerange-io / bluerange-mesh

BlueRange Mesh (formerly FruityMesh) - The first completely connection-based open source mesh on top of Bluetooth Low Energy (4.1/5.0 or higher)
https://bluerange.io/

Two clusters with the same clusterId are not connecting to create one large cluster. #114

Closed: rmn388 closed this issue 3 years ago

rmn388 commented 5 years ago

I had a cluster of 77 nodes and took 71 of them away to a different location. At that point there were two separate clusters: one of 71 nodes and one of 6 nodes.

When I brought them back, with all nodes in the same room, the two clusters did not join into one big one. To investigate, I turned on a sink node; by resetting it, it would connect to one cluster or the other and I could gather some info.

status info from sinknode when connected to the smaller cluster:

Mesh clusterSize:7, clusterId:4163325473
Enrolled 1: networkId:40002, deviceType:3, NetKey E1:31:....:E5:1E

status info from sinknode when connected to the larger cluster:

Mesh clusterSize:72, clusterId:4163325473
Enrolled 1: networkId:40002, deviceType:3, NetKey E1:31:....:E5:1E

The problem seems to be that the clusterId of both networks is the same, so they won't connect, even though they are not part of the same cluster and have different clusterSizes. I have observed behavior like this in the past, but it has been elusive and I hadn't been able to gather data like this before.

Potential solutions

If my understanding is correct, this could be addressed on two levels, one error and one failsafe:

  1. Error: A cluster breaking up should not be able to result in two clusters with the same ID; to my understanding, the smaller one should always generate a new ID.
  2. Failsafe: If the above does occur, logic could be added to recognize it. If a node sees another node with the same clusterId but a larger clusterSize, it could disconnect from its own cluster. The tricky part is that this condition can also be met during normal operation, when a cluster's size changes but the new size is still propagating to all the nodes in the cluster, or when old join_me packets carry outdated clusterSizes; a sketch of such a check follows this list.
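A minimal sketch of the check from point 2. This is not actual FruityMesh code; all names (JoinMePacket, FailsafeState, shouldAbandonCluster) are hypothetical, and the threshold is an arbitrary placeholder meant to ride out the propagation delay described above:

```cpp
// Hypothetical failsafe check, NOT FruityMesh's actual API.
// The hysteresis counter guards against the tricky case: a larger
// clusterSize that is merely an in-flight update from our own cluster.

#include <cstdint>

struct JoinMePacket {
    uint32_t clusterId;    // id advertised by the sender's cluster
    uint16_t clusterSize;  // size the sender currently believes in
};

struct FailsafeState {
    uint32_t ownClusterId   = 0;
    uint16_t ownClusterSize = 0;
    uint8_t  suspicionCount = 0;             // consecutive suspicious sightings
    static constexpr uint8_t THRESHOLD = 10; // e.g. ~10 discovery intervals
};

// Returns true once we are confident that a *different*, larger cluster is
// reusing our clusterId and we should leave our cluster and re-discover.
bool shouldAbandonCluster(FailsafeState& s, const JoinMePacket& p) {
    bool suspicious = (p.clusterId == s.ownClusterId) &&
                      (p.clusterSize > s.ownClusterSize);
    if (suspicious) {
        // Could still be our own cluster whose new size has not reached
        // us yet, so require the condition to persist before acting.
        if (++s.suspicionCount >= FailsafeState::THRESHOLD) {
            s.suspicionCount = 0;
            return true;
        }
    } else {
        s.suspicionCount = 0; // condition cleared, e.g. a size update arrived
    }
    return false;
}
```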
rmn388 commented 5 years ago

Here is the debug output with the DISCOVERY tag enabled:

{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10044, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:2, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10063, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:3, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1269,"message":"Updated old buffer packet"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10020, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:2, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10058, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:3, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:20003, clusterId:f8274e21, clusterSize:6, freeIn:0, freeOut:1, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10022, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:1, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:20001, clusterId:f8274e21, clusterSize:6, freeIn:0, freeOut:3, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10052, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:2, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10037, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:0, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}

Showing join_me packets from the two clusters that have the same clusterId but different clusterSizes of 72 and 6.

mariusheil commented 5 years ago

Hi,

thanks for the detailed analysis. This is a known issue with the version we currently have on GitHub. We used our simulator to run tons of simulations and found some coding bugs in the meantime. These have been fixed, and our simulator was no longer able to reproduce this behaviour, but due to my parental leave it took longer than expected to publish the new version on GitHub. I cannot promise anything, but I should be able to publish our latest stable release to GitHub in the next few days.

For your analysis:

  1. You are correct, but previous coding bugs could lead to race conditions. These have since been fixed. We have a nightly clustering test in which our simulator runs a few hundred clusterings with up to 200 nodes, node resets, and more. So far, we have not found any issues.
  2. This would indeed be a failsafe mechanism we haven't yet implemented. We would probably implement it using a keep-alive message that is sent through the mesh and used to count nodes; a rough sketch follows this list. This is on the todo list but not prioritized very highly, as we have not found any more bugs with our latest stable version.
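As a rough illustration of that keep-alive counting idea: all names below are hypothetical, and it assumes a tree-shaped mesh with a synchronous recursive count, which a real implementation would replace with asynchronous messages over the BLE connections:

```cpp
// Sketch of counting nodes through the mesh and comparing the result
// against the believed clusterSize. Not actual FruityMesh API.

#include <cstdint>
#include <vector>

struct MeshNode {
    std::vector<MeshNode*> children; // downstream mesh connections

    // In a real mesh this would be an asynchronous request/response
    // travelling over BLE connections, not a recursive call.
    uint16_t countReachableNodes() const {
        uint16_t total = 1; // count ourselves
        for (const MeshNode* child : children) {
            total += child->countReachableNodes();
        }
        return total;
    }
};

// The failsafe: if the periodic count disagrees with the clusterSize we
// believe in, the clustering state is stale and should be rebuilt.
bool clusterSizeIsConsistent(const MeshNode& root, uint16_t believedSize) {
    return root.countReachableNodes() == believedSize;
}
```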
mariusheil commented 5 years ago

Hi, the new version 0.8.540 was just released. Give it a try and see if the issue persists. If it does, we will have to investigate, as we have not seen this issue internally in our tests and our simulator has also not been able to find a bug.

Marius

rmn388 commented 5 years ago

Thanks, I look forward to trying that out. In the meantime, I implemented a failsafe in which the nodes check whether there is a stable cluster with the same clusterId but a larger clusterSize, and disconnect from their own smaller cluster if that condition holds. In rough pseudocode, the action side looks like the sketch below (names are illustrative, not actual FruityMesh APIs).
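```cpp
// Hypothetical sketch of the failsafe's action side: once the check fires,
// the node severs its mesh connections, becomes a cluster of one with a
// freshly generated clusterId, and re-enters discovery so it can join the
// larger cluster cleanly. None of these names are real FruityMesh APIs.

#include <cstdint>
#include <random>

struct Node {
    uint32_t clusterId   = 0;
    uint16_t clusterSize = 1;

    void disconnectAllMeshConnections() { /* tear down BLE links */ }
    void startDiscovery()               { /* advertise / scan again */ }

    void abandonCluster() {
        disconnectAllMeshConnections();
        std::random_device rd;
        clusterId   = rd();   // fresh id avoids colliding with the old one
        clusterSize = 1;      // we are a cluster of one again
        startDiscovery();
    }
};
```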

mariusheil commented 3 years ago

Closing this as it has been inactive for some time; feel free to reopen.