rmn388 closed 3 years ago
Here is the debug output with the DISCOVERY tag enabled:
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10044, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:2, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10063, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:3, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1269,"message":"Updated old buffer packet"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10020, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:2, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10058, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:3, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:20003, clusterId:f8274e21, clusterSize:6, freeIn:0, freeOut:1, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10022, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:1, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:20001, clusterId:f8274e21, clusterSize:6, freeIn:0, freeOut:3, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10052, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:2, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1236,"message":"JOIN_ME: sender:10037, clusterId:f8274e21, clusterSize:72, freeIn:0, freeOut:0, ack:0"}
{"type":"log","tag":"DISCOVERY","file":"Node.cpp","line":1301,"message":"Overwrote one from our own cluster"}
This shows JOIN_ME packets from the two clusters, which have the same clusterId but different clusterSizes of 72 and 6.
Hi,
thanks for the detailed analysis. This is a known issue with the version that is currently on GitHub. We ran a large number of simulations in our Simulator and found some coding bugs in the meantime. These have been fixed, and our Simulator can no longer reproduce this behaviour, but due to my parental leave it took longer than expected to publish the new version on GitHub. I cannot promise, but I should be able to publish our latest stable release to GitHub in the next few days.
For your analysis:
Hi, the new version 0.8.540 was just released. Give it a try and see if the issue persists. If yes, we must investigate as we have not seen this issue internally in our tests and our simulator was also not able to find a bug.
Marius
Thanks, I look forward to trying that out. In the meantime, I implemented a failsafe where each node checks whether there is a stable cluster with the same ClusterId but a larger ClusterSize, and disconnects from its smaller cluster if that is the case.
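Roughly, the check looks like the sketch below. This is a simplified illustration and not the actual firmware code: `JoinMeInfo`, `ClusterState`, `shouldLeaveSmallerCluster`, and the `minReports` threshold are hypothetical names, and "stable" is approximated here by requiring several received JOIN_ME packets that agree on the larger size.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary of a received JOIN_ME packet. The field names mirror
// the DISCOVERY log output, not the real packet structs.
struct JoinMeInfo {
    std::uint16_t sender;
    std::uint32_t clusterId;
    std::int16_t clusterSize;
};

// Hypothetical view of the node's own cluster membership.
struct ClusterState {
    std::uint32_t clusterId;
    std::int16_t clusterSize;
};

// Returns true if enough received packets advertise our own clusterId with a
// larger clusterSize. Requiring minReports such packets is an assumption
// meant to avoid reacting to a size update that is still propagating through
// a healthy cluster.
bool shouldLeaveSmallerCluster(const ClusterState& own,
                               const std::vector<JoinMeInfo>& seen,
                               std::size_t minReports = 3) {
    std::size_t reports = 0;
    for (const JoinMeInfo& packet : seen) {
        if (packet.clusterId == own.clusterId &&
            packet.clusterSize > own.clusterSize) {
            ++reports;
        }
    }
    return reports >= minReports;
}
```

A node in the 6-node cluster that hears three or more packets with clusterSize 72 for its own clusterId would then drop its connections and rejoin through normal discovery. Nodes in the larger cluster never see a bigger size for that id, so they stay connected.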
Closing this as it was inactive for some time, feel free to reopen.
I had a cluster of 77 nodes and took 71 of them away to a different location. At that point there were two separate clusters, one of 71 nodes and one of 6 nodes.
When I brought them back, with all nodes in the same room, the two clusters did not merge into one big cluster. To investigate, I turned on a sinknode. Each time I reset it, it would connect to one cluster or the other, and I could get some info.
status info from sinknode when connected to the smaller cluster:
status info from sinknode when connected to the larger cluster:
The problem seems to be that the clusterId of both networks is the same, so they won't connect, even though they are not part of the same cluster and have different clusterSizes. I have observed behavior like this in the past, but it has been elusive and I hadn't been able to gather data like this until now.
Potential solutions
If my understanding is correct, this could be addressed on two levels, one error and one failsafe: