Closed PawelPodkalicki closed 5 years ago
I compared logs with working node5. It had SYNC
entries, which were missing in stopped nodes.
[1534540288] 001C SYNC OK
[1534540288] 0015 SYNC OK
[1534540288] 0014 SYNC OK
[1534540288] 0004 SYNC OK
[1534540288] 0023 SYNC OK
[1534540288] 0012 SYNC OK
[1534540288] 0009 SYNC OK
[1534540288] 0008 SYNC OK
[1534540288] 000C SYNC OK
[1534540288] 001D SYNC OK
[1534540289] 0020 SYNC OK
What is more node5 accepted block in 2 seconds, while nodes that stopped (n2, n3, n10) didn't accept block in half block period (512/2=256 s) and entered readonly state.
In case of n3 and n10 error KILL_DETECTED
is present.
From node3 log:
[1534540306] 0002 KILL detected ! (HANDLE READ HEADER), leaving
[1534540306] 000A KILL detected ! (HANDLE READ HEADER), leaving
[1534540306] CLOCK: FFFFFFFFFFFFFFEE (check:0 wait:0 peers:33 allpeers:34 hash:A9BA7963 now:5B773800 ticket:0 msg:0 txs:0) [,1,2*,4,5,6,7,8,9,10*,11,12,13,15,16,17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,33,34,35,42]
[1534540307] 0001 KILL detected ! (HANDLE READ HEADER), leaving
[1534540307] CLOCK: FFFFFFFFFFFFFFED (check:0 wait:0 peers:33 allpeers:34 hash:A9BA7963 now:5B773800 ticket:0 msg:0 txs:0) [,1*,2*,4,5,6,7,8,9,10*,11,12,13,15,16,17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,33,34,35,42]
From code analyze, node2 also recognized that his peer is in ST_STOPED
state, but reason is unknown.
Basing on escd\peer.hpp
two causes are possible.
l. 93
void iorun() {
DLOG("%04X PEER IORUN START\n",svid);
try {
peer_io_service_.run();
} //Now we know the server is down.
catch (std::exception& e) {
//FIXME, stop peer after Broken pipe (now does not stop if peer ends with 'assert')
//FIXME, wipe out inactive peers (better solution)
DLOG("%04X CATCH IORUN Service.Run error:%s\n",svid,e.what());
leave();
}
}
l. 151
void connect(const boost::system::error_code& error)
{
//only outgoing connection
if(error) {
DLOG("%04X PEER ACCEPT ERROR address %s: port%d\n",svid, addr.c_str(), port);
leave();
return;
}
(...)
In other cases leave()
function occurs together with ELOG
function. Only in these two with DLOG
.
What does ST_STOPED
state of peer mean? Why does it not change, if peer works correctly?
Attached log from stopeed node and one that work correctly in the same time. 20180820_peer_st_stoped.zip
Last remark is that one block before 3 nodes stopped all other nodes has temporary peer disconnection. Only that three has stable amount of peers.
During tests three nodes stopped simultaneously. Number of this nodes were 2, 3, 10 (dec). Each of them logged connection error with peer. They recognized that at least one peer is in
ST_STOPED
state.node2 - broken connection with node3
node3 - broken connection with node1, node2, node10
The star character
*
after node number means that node is inST_STOPED
state.