adshares / ads

ADS Blockchain
Other
21 stars 4 forks source link

3 nodes stopped after entering readonly mode #151

Closed PawelPodkalicki closed 5 years ago

PawelPodkalicki commented 6 years ago

During tests three nodes stopped simultaneously. Number of this nodes were 2, 3, 10 (dec). Each of them logged connection error with peer. They recognized that at least one peer is in ST_STOPED state.

node2 - broken connection with node3

[1534540800] CLOCK: 00 (check:0 wait:0 peers:34 allpeers:35 hash:133C7657 now:5B773A00 ticket:0 msg:0 txs:0) [,3*,4,5,6,7,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,42,48,54,64]

node3 - broken connection with node1, node2, node10

[1534540800] CLOCK: 00 (check:0 wait:0 peers:33 allpeers:34 hash:133C7657 now:5B773A00 ticket:0 msg:0 txs:0) [,1*,2*,4,5,6,7,8,9,10*,11,12,13,15,16,17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,33,34,35,42]

The star character * after node number means that node is in ST_STOPED state.

PawelPodkalicki commented 6 years ago

I compared logs with working node5. It had SYNC entries, which were missing in stopped nodes.

[1534540288] 001C SYNC OK
[1534540288] 0015 SYNC OK
[1534540288] 0014 SYNC OK
[1534540288] 0004 SYNC OK
[1534540288] 0023 SYNC OK
[1534540288] 0012 SYNC OK
[1534540288] 0009 SYNC OK
[1534540288] 0008 SYNC OK
[1534540288] 000C SYNC OK
[1534540288] 001D SYNC OK
[1534540289] 0020 SYNC OK

What is more node5 accepted block in 2 seconds, while nodes that stopped (n2, n3, n10) didn't accept block in half block period (512/2=256 s) and entered readonly state.

PawelPodkalicki commented 6 years ago

In case of n3 and n10 error KILL_DETECTED is present. From node3 log:

[1534540306] 0002 KILL detected ! (HANDLE READ HEADER), leaving
[1534540306] 000A KILL detected ! (HANDLE READ HEADER), leaving
[1534540306] CLOCK: FFFFFFFFFFFFFFEE (check:0 wait:0 peers:33 allpeers:34 hash:A9BA7963 now:5B773800 ticket:0 msg:0 txs:0) [,1,2*,4,5,6,7,8,9,10*,11,12,13,15,16,17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,33,34,35,42]
[1534540307] 0001 KILL detected ! (HANDLE READ HEADER), leaving
[1534540307] CLOCK: FFFFFFFFFFFFFFED (check:0 wait:0 peers:33 allpeers:34 hash:A9BA7963 now:5B773800 ticket:0 msg:0 txs:0) [,1*,2*,4,5,6,7,8,9,10*,11,12,13,15,16,17,18,19,20,22,23,24,25,26,27,28,29,30,31,32,33,34,35,42]

From code analyze, node2 also recognized that his peer is in ST_STOPED state, but reason is unknown. Basing on escd\peer.hpp two causes are possible.

l. 93
    void iorun() {
        DLOG("%04X PEER IORUN START\n",svid);
        try {
            peer_io_service_.run();
        } //Now we know the server is down.
        catch (std::exception& e) {
//FIXME, stop peer after Broken pipe (now does not stop if peer ends with 'assert')
//FIXME, wipe out inactive peers (better solution)
            DLOG("%04X CATCH IORUN Service.Run error:%s\n",svid,e.what());
            leave();
        }        
    }

l. 151
    void connect(const boost::system::error_code& error)
    {
        //only outgoing connection
        if(error) {            
            DLOG("%04X PEER ACCEPT ERROR address %s: port%d\n",svid, addr.c_str(), port);            
            leave();
            return;
        }
(...)

In other cases leave() function occurs together with ELOG function. Only in these two with DLOG.

PawelPodkalicki commented 6 years ago

What does ST_STOPED state of peer mean? Why does it not change, if peer works correctly?

PawelPodkalicki commented 6 years ago

Attached log from stopeed node and one that work correctly in the same time. 20180820_peer_st_stoped.zip

PawelPodkalicki commented 6 years ago

Last remark is that one block before 3 nodes stopped all other nodes has temporary peer disconnection. Only that three has stable amount of peers.

jzemlo commented 6 years ago

error.log

m-pilarczyk commented 6 years ago

000f.log 0005.log