parseTCPReplication error after restart of a SyncHost [BUG]

mgehlmann commented 3 years ago

Describe the bug We have a 2 host cluster, where the two nodes are configured as each others Sibling+SyncHost. Weakforced version: 2.4.0 built with builder.sh

The Sibling replication seems to work flawlessly. After restarting one of the nodes, however, in about half of the cases the synchronization fails after only a portion of the entries have been transferred with this error message: parseTCPReplication: error reading size, got 1 bytes, but was expecting 2 bytes

Restarting the same node again right after sometimes leads to a full synchronization, sometimes to the same error. (See logs below).

To Reproduce Steps to reproduce the behavior:

Configure 2 node cluster with HOST_A as Sibling and Sync host of HOST_B and vice versa.
POST reports to either of the nodes
Restart one of the nodes
If the error does not occur, repeat restarting the same host. The error occurs quite frequently.

Expected behavior Data is fully synchronized every time.

Screenshots

OS (please complete the following information):

OS: Debian
Version: stretch

Additional context Log output of HOST_A and HOST_B. After restarting HOST_B HOST_A logs attempt 1 (no error): Synchronizing DBs to: IP_HOST_B:4001, will notify on callback url: http://IP_HOST_B:8085/?command=syncDone Synchronizing DBs to: IP_HOST_B:4001 was completed. Synced 3785 entries. Synchronizing DBs callback to: http://IP_HOST_B:8085/?command=syncDone was successful

HOST_B logs attempt 1 (no error): WforceWebserver launched on 0.0.0.0:8085 Accepting control connections on 127.0.0.1:4004 Launched UDP sibling replication listener on IP_HOST_B:4001 Launched TCP sibling replication listener on IP_HOST_B:4001 checkSyncHosts: uptime: 55935 returned from sync host: http://IP_HOST_A:8085/?command=stats checkSyncHosts: Successful request to synchronize DBs with sync host: IP_HOST_A:8085 New TCP Replication connection from IP_HOST_A Received 3785 Replication entries from IP_HOST_A

Log output of HOST_A and HOST_B. Second restart produces error:

HOST_A logs attempt 2 (error occurs): Synchronizing DBs to: IP_HOST_B:4001, will notify on callback url: http://IP_HOST_B:8085/?command=syncDone _Synchronizing DBs to: IP_HOST_B:4001 did not complete. [Network Error: Writing to a socket: Connection reset by peer]_ Synchronizing DBs callback to: http://IP_HOST_B:8085/?command=syncDone was successful

HOST_B logs attempt 2 (error occurs): Launched UDP sibling replication listener on IP_HOST_B:4001 Launched TCP sibling replication listener on IP_HOST_B:4001 checkSyncHosts: uptime: 56246 returned from sync host: http://IP_HOST_A:8085/?command=stats checkSyncHosts: Successful request to synchronize DBs with sync host: IP_HOST_A:8085 New TCP Replication connection from IP_HOST_A parseTCPReplication: error reading size, got 1 bytes, but was expecting 2 bytes Received 1017 Replication entries from IP_HOST_A

Third restart worked again without error. Output is the same as for attempt 1