Closed mgehlmann closed 3 years ago
Thanks for the report, I will look into it. I might need to ask for more information, will let you know.
So can I ask, does this issue only happen when there are active report/allow commands against host A, or can you also reproduce it when there are no active report/allow commands against host A? This will help me track down where the potential issue is.
Once #343 passes CI, can you try building from that, and see if it fixes your issue? I haven't reproduced it, but looking at the code, and your logs, I think it will.
Great, thanks! I'll try and let you know.
You can ignore the fact that the docker image CI fails currently - that's not relevant
I tested and the Syncing works perfectly now! I'll close the issue. Thanks for the quick fix!
JFYI this fix will be in 2.4.1 (due imminently).
Also, a piece of advice, depending on how many entries you have in your stats DBs, you might want to use sharded stats DBs. Not only does that reduce lock contention when reading/writing, but it also reduces the amount of time that each DB is locked when syncing, which can prevent updates from being dropped. If you don't have very big stats DBs (in your example I saw 56K which is pretty small) then don't worry about it.
Great, thanks! 56K was actually only the number of entries until the error occurred. There were a few million in total. (ATM just simulating requests to test it). Currently I'm at 20 shards. Besides the increased memory usage, would you expect issues when increasing the number of shards way further?
Increasing the number of shards won’t dramatically increase memory usage actually. I wouldn’t expect any major issues increasing by a lot, but I also wouldn’t expect any major benefit either. I would say 20 would be ok for a few million entries,
Neil
Sent from my iPhone
On 24 Jun 2021, at 16:26, Mathias Gehlmann @.***> wrote:
Great, thanks! 56K was actually only the number of entries until the error occurred. There were a few million in total. (ATM just simulating requests to test it). Currently I'm at 20 shards. Besides the increased memory usage, would you expect issues when increasing the number of shards way further?
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or unsubscribe.
Describe the bug We have a 2 host cluster, where the two nodes are configured as each others Sibling+SyncHost. Weakforced version: 2.4.0 built with builder.sh
The Sibling replication seems to work flawlessly. After restarting one of the nodes, however, in about half of the cases the synchronization fails after only a portion of the entries have been transferred with this error message:
parseTCPReplication: error reading size, got 1 bytes, but was expecting 2 bytes
Restarting the same node again right after sometimes leads to a full synchronization, sometimes to the same error. (See logs below).
To Reproduce Steps to reproduce the behavior:
Expected behavior Data is fully synchronized every time.
Screenshots
OS (please complete the following information):
Additional context Log output of HOST_A and HOST_B. After restarting HOST_B HOST_A logs attempt 1 (no error):
Synchronizing DBs to: IP_HOST_B:4001, will notify on callback url: http://IP_HOST_B:8085/?command=syncDone
Synchronizing DBs to: IP_HOST_B:4001 was completed. Synced 3785 entries.
Synchronizing DBs callback to: http://IP_HOST_B:8085/?command=syncDone was successful
HOST_B logs attempt 1 (no error):
WforceWebserver launched on 0.0.0.0:8085
Accepting control connections on 127.0.0.1:4004
Launched UDP sibling replication listener on IP_HOST_B:4001
Launched TCP sibling replication listener on IP_HOST_B:4001
checkSyncHosts: uptime: 55935 returned from sync host: http://IP_HOST_A:8085/?command=stats
checkSyncHosts: Successful request to synchronize DBs with sync host: IP_HOST_A:8085
New TCP Replication connection from IP_HOST_A
Received 3785 Replication entries from IP_HOST_A
Log output of HOST_A and HOST_B. Second restart produces error:
HOST_A logs attempt 2 (error occurs):
Synchronizing DBs to: IP_HOST_B:4001, will notify on callback url: http://IP_HOST_B:8085/?command=syncDone
_Synchronizing DBs to: IP_HOST_B:4001 did not complete. [Network Error: Writing to a socket: Connection reset by peer]
_Synchronizing DBs callback to: http://IP_HOST_B:8085/?command=syncDone was successful
HOST_B logs attempt 2 (error occurs):
Launched UDP sibling replication listener on IP_HOST_B:4001
Launched TCP sibling replication listener on IP_HOST_B:4001
checkSyncHosts: uptime: 56246 returned from sync host: http://IP_HOST_A:8085/?command=stats
checkSyncHosts: Successful request to synchronize DBs with sync host: IP_HOST_A:8085
New TCP Replication connection from IP_HOST_A
parseTCPReplication: error reading size, got 1 bytes, but was expecting 2 bytes
Received 1017 Replication entries from IP_HOST_A
Third restart worked again without error. Output is the same as for attempt 1