PowerDNS / weakforced

Anti-Abuse for servers at authentication time
GNU General Public License v3.0
124 stars 33 forks source link

parseTCPReplication error after restart of a SyncHost [BUG] #342

Closed mgehlmann closed 3 years ago

mgehlmann commented 3 years ago

Describe the bug We have a 2 host cluster, where the two nodes are configured as each others Sibling+SyncHost. Weakforced version: 2.4.0 built with builder.sh

The Sibling replication seems to work flawlessly. After restarting one of the nodes, however, in about half of the cases the synchronization fails after only a portion of the entries have been transferred with this error message: parseTCPReplication: error reading size, got 1 bytes, but was expecting 2 bytes

Restarting the same node again right after sometimes leads to a full synchronization, sometimes to the same error. (See logs below).

To Reproduce Steps to reproduce the behavior:

  1. Configure 2 node cluster with HOST_A as Sibling and Sync host of HOST_B and vice versa.
  2. POST reports to either of the nodes
  3. Restart one of the nodes
  4. If the error does not occur, repeat restarting the same host. The error occurs quite frequently.

Expected behavior Data is fully synchronized every time.

Screenshots

OS (please complete the following information):

Additional context Log output of HOST_A and HOST_B. After restarting HOST_B HOST_A logs attempt 1 (no error): Synchronizing DBs to: IP_HOST_B:4001, will notify on callback url: http://IP_HOST_B:8085/?command=syncDone Synchronizing DBs to: IP_HOST_B:4001 was completed. Synced 3785 entries. Synchronizing DBs callback to: http://IP_HOST_B:8085/?command=syncDone was successful

HOST_B logs attempt 1 (no error): WforceWebserver launched on 0.0.0.0:8085 Accepting control connections on 127.0.0.1:4004 Launched UDP sibling replication listener on IP_HOST_B:4001 Launched TCP sibling replication listener on IP_HOST_B:4001 checkSyncHosts: uptime: 55935 returned from sync host: http://IP_HOST_A:8085/?command=stats checkSyncHosts: Successful request to synchronize DBs with sync host: IP_HOST_A:8085 New TCP Replication connection from IP_HOST_A Received 3785 Replication entries from IP_HOST_A

Log output of HOST_A and HOST_B. Second restart produces error:

HOST_A logs attempt 2 (error occurs): Synchronizing DBs to: IP_HOST_B:4001, will notify on callback url: http://IP_HOST_B:8085/?command=syncDone _Synchronizing DBs to: IP_HOST_B:4001 did not complete. [Network Error: Writing to a socket: Connection reset by peer]_ Synchronizing DBs callback to: http://IP_HOST_B:8085/?command=syncDone was successful

HOST_B logs attempt 2 (error occurs): Launched UDP sibling replication listener on IP_HOST_B:4001 Launched TCP sibling replication listener on IP_HOST_B:4001 checkSyncHosts: uptime: 56246 returned from sync host: http://IP_HOST_A:8085/?command=stats checkSyncHosts: Successful request to synchronize DBs with sync host: IP_HOST_A:8085 New TCP Replication connection from IP_HOST_A parseTCPReplication: error reading size, got 1 bytes, but was expecting 2 bytes Received 1017 Replication entries from IP_HOST_A

Third restart worked again without error. Output is the same as for attempt 1

neilcook commented 3 years ago

Thanks for the report, I will look into it. I might need to ask for more information, will let you know.

neilcook commented 3 years ago

So can I ask, does this issue only happen when there are active report/allow commands against host A, or can you also reproduce it when there are no active report/allow commands against host A? This will help me track down where the potential issue is.

neilcook commented 3 years ago

Once #343 passes CI, can you try building from that, and see if it fixes your issue? I haven't reproduced it, but looking at the code, and your logs, I think it will.

mgehlmann commented 3 years ago

Great, thanks! I'll try and let you know.

neilcook commented 3 years ago

You can ignore the fact that the docker image CI fails currently - that's not relevant

mgehlmann commented 3 years ago

I tested and the Syncing works perfectly now! I'll close the issue. Thanks for the quick fix!

neilcook commented 3 years ago

JFYI this fix will be in 2.4.1 (due imminently).

Also, a piece of advice, depending on how many entries you have in your stats DBs, you might want to use sharded stats DBs. Not only does that reduce lock contention when reading/writing, but it also reduces the amount of time that each DB is locked when syncing, which can prevent updates from being dropped. If you don't have very big stats DBs (in your example I saw 56K which is pretty small) then don't worry about it.

mgehlmann commented 3 years ago

Great, thanks! 56K was actually only the number of entries until the error occurred. There were a few million in total. (ATM just simulating requests to test it). Currently I'm at 20 shards. Besides the increased memory usage, would you expect issues when increasing the number of shards way further?

neilcook commented 3 years ago

Increasing the number of shards won’t dramatically increase memory usage actually. I wouldn’t expect any major issues increasing by a lot, but I also wouldn’t expect any major benefit either. I would say 20 would be ok for a few million entries,

Neil

Sent from my iPhone

On 24 Jun 2021, at 16:26, Mathias Gehlmann @.***> wrote:

 Great, thanks! 56K was actually only the number of entries until the error occurred. There were a few million in total. (ATM just simulating requests to test it). Currently I'm at 20 shards. Besides the increased memory usage, would you expect issues when increasing the number of shards way further?

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or unsubscribe.