[BUG] SyncDone by sibling seems to fail sometimes

mgehlmann commented 2 years ago

Describe the bug I have set up a wforce 2.6.0 cluster with all nodes configured both as siblings and sync hosts. The replication and sync after restart of a node seem to work, except for the final syncDone callback by the sync host, which sometimes fails.

Restarting wforce on node wforce0 and wforce1 is chosen as sync host: Logs of wforce0:

Starting webserver
Launched TCP sibling replication listener on IP_WFORCE0:4001
checkSyncHosts: uptime: 756 returned from sync host: IP_WFORCE1:8085/?command=stats
checkSyncHosts: Successful request to synchronize DBs with sync host: IP_WFORCE1:8085
New TCP Replication connection from IP_WFORCE1
Received 2 Replication entries from IP_WFORCE1

Logs of wforce0:

# logs of wforce1
Synchronizing DBs to: IP_WFORCE0:4001, will notify on callback url: IP_WFORCE0:8085/?command=syncDone
Synchronizing DBs to: IP_WFORCE0:4001 was completed. Synced 2 entries.
Synchronizing DBs callback to: IP_WFORCE0:8085/?command=syncDone failed due to no parseable result returned [Error: unexpected end of input]

On other occasions it just works without error, which I did not manage to reliably reproduce so far. While testing I would say I have seen the error in about 1/3 of the restarts. It seems to happen randomly.

I have not been able to trigger the error by manually sending syncDone with curl:

# curl -X GET -H "Content-Type: application/json" http://IP_WFORCE0/?command=syncDone  -u wforce:PASS
{"status":"ok"}

Running this in a loop 1000 times right after the error above gave me 1000 {"status":"ok"}, and "syncDone": 1000 in the target node metrics.

Since the sync seems to work, although I have not tested with a significant number of entries in the DB, this does not seem to affect the service too much. At the moment, however, this leads to the syncDone not showing up in the command metrics of the restarted node. So far I have used this as a signal that the sync is done when restarting the cluster. My Ansible task would restart wforce on one node at a time and wait until this node reports that it received a syncDone (judging from the metrics endpoint) before restarting the next node. As far as I can tell, this is the only way of checking if the sync is complete, other than parsing the logs, right?

To Reproduce Steps to reproduce the behavior:

set up wforce 2.6.0 cluster as siblings and sync hosts
restart server A until the error occurs on server B

Expected behavior The syncDone is sent successfully, and the restarted server counts the received command.

Screenshots None

OS (please complete the following information):

Debian
bullseye

Additional context None

neilcook commented 2 years ago

Can I confirm that you don't see this behaviour with 2.4.x, i.e. this is new to 2.6.0?

neilcook commented 2 years ago

Would it be possible for you to run a tcpdump -A between the two hosts so that I can see what is happening with the syncDone command?

neilcook commented 2 years ago

Finally, can you provide timestamps with the logging for:

# logs of wforce1
Synchronizing DBs to: IP_WFORCE0:4001, will notify on callback url: IP_WFORCE0:8085/?command=syncDone
Synchronizing DBs to: IP_WFORCE0:4001 was completed. Synced 2 entries.
Synchronizing DBs callback to: IP_WFORCE0:8085/?command=syncDone failed due to no parseable result returned [Error: unexpected end of input]

as there is a timeout involved and I want to see if that is being triggered.

neilcook commented 2 years ago

Finally, can you try using "http://" syntax in the addSyncHost command?

neilcook commented 2 years ago

At the moment, however, this leads to the syncDone not showing up in the command metrics of the restarted node.

This part concerns me. The syncDone command is extremely simple, it simply returns a hardcoded string and increments the metrics. The fact that the metrics aren't being incremented indicates that IP_WFORCE0 isn't receiving the syncDone command. That's why the tcpdump would be very helpful.

neilcook commented 2 years ago

I've tried replicating this issue BTW, and I cannot, despite restarting the server hundreds of times.

mgehlmann commented 2 years ago

Would it be possible for you to run a tcpdump -A between the two hosts so that I can see what is happening with the syncDone command? This is on wforce1:

# tcpdump -i eth1 port 8085
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
08:26:54.196705 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [S], seq 3977319922, win 64240, options [mss 1460,sackOK,TS val 1120728698 ecr 0,nop,wscale 7], length 0
08:26:54.196755 IP WFORCE1_IP.8085 > WFORCE0_IP.58242: Flags [S.], seq 3433429847, ack 3977319923, win 65160, options [mss 1460,sackOK,TS val 3515645648 ecr 1120728698,nop,wscale 7], length 0
08:26:54.196910 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [.], ack 1, win 502, options [nop,nop,TS val 1120728698 ecr 3515645648], length 0
08:26:54.197006 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [P.], seq 1:148, ack 1, win 502, options [nop,nop,TS val 1120728698 ecr 3515645648], length 147
08:26:54.197035 IP WFORCE1_IP.8085 > WFORCE0_IP.58242: Flags [.], ack 148, win 508, options [nop,nop,TS val 3515645648 ecr 1120728698], length 0
08:26:54.197285 IP WFORCE1_IP.8085 > WFORCE0_IP.58242: Flags [P.], seq 1:837, ack 148, win 508, options [nop,nop,TS val 3515645648 ecr 1120728698], length 836
08:26:54.197422 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [.], ack 837, win 501, options [nop,nop,TS val 1120728699 ecr 3515645648], length 0
08:26:54.197596 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [F.], seq 148, ack 837, win 501, options [nop,nop,TS val 1120728699 ecr 3515645648], length 0
08:26:54.197662 IP WFORCE1_IP.8085 > WFORCE0_IP.58242: Flags [F.], seq 837, ack 149, win 508, options [nop,nop,TS val 3515645649 ecr 1120728699], length 0
08:26:54.197714 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [S], seq 2393579706, win 64240, options [mss 1460,sackOK,TS val 1120728699 ecr 0,nop,wscale 7], length 0
08:26:54.197746 IP WFORCE1_IP.8085 > WFORCE0_IP.58244: Flags [S.], seq 3894595852, ack 2393579707, win 65160, options [mss 1460,sackOK,TS val 3515645649 ecr 1120728699,nop,wscale 7], length 0
08:26:54.197817 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [.], ack 838, win 501, options [nop,nop,TS val 1120728699 ecr 3515645649], length 0
08:26:54.197940 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [.], ack 1, win 502, options [nop,nop,TS val 1120728699 ecr 3515645649], length 0
08:26:54.198045 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [P.], seq 1:204, ack 1, win 502, options [nop,nop,TS val 1120728699 ecr 3515645649], length 203
08:26:54.198096 IP WFORCE1_IP.8085 > WFORCE0_IP.58244: Flags [.], ack 204, win 508, options [nop,nop,TS val 3515645649 ecr 1120728699], length 0
08:26:54.198105 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [P.], seq 204:461, ack 1, win 502, options [nop,nop,TS val 1120728699 ecr 3515645649], length 257
08:26:54.198263 IP WFORCE1_IP.8085 > WFORCE0_IP.58244: Flags [.], ack 461, win 506, options [nop,nop,TS val 3515645649 ecr 1120728699], length 0
08:26:54.198611 IP WFORCE1_IP.8085 > WFORCE0_IP.58244: Flags [P.], seq 1:161, ack 461, win 506, options [nop,nop,TS val 3515645650 ecr 1120728699], length 160
08:26:54.198760 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [.], ack 161, win 501, options [nop,nop,TS val 1120728700 ecr 3515645650], length 0
08:26:54.198847 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [F.], seq 461, ack 161, win 501, options [nop,nop,TS val 1120728700 ecr 3515645650], length 0
08:26:54.198967 IP WFORCE1_IP.8085 > WFORCE0_IP.58244: Flags [F.], seq 161, ack 462, win 506, options [nop,nop,TS val 3515645650 ecr 1120728700], length 0
08:26:54.199124 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [.], ack 162, win 501, options [nop,nop,TS val 1120728700 ecr 3515645650], length 0
08:26:54.199172 IP WFORCE1_IP.54194 > WFORCE0_IP.8085: Flags [S], seq 3033113515, win 64240, options [mss 1460,sackOK,TS val 3515645650 ecr 0,nop,wscale 7], length 0
08:26:54.199306 IP WFORCE0_IP.8085 > WFORCE1_IP.54194: Flags [R.], seq 0, ack 3033113516, win 0, length 0

Finally, can you provide timestamps with the logging for:

# logs of wforce1
Synchronizing DBs to: IP_WFORCE0:4001, will notify on callback url: IP_WFORCE0:8085/?command=syncDone
Synchronizing DBs to: IP_WFORCE0:4001 was completed. Synced 2 entries.
Synchronizing DBs callback to: IP_WFORCE0:8085/?command=syncDone failed due to no parseable result returned [Error: unexpected end of input]

as there is a timeout involved and I want to see if that is being triggered.

The TCP shows the [.R] within < 3 ms. In the wforce syslog output it happens within the same second (has 1 s resolution).

Finally, can you try using "http://" syntax in the addSyncHost command?

I had omitted the protocol before, the last test with the TCP dump was after adding "http://".

Can I confirm that you don't see this behaviour with 2.4.x, i.e. this is new to 2.6.0?

At the moment I am testing on Bullseye hosts. As I recall, 2.4.x does not have builder support for bullseye. I'll have to get back to you either after manually building it, or compare Buster hosts.

neilcook commented 2 years ago

Ok so a SYN followed immediately by a RESET, ACK means that the destination is not listening on the port. The webserver listeners are started in background threads, and although it seems unlikely, I guess it is possible that the listeners are not setup before the replication finishes. That would also explain why it's random, as it's timing related.

Can you send the full logs for IP_WFORCE0 from startup for a failed run and also for a successful run?

neilcook commented 2 years ago

Ok so a SYN followed immediately by a RESET, ACK means that the destination is not listening on the port. The webserver listeners are started in background threads, and although it seems unlikely, I guess it is possible that the listeners are not setup before the replication finishes. That would also explain why it's random, as it's timing related.

Can you send the full logs for IP_WFORCE0 from startup for a failed run and also for a successful run?

Actually no scratch that, I see the Starting webserver log line, which means the thread is running.

neilcook commented 2 years ago

Ok so a SYN followed immediately by a RESET, ACK means that the destination is not listening on the port. The webserver listeners are started in background threads, and although it seems unlikely, I guess it is possible that the listeners are not setup before the replication finishes. That would also explain why it's random, as it's timing related. Can you send the full logs for IP_WFORCE0 from startup for a failed run and also for a successful run?

Actually no scratch that, I see the Starting webserver log line, which means the thread is running.

Although that still doesn't necessarily mean that drogon has finished initialising and is listening on/accepting connections on that port.

mgehlmann commented 2 years ago

Well, that also explains why this usually does not happen. I have been testing with very little mock data in the SSDB, so the sync is complete within a few milliseconds.

neilcook commented 2 years ago

I finally managed to replicate it myself using the docker image on bullseye. I'm not sure I have a really good fix for this - the only thing I can think of it to add some kind of artificial sleep before triggering the syncDB stuff.

neilcook commented 2 years ago

I found a way to fix this - there is a function to check if the drogon framework is running. See PR #356.

mgehlmann commented 2 years ago

Awesome! Thanks!

PowerDNS / weakforced

[BUG] SyncDone by sibling seems to fail sometimes #358