Closed mgehlmann closed 2 years ago
Can I confirm that you don't see this behaviour with 2.4.x, i.e. this is new to 2.6.0?
Would it be possible for you to run a tcpdump -A between the two hosts so that I can see what is happening with the syncDone command?
Finally, can you provide timestamps with the logging for:
# logs of wforce1
Synchronizing DBs to: IP_WFORCE0:4001, will notify on callback url: IP_WFORCE0:8085/?command=syncDone
Synchronizing DBs to: IP_WFORCE0:4001 was completed. Synced 2 entries.
Synchronizing DBs callback to: IP_WFORCE0:8085/?command=syncDone failed due to no parseable result returned [Error: unexpected end of input]
as there is a timeout involved and I want to see if that is being triggered.
Finally, can you try using "http://" syntax in the addSyncHost command?
At the moment, however, this leads to the syncDone not showing up in the command metrics of the restarted node.
This part concerns me. The syncDone command is extremely simple: it returns a hardcoded string and increments the metrics. The fact that the metrics aren't being incremented indicates that IP_WFORCE0 isn't receiving the syncDone command. That's why the tcpdump would be very helpful.
I've tried replicating this issue BTW, and I cannot, despite restarting the server hundreds of times.
Would it be possible for you to run a tcpdump -A between the two hosts so that I can see what is happening with the syncDone command?
This is on wforce1:
# tcpdump -i eth1 port 8085
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
08:26:54.196705 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [S], seq 3977319922, win 64240, options [mss 1460,sackOK,TS val 1120728698 ecr 0,nop,wscale 7], length 0
08:26:54.196755 IP WFORCE1_IP.8085 > WFORCE0_IP.58242: Flags [S.], seq 3433429847, ack 3977319923, win 65160, options [mss 1460,sackOK,TS val 3515645648 ecr 1120728698,nop,wscale 7], length 0
08:26:54.196910 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [.], ack 1, win 502, options [nop,nop,TS val 1120728698 ecr 3515645648], length 0
08:26:54.197006 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [P.], seq 1:148, ack 1, win 502, options [nop,nop,TS val 1120728698 ecr 3515645648], length 147
08:26:54.197035 IP WFORCE1_IP.8085 > WFORCE0_IP.58242: Flags [.], ack 148, win 508, options [nop,nop,TS val 3515645648 ecr 1120728698], length 0
08:26:54.197285 IP WFORCE1_IP.8085 > WFORCE0_IP.58242: Flags [P.], seq 1:837, ack 148, win 508, options [nop,nop,TS val 3515645648 ecr 1120728698], length 836
08:26:54.197422 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [.], ack 837, win 501, options [nop,nop,TS val 1120728699 ecr 3515645648], length 0
08:26:54.197596 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [F.], seq 148, ack 837, win 501, options [nop,nop,TS val 1120728699 ecr 3515645648], length 0
08:26:54.197662 IP WFORCE1_IP.8085 > WFORCE0_IP.58242: Flags [F.], seq 837, ack 149, win 508, options [nop,nop,TS val 3515645649 ecr 1120728699], length 0
08:26:54.197714 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [S], seq 2393579706, win 64240, options [mss 1460,sackOK,TS val 1120728699 ecr 0,nop,wscale 7], length 0
08:26:54.197746 IP WFORCE1_IP.8085 > WFORCE0_IP.58244: Flags [S.], seq 3894595852, ack 2393579707, win 65160, options [mss 1460,sackOK,TS val 3515645649 ecr 1120728699,nop,wscale 7], length 0
08:26:54.197817 IP WFORCE0_IP.58242 > WFORCE1_IP.8085: Flags [.], ack 838, win 501, options [nop,nop,TS val 1120728699 ecr 3515645649], length 0
08:26:54.197940 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [.], ack 1, win 502, options [nop,nop,TS val 1120728699 ecr 3515645649], length 0
08:26:54.198045 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [P.], seq 1:204, ack 1, win 502, options [nop,nop,TS val 1120728699 ecr 3515645649], length 203
08:26:54.198096 IP WFORCE1_IP.8085 > WFORCE0_IP.58244: Flags [.], ack 204, win 508, options [nop,nop,TS val 3515645649 ecr 1120728699], length 0
08:26:54.198105 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [P.], seq 204:461, ack 1, win 502, options [nop,nop,TS val 1120728699 ecr 3515645649], length 257
08:26:54.198263 IP WFORCE1_IP.8085 > WFORCE0_IP.58244: Flags [.], ack 461, win 506, options [nop,nop,TS val 3515645649 ecr 1120728699], length 0
08:26:54.198611 IP WFORCE1_IP.8085 > WFORCE0_IP.58244: Flags [P.], seq 1:161, ack 461, win 506, options [nop,nop,TS val 3515645650 ecr 1120728699], length 160
08:26:54.198760 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [.], ack 161, win 501, options [nop,nop,TS val 1120728700 ecr 3515645650], length 0
08:26:54.198847 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [F.], seq 461, ack 161, win 501, options [nop,nop,TS val 1120728700 ecr 3515645650], length 0
08:26:54.198967 IP WFORCE1_IP.8085 > WFORCE0_IP.58244: Flags [F.], seq 161, ack 462, win 506, options [nop,nop,TS val 3515645650 ecr 1120728700], length 0
08:26:54.199124 IP WFORCE0_IP.58244 > WFORCE1_IP.8085: Flags [.], ack 162, win 501, options [nop,nop,TS val 1120728700 ecr 3515645650], length 0
08:26:54.199172 IP WFORCE1_IP.54194 > WFORCE0_IP.8085: Flags [S], seq 3033113515, win 64240, options [mss 1460,sackOK,TS val 3515645650 ecr 0,nop,wscale 7], length 0
08:26:54.199306 IP WFORCE0_IP.8085 > WFORCE1_IP.54194: Flags [R.], seq 0, ack 3033113516, win 0, length 0
Finally, can you provide timestamps with the logging for:
# logs of wforce1
Synchronizing DBs to: IP_WFORCE0:4001, will notify on callback url: IP_WFORCE0:8085/?command=syncDone
Synchronizing DBs to: IP_WFORCE0:4001 was completed. Synced 2 entries.
Synchronizing DBs callback to: IP_WFORCE0:8085/?command=syncDone failed due to no parseable result returned [Error: unexpected end of input]
as there is a timeout involved and I want to see if that is being triggered.
The TCP dump shows the [R.] within less than 3 ms. In the wforce syslog output it all happens within the same second (syslog has 1 s resolution).
Finally, can you try using "http://" syntax in the addSyncHost command?
I had omitted the protocol before. The last test, the one with the TCP dump, was run after adding "http://".
Can I confirm that you don't see this behaviour with 2.4.x, i.e. this is new to 2.6.0?
At the moment I am testing on Bullseye hosts. As I recall, 2.4.x does not have builder support for Bullseye. I'll have to get back to you either after building it manually or after comparing on Buster hosts.
Ok, so a SYN followed immediately by a RESET, ACK means that the destination is not listening on the port. The webserver listeners are started in background threads, and although it seems unlikely, I guess it is possible that the listeners are not set up before the replication finishes. That would also explain why it's random, as it's timing related.
Can you send the full logs for IP_WFORCE0 from startup for a failed run and also for a successful run?
Actually no, scratch that: I see the "Starting webserver" log line, which means the thread is running.
Although that still doesn't necessarily mean that drogon has finished initialising and is listening on/accepting connections on that port.
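The failure mode discussed here (a connect attempt answered straight away with a reset because nothing is accepting on the port yet) can be observed from the client side. The following is only a minimal Python sketch of that idea, not anything wforce does: `wait_for_listener` is a hypothetical helper that keeps retrying a plain TCP connect until the port accepts or a deadline passes.

```python
import socket
import time


def wait_for_listener(host, port, timeout=5.0, interval=0.05):
    """Return True once a TCP connect to (host, port) succeeds, False on timeout.

    Hypothetical helper for illustration; not part of wforce.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A SYN answered by RST surfaces here as ConnectionRefusedError,
            # which is a subclass of OSError.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Called against the callback port of the restarted node, this would return False for as long as the listener has not started accepting connections.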
Well, that also explains why this usually does not happen. I have been testing with very little mock data in the SSDB, so the sync is complete within a few milliseconds.
I finally managed to replicate it myself using the docker image on bullseye. I'm not sure I have a really good fix for this. The only thing I can think of is to add some kind of artificial sleep before triggering the syncDB stuff.
I found a way to fix this - there is a function to check if the drogon framework is running. See PR #356.
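The general shape of such a fix, gating the sync request on a "framework is running" check instead of sleeping for a fixed time, can be sketched as below. This is an illustration in Python under stated assumptions, not the code from the PR: `Listener`, `is_running` and `request_sync_when_ready` are hypothetical names, and the startup delay merely simulates a listener that initialises in a background thread.

```python
import threading
import time


class Listener:
    """Stand-in for a web framework whose listener starts in a background thread."""

    def __init__(self, startup_delay):
        self._startup_delay = startup_delay
        self._running = threading.Event()

    def start(self):
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        time.sleep(self._startup_delay)  # simulated initialisation work
        self._running.set()              # from here on, callbacks can be accepted

    def is_running(self):
        return self._running.is_set()


def request_sync_when_ready(listener, request_sync, poll_interval=0.01, timeout=5.0):
    """Poll is_running() and only then ask the sync host for a DB sync.

    Returns False without calling request_sync if the listener never comes up.
    """
    deadline = time.monotonic() + timeout
    while not listener.is_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    request_sync()
    return True
```

Compared with an artificial sleep, polling a readiness flag waits only as long as needed and cannot fire before the listener is able to receive the callback.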
Awesome! Thanks!
Describe the bug
I have set up a wforce 2.6.0 cluster with all nodes configured both as siblings and sync hosts. The replication and sync after restart of a node seem to work, except for the final syncDone callback by the sync host, which sometimes fails.
Restarting wforce on node wforce0, and wforce1 is chosen as sync host:
Logs of wforce0:
Logs of wforce0:
On other occasions it just works without error, which I did not manage to reliably reproduce so far. While testing I would say I have seen the error in about 1/3 of the restarts. It seems to happen randomly.
I have not been able to trigger the error by manually sending syncDone with curl:
Running this in a loop 1000 times right after the error above gave me 1000 {"status":"ok"}, and "syncDone": 1000 in the target node metrics.
Since the sync seems to work, although I have not tested with a significant number of entries in the DB, this does not seem to affect the service too much. At the moment, however, this leads to the syncDone not showing up in the command metrics of the restarted node. So far I have used this as a signal that the sync is done when restarting the cluster. My Ansible task would restart wforce on one node at a time and wait until this node reports that it received a syncDone (judging from the metrics endpoint) before restarting the next node. As far as I can tell, this is the only way of checking if the sync is complete, other than parsing the logs, right?
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The syncDone is sent successfully, and the restarted server counts the received command.
Screenshots
None
OS (please complete the following information):
Additional context
None
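For reference, the rolling-restart check described in the report (wait until the restarted node's metrics show a received syncDone before moving on) might look roughly like the Python sketch below. The layout of the metrics document is an assumption: the report only shows a `"syncDone": 1000` entry, so the sketch searches for that key at any depth. `fetch_metrics` is a hypothetical callable supplied by the caller that returns the parsed metrics.

```python
import time


def find_counter(obj, name):
    """Depth-first search for an integer value stored under key `name`."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == name and isinstance(value, int):
                return value
            found = find_counter(value, name)
            if found is not None:
                return found
    elif isinstance(obj, list):
        for item in obj:
            found = find_counter(item, name)
            if found is not None:
                return found
    return None


def wait_for_sync_done(fetch_metrics, timeout=60.0, interval=1.0):
    """Poll metrics until a syncDone count of at least 1 appears, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        count = find_counter(fetch_metrics(), "syncDone") or 0
        if count >= 1:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The timeout matters for exactly the case reported here: if the callback is lost, the counter never increments, and a check without a deadline would wait forever.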