PowerDNS / weakforced

Anti-Abuse for servers at authentication time
GNU General Public License v3.0

[BUG] wforce daemon stuck in 'warmup' state #387

Closed: sshipway closed this 1 week ago

sshipway commented 1 year ago

Describe the bug
Occasionally, the wforce daemon will get stuck in 'warmup' status; a 'ping' command returns the status "warmup", and this never seems to clear, even after the node has received the data from the other cluster members. The log shows

2023-02-03T06:01:01.142910+00:00 xm1wfd01 wforce[6507]: Received 521256 Replication entries from 10.23.30.122

but status still "warmup"

To Reproduce
This cannot be reliably reproduced on demand, but it often happens when the in-memory database is full and one node is restarted. Clearing out the database (by restarting all the nodes in the cluster, so that none are old enough to be an initial replication source) resolves the issue.

Expected behavior
I would expect the state to return to "ok" after a short while, once the database has been replicated from another node.


Additional context
There are 3 nodes in the cluster, and the in-memory database is large. Monitoring shows it is full at times when the problem hits (although it generally fills within a day anyway). The statsDB configuration is:

-- One-day statsDB: 24 time windows of 3600 seconds each
field_map = {}
field_map["countriesPerMailbox"] = "hll"
field_map["detectedSpam"] = "int"
field_map["failedPasswords"] = "hll"
newStringStatsDB("OneDayDB", 3600, 24, field_map)
ssdb = getStringStatsDB("OneDayDB")
ssdb:twEnableReplication()
ssdb:twSetMaxSize(1000000)
ssdb:twSetv4Prefix(32)

-- One-hour statsDB: 6 time windows of 600 seconds each
field_map = {}
field_map["failedPasswords"] = "hll"
field_map["mailboxesPerIp"] = "hll"
newStringStatsDB("OneHourDB", 600, 6, field_map)
ssdb = getStringStatsDB("OneHourDB")
ssdb:twEnableReplication()
ssdb:twSetMaxSize(500000)
ssdb:twSetv4Prefix(32)

During normal operation, error messages like the following are occasionally logged:

2023-02-03T06:00:06.718974+00:00 xm1wfd01 wforce[6507]: parseReceivedReplicationMsg: max sibling recv queue size (5000) reached - dropping replication msg

However, the max receive queue size does not appear to be configurable.

neilcook commented 1 year ago

So the syncDB flow looks like this: the node that is starting up asks a sync host to send it the current statsDB contents and stays in "warmup" while that happens; the sync host dumps its statsDB entries to the new node as replication messages, and when it has finished it calls the "/?command=syncDone" endpoint on the new node, which is what moves the state from "warmup" back to "ok".

So the first thing to look for when debugging this is the logs of the server sending the sync messages - there should be some logs about why the "/?command=syncDone" endpoint failed. If you can post those here then we can start to see what is going on. In fact anything in the logs of the sending server for the time period in question would be useful.

The second thing I would say is that I would not recommend using wforce servers that are handling live traffic as syncDB servers. My recommendation would be to run separate wforce servers, which never get any traffic but which are full cluster members, as syncDB servers. Why? Well, you've seen for yourself what happens: when synchronising, the statsDBs get locked (in turn), so replication cannot happen, and thus you see replication messages backing up. This is much less of a problem on a server that isn't handling any queries. If they're running on separate hardware, then even better.
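As a rough sketch of that layout (the dedicated node address, the ports and the passwords below are invented for illustration, and the exact addSyncHost() argument order should be checked against the wforce reference manual):

-- Hypothetical dedicated syncDB node at 10.23.30.200: a full cluster member that
-- never receives allow/report traffic from the mail servers.

-- On every node, list all cluster members (including the dedicated node) as siblings,
-- so the dedicated node sees the full replication stream:
setSiblings({"10.23.30.121:4001", "10.23.30.122:4001", "10.23.30.200:4001"})
siblingListener("0.0.0.0:4001")

-- On the query-handling nodes, pull the initial sync from the dedicated node rather
-- than from a busy sibling (argument order here is an assumption; see the manual):
addSyncHost("10.23.30.200:8084", "sync-node-webserver-password",
            "10.23.30.121:8084", "this-node-webserver-password")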

Incidentally, the max receive queue size is configurable: the Lua function setMaxSiblingQueueSize() sets the receive and send queue sizes simultaneously.
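For example, a minimal sketch (the 5000 in the log line quoted earlier is the limit being hit; the larger value here is purely illustrative):

-- Raise the sibling send and receive queue sizes (10000 is an arbitrary example value)
setMaxSiblingQueueSize(10000)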

neilcook commented 1 year ago

BTW I don't know if you're using sharded statsDBs but that will help with any effects of freezes during synchronisation, as only a single shard gets locked at a time rather than the entire DB.

newShardedStringStatsDB() is almost a drop-in replacement for newStringStatsDB().
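As a sketch, the OneHourDB definition from the config above might then become the following; whether newShardedStringStatsDB() takes any additional arguments (e.g. a shard count) is not shown here, so check the wforce documentation before using it:

-- Same arguments as the existing newStringStatsDB() call ("almost" a drop-in replacement)
field_map = {}
field_map["failedPasswords"] = "hll"
field_map["mailboxesPerIp"] = "hll"
newShardedStringStatsDB("OneHourDB", 600, 6, field_map)
ssdb = getStringStatsDB("OneHourDB")
ssdb:twEnableReplication()
ssdb:twSetMaxSize(500000)
ssdb:twSetv4Prefix(32)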

neilcook commented 1 year ago

Hi, any update on this issue? Did you get any logging from the server sending the sync messages? That will contain the most useful hints as to why the server starting up stays in warmup mode; as I said, there should be some logs about why the "/?command=syncDone" endpoint failed.

neilcook commented 1 year ago

Hi Steve, I'd like to debug this issue further but I can't without more information...

sshipway commented 10 months ago

Sorry - I'm not able to get much more in the way of logs for this, as it only occurs during a startup on our largest cluster, and only infrequently :( We are not using the ShardedStringStatsDB, but that looks like a good idea now that we've upgraded to 2.8.0, though it will mean changes to our configuration templates etc. I will try the recommended changes.

neilcook commented 1 week ago

Does this issue still happen or did the changes fix it?

sshipway commented 1 week ago

We are now using the sharded database and the increased queue length, which appears to have fixed our other issue of dropped updates. I've not been able to obtain resources for a sync-source node, so I have not implemented that part. After these changes, the warmup time has dropped from 15+ minutes to about 3 minutes, so I would say this has solved the issue, and the sharded database should be recommended for any larger-scale setup (we have 1,000,000 items in our table). Thanks!