PowerDNS / weakforced

Anti-Abuse for servers at authentication time
GNU General Public License v3.0

[BUG] wforce daemon stuck in 'warmup' state #387

Closed: sshipway closed this 1 week ago

sshipway commented 1 year ago

Describe the bug
Occasionally, the wforce daemon will get stuck in 'warmup' status; a 'ping' command returns the status "warmup", and this never seems to clear, even after the node has received the data from the other cluster members. The log shows

2023-02-03T06:01:01.142910+00:00 xm1wfd01 wforce[6507]: Received 521256 Replication entries from 10.23.30.122

but status still "warmup"

To Reproduce
This cannot be reliably reproduced on demand, but it often happens when the in-memory database is full and one node is restarted. Clearing out the database (by restarting all the nodes in the cluster, so that none are old enough to be an initial replication source) resolves the issue.

Expected behavior
I would expect the state to return to "ok" after a short while, once the database has been replicated from another node.


Additional context
There are 3 nodes in the cluster, and the in-memory database is large. Monitoring shows it is full at times when the problem hits (although it generally fills within a day anyway). The statsDB configuration is:

-- One-day statsDB: 24 time windows of 3600 seconds each
field_map = {}
field_map["countriesPerMailbox"] = "hll"
field_map["detectedSpam"] = "int"
field_map["failedPasswords"] = "hll"
newStringStatsDB("OneDayDB", 3600, 24, field_map)
ssdb = getStringStatsDB("OneDayDB")
ssdb:twEnableReplication()
ssdb:twSetMaxSize(1000000)
ssdb:twSetv4Prefix(32)

-- One-hour statsDB: 6 time windows of 600 seconds each
field_map = {}
field_map["failedPasswords"] = "hll"
field_map["mailboxesPerIp"] = "hll"
newStringStatsDB("OneHourDB", 600, 6, field_map)
ssdb = getStringStatsDB("OneHourDB")
ssdb:twEnableReplication()
ssdb:twSetMaxSize(500000)
ssdb:twSetv4Prefix(32)

During normal operation, error messages like the following are occasionally logged:

2023-02-03T06:00:06.718974+00:00 xm1wfd01 wforce[6507]: parseReceivedReplicationMsg: max sibling recv queue size (5000) reached - dropping replication msg

However, the max receive queue size does not appear to be configurable.

neilcook commented 1 year ago

So the syncDB flow looks like this: the node that is starting up asks a sync host to send it the current statsDB contents and stays in "warmup" while that happens; the sync host dumps its statsDB entries to the new node as replication messages, and when it has finished it calls the "/?command=syncDone" endpoint on the new node, which is what moves the state from "warmup" back to "ok".

So the first thing to look for when debugging this is the logs of the server sending the sync messages - there should be some logs about why the "/?command=syncDone" endpoint failed. If you can post those here then we can start to see what is going on. In fact anything in the logs of the sending server for the time period in question would be useful.

The second thing I would say is that I would not recommend using wforce servers that are handling live traffic as syncDB servers. My recommendation would be to run separate wforce servers, which never get any traffic but which are full cluster members, as syncDB servers. Why? Well, you've seen for yourself what happens: when synchronising, the statsDBs get locked (in turn), so replication cannot happen, and thus you see replication messages backing up. This is much less of a problem on a server that isn't handling any queries. If they're running on separate hardware, then even better.
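As a rough sketch of that layout (the dedicated node address, the ports and the passwords below are invented for illustration, and the exact addSyncHost() argument order should be checked against the wforce reference manual):

-- Hypothetical dedicated syncDB node at 10.23.30.200: a full cluster member that
-- never receives allow/report traffic from the mail servers.

-- On every node, list all cluster members (including the dedicated node) as siblings,
-- so the dedicated node sees the full replication stream:
setSiblings({"10.23.30.121:4001", "10.23.30.122:4001", "10.23.30.200:4001"})
siblingListener("0.0.0.0:4001")

-- On the query-handling nodes, pull the initial sync from the dedicated node rather
-- than from a busy sibling (argument order here is an assumption; see the manual):
addSyncHost("10.23.30.200:8084", "sync-node-webserver-password",
            "10.23.30.121:8084", "this-node-webserver-password")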

Incidentally, the max receive queue size is configurable: the Lua function setMaxSiblingQueueSize() sets the receive and send queue sizes simultaneously.
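For example, a minimal sketch (the 5000 in the log line quoted earlier is the limit being hit; the larger value here is purely illustrative):

-- Raise the sibling send and receive queue sizes (10000 is an arbitrary example value)
setMaxSiblingQueueSize(10000)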

neilcook commented 1 year ago

BTW I don't know if you're using sharded statsDBs but that will help with any effects of freezes during synchronisation, as only a single shard gets locked at a time rather than the entire DB.

newShardedStringStatsDB() is almost a drop-in replacement for newStringStatsDB().
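As a sketch, the OneHourDB definition from the config above might then become the following; whether newShardedStringStatsDB() takes any additional arguments (e.g. a shard count) is not shown here, so check the wforce documentation before using it:

-- Same arguments as the existing newStringStatsDB() call ("almost" a drop-in replacement)
field_map = {}
field_map["failedPasswords"] = "hll"
field_map["mailboxesPerIp"] = "hll"
newShardedStringStatsDB("OneHourDB", 600, 6, field_map)
ssdb = getStringStatsDB("OneHourDB")
ssdb:twEnableReplication()
ssdb:twSetMaxSize(500000)
ssdb:twSetv4Prefix(32)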

neilcook commented 1 year ago

Hi, any update on this issue? Did you get any logging from the server sending the sync messages? That will contain the most useful hints as to why the server starting up stays in warmup mode; as I said, there should be some logs about why the "/?command=syncDone" endpoint failed.

neilcook commented 1 year ago

Hi Steve, I'd like to debug this issue further but I can't without more information...

sshipway commented 10 months ago

Sorry - I'm not able to get much more in the way of logs for this, as it only occurs during a startup on our largest cluster, and only infrequently :( We are not using the ShardedStringStatsDB, but that looks like a good idea now that we've upgraded to 2.8.0, though it will mean changes to our configuration templates etc. I will try the recommended changes.

neilcook commented 1 week ago

Does this issue still happen or did the changes fix it?

sshipway commented 1 week ago

We are now using the sharded database and the increased queue length, which appears to have fixed our other issue of dropped updates. I've not been able to obtain resources for a sync-source node, so I have not implemented that part. After these changes, the warmup time has dropped from 15+ minutes to about 3 minutes, so I would say this has solved the issue, and the sharded database should be recommended for any larger-scale setup (we have 1,000,000 items in our table). Thanks!