Closed: route closed this issue 3 months ago
Hi @route
The code which produces the log is here:
if (dictIsRehashing(spdb->m_pdict) || dictIsRehashing(spdb->m_pdictTombstone)) {
    serverLog(LL_VERBOSE, "NOTICE: Suboptimal snapshot");
}
It just means KeyDB is in the middle of rehashing, so the snapshot being taken is not optimal. It looks like you have continuous changes, i.e. 10K changes every minute, which triggers the save. Perhaps you can reduce this frequency in the config, e.g. by removing "save 60 10000".
To determine what is taking up CPU, you can generate a flamegraph. If it is due to rehashing, you can also try disabling active rehashing, i.e. using the config "activerehashing no".
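To make the effect of a line like "save 60 10000" concrete, here is a minimal sketch of how a Redis-style save point rule is usually evaluated. The names SavePoint and shouldStartSave are hypothetical and are not taken from the KeyDB source. The sketch assumes a save starts when at least the given number of changes has accumulated and more than the given number of seconds has passed since the last save.

```cpp
#include <vector>

// Hypothetical stand-in for one "save <seconds> <changes>" config line.
struct SavePoint {
    long seconds;  // time that must have passed since the last save
    long changes;  // minimum number of changes since the last save
};

// Returns true if any configured save point is satisfied, i.e. a
// background save would be started under the assumed rule.
bool shouldStartSave(const std::vector<SavePoint>& points,
                     long dirty, long secondsSinceLastSave) {
    for (const SavePoint& sp : points) {
        if (dirty >= sp.changes && secondsSinceLastSave > sp.seconds) {
            return true;
        }
    }
    return false;
}
```

Under this assumption, with 10K changes per minute the "save 60 10000" rule is satisfied roughly every minute. With that rule removed, the same write load only triggers a save when one of the remaining, slower save points is reached.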
@keithchew thanks for the reply! Yes, I figured out that under this pressure we have a lot of changes, which trigger the snapshot. We have already turned off snapshots completely to see the impact, and it looks like our database can breathe now.
I'm still a bit confused about how it logs 34 "NOTICE: Suboptimal snapshot" lines within 2 seconds other than by calling the createSnapshot function 34 times; I don't see any loop in there. That looks a bit scary to me, but maybe I'm wrong, C++ is not my strongest skill. It's also a bit unclear to me how a snapshot that is done in a fork can cause the whole db to get stuck.
Thanks for the flamegraph suggestion, I'll try to benchmark it.
You will need to add some logs in the code to see where createSnapshot() is being called from in your scenario. It looks like this method is called from many places, and most of them loop through all DBs, e.g.:
for (int i=0; i<cserver.dbnum; i++) {
    backup->dbarray[i] = g_pserver->db[i]->createSnapshot(LLONG_MAX, false);
}
Having said that, from your logs, I don't think it is the cause of high CPU usage (but do verify with flamegraph), as only 1 process ends up being forked for the save to file...
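As a rough illustration of why a per-DB loop can produce many log lines at once, the sketch below models the two snippets above with hypothetical types (FakeDb, createSnapshotSketch and snapshotAll are not KeyDB names). It assumes only what the snippets show: the log line is emitted once per createSnapshot() call while that database is rehashing, and the caller invokes it once per database.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a database; only the rehashing state matters here.
struct FakeDb {
    bool rehashing;
};

// Mirrors the check quoted earlier: one log line per snapshot taken
// while the database's dict is rehashing.
void createSnapshotSketch(const FakeDb& db, std::vector<std::string>& log) {
    if (db.rehashing) {
        log.push_back("NOTICE: Suboptimal snapshot");
    }
}

// Mirrors a caller that snapshots every database in one pass.
std::vector<std::string> snapshotAll(const std::vector<FakeDb>& dbs) {
    std::vector<std::string> log;
    for (const FakeDb& db : dbs) {
        createSnapshotSketch(db, log);
    }
    return log;
}
```

In this model a single pass over N rehashing databases yields N log lines, so a burst of identical lines does not by itself mean that several saves were layered on top of each other. Whether that is what happened here still needs to be confirmed with added logging.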
Periodically our single-instance KeyDB server gets stuck. It works well, but out of the blue, under pressure, one of its cores goes to 100% and clients disconnect with these errors:
When you try to connect to the server, you will most likely get a timeout. It doesn't crash, but it's not working either. The only suspicious thing I have found so far in the logs with debug mode on is "NOTICE: Suboptimal snapshot".
Update: I turned off snapshotting for now, and it looks like it was the reason the server behaved this way. Is it normal to have more than one "Suboptimal snapshot" line? Look at the timestamps:
My current idea is that snapshots are for some reason layered on top of each other? I just can't explain why there are so many "Suboptimal snapshot" lines within one second of each other if we call it once, and before that we check that there is no currently running child.