Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License

Silently stuck with all possible connection errors #796

Closed: route closed this issue 3 months ago

route commented 6 months ago

Periodically, our single-instance KeyDB server gets stuck. It works well, but out of the blue, under pressure, one of its cores sits at 100% and clients disconnect with these errors:

244676:244689:M 14 Mar 2024 06:12:22.484 - Error writing to client: Broken pipe
244676:244689:M 14 Mar 2024 06:19:59.999 - Error writing to client: Connection reset by peer

When you try to connect to the server, you most likely get a timeout. It doesn't crash, but it isn't working either. The only suspicious thing I have found so far in the logs (with debug mode on) is NOTICE: Suboptimal snapshot.

Update: I turned off snapshotting for now, and it looks like that was the reason the server behaved this way. Is it normal to have more than one Suboptimal snapshot line? Look at the timestamps:

556180:556192:M 14 Mar 2024 17:27:01.009 * 10000 changes in 60 seconds. Saving...
556180:556192:M 14 Mar 2024 17:27:01.065 * Background saving started by pid 558765
556180:556192:M 14 Mar 2024 17:27:01.065 * Background saving started
558765:556192:C 14 Mar 2024 17:27:01.102 - RDB: 9 MB of memory used by copy-on-write
558765:556192:C 14 Mar 2024 17:27:09.051 - RDB: 158 MB of memory used by copy-on-write
558765:556192:C 14 Mar 2024 17:27:09.214 * DB saved on disk
558765:556192:C 14 Mar 2024 17:27:09.252 * RDB: 158 MB of memory used by copy-on-write
556180:556192:M 14 Mar 2024 17:27:09.339 * Background saving terminated with success
556180:556192:M 14 Mar 2024 17:28:10.008 * 10000 changes in 60 seconds. Saving...
556180:556192:M 14 Mar 2024 17:28:10.064 * Background saving started by pid 558766
556180:556192:M 14 Mar 2024 17:28:10.064 * Background saving started
558766:556192:C 14 Mar 2024 17:28:10.101 - RDB: 9 MB of memory used by copy-on-write
558766:556192:C 14 Mar 2024 17:28:17.853 - RDB: 156 MB of memory used by copy-on-write
558766:556192:C 14 Mar 2024 17:28:18.008 * DB saved on disk
558766:556192:C 14 Mar 2024 17:28:18.045 * RDB: 157 MB of memory used by copy-on-write
556180:556192:M 14 Mar 2024 17:28:18.131 * Background saving terminated with success
556180:556192:M 14 Mar 2024 17:28:25.686 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.687 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.688 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.695 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.697 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.704 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.705 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.706 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.707 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.711 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.733 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.734 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.735 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.736 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.737 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.794 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.796 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.797 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.836 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.837 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:25.838 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.390 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.391 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.394 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.395 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.396 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.397 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.407 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.438 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.440 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.655 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.656 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.668 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:28:26.676 - NOTICE: Suboptimal snapshot
556180:556192:M 14 Mar 2024 17:29:19.002 * 10000 changes in 60 seconds. Saving...
556180:556192:M 14 Mar 2024 17:29:19.058 * Background saving started by pid 558768
556180:556192:M 14 Mar 2024 17:29:19.058 * Background saving started
558768:556192:C 14 Mar 2024 17:29:19.096 - RDB: 11 MB of memory used by copy-on-write
558768:556192:C 14 Mar 2024 17:29:26.747 - RDB: 161 MB of memory used by copy-on-write
558768:556192:C 14 Mar 2024 17:29:26.925 * DB saved on disk
558768:556192:C 14 Mar 2024 17:29:26.963 * RDB: 161 MB of memory used by copy-on-write
556180:556192:M 14 Mar 2024 17:29:27.048 * Background saving terminated with success
556180:556192:M 14 Mar 2024 17:30:28.009 * 10000 changes in 60 seconds. Saving...
556180:556192:M 14 Mar 2024 17:30:28.066 * Background saving started by pid 558769
556180:556192:M 14 Mar 2024 17:30:28.066 * Background saving started
558769:556192:C 14 Mar 2024 17:30:28.103 - RDB: 11 MB of memory used by copy-on-write
558769:556192:C 14 Mar 2024 17:30:36.070 - RDB: 159 MB of memory used by copy-on-write
558769:556192:C 14 Mar 2024 17:30:36.230 * DB saved on disk
558769:556192:C 14 Mar 2024 17:30:36.267 * RDB: 160 MB of memory used by copy-on-write
556180:556192:M 14 Mar 2024 17:30:36.352 * Background saving terminated with success

My current theory is that snapshots are somehow layered on top of each other. I just can't explain why there are so many Suboptimal snapshot lines within a second of each other, given that we call

const redisDbPersistentDataSnapshot *redisDbPersistentData::createSnapshot(uint64_t mvccCheckpoint, bool fOptional)

only once, and before that we check that no child is currently running.

keithchew commented 6 months ago

Hi @route

The code which produces the log is here:

    if (dictIsRehashing(spdb->m_pdict) || dictIsRehashing(spdb->m_pdictTombstone)) {
        serverLog(LL_VERBOSE, "NOTICE: Suboptimal snapshot");
    }

It just means KeyDB is in the middle of rehashing, so it logs that the snapshot being saved is not optimal. It looks like you have continuous changes, i.e. 10K changes every minute, which triggers the save. Perhaps you can reduce this frequency in the config, e.g. by removing "save 60 10000".
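For illustration, a keydb.conf fragment along those lines might look like this (the thresholds are only an example, not a recommendation):

    # Rule that fires a BGSAVE after 10000 changes within 60 seconds:
    # save 60 10000
    #
    # A less aggressive schedule, so background saves run less often under
    # constant write load:
    save 300 10000
    save 900 1
    #
    # Or disable RDB snapshots entirely:
    # save ""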

To determine what is taking up CPU, you can run a flamegraph. Another thing to try, if it turns out to be due to rehashing, is disabling active rehashing with the config setting "activerehashing no".
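A rough sketch of both, using Linux perf together with Brendan Gregg's FlameGraph scripts (the PID, sampling duration, and script paths are placeholders):

    # Sample the KeyDB process for 60 seconds and render a flamegraph
    perf record -F 99 -g -p <keydb-pid> -- sleep 60
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > keydb.svg

    # Disable active rehashing at runtime (can also be set in keydb.conf)
    keydb-cli CONFIG SET activerehashing no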

route commented 6 months ago

@keithchew thanks for the reply! Yes, I figured out that under pressure we have a lot of changes, which triggers snapshots. We have turned off snapshotting completely to see the impact, and it looks like our database can breathe now.
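(For reference, one common way to do that at runtime, assuming KeyDB behaves like Redis here, is to clear the save rules and optionally persist the change:)

    # Remove all RDB save rules so no background snapshots are scheduled
    keydb-cli CONFIG SET save ""
    # Optionally write the change back to keydb.conf
    keydb-cli CONFIG REWRITE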

I'm still a bit confused how it can log 34 "NOTICE: Suboptimal snapshot" lines within 2 seconds other than by calling the createSnapshot function 34 times; I don't see any loop in there. That looks a bit scary to me, but maybe I'm wrong, C++ is not my strongest skill. It's also unclear to me how a snapshot, which is done in a fork, can impact the whole DB to the point of it being stuck.

Thanks for the flamegraph suggestion, I'll try to profile it.

keithchew commented 6 months ago

You will need to add some logging in the code to see where createSnapshot() is being called from in your scenario. This method is called from many places, and most of them loop through all DBs, e.g.:

    for (int i=0; i<cserver.dbnum; i++) {
        backup->dbarray[i] = g_pserver->db[i]->createSnapshot(LLONG_MAX, false);
    }
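As a rough illustration of the kind of logging this suggests (this helper is hypothetical and not part of KeyDB; it assumes a glibc-style backtrace() and could be called at the top of createSnapshot() to see who invoked it):

    #include <execinfo.h>   // backtrace(), backtrace_symbols_fd()
    #include <unistd.h>     // STDERR_FILENO

    // Hypothetical debugging aid: dump the current call stack to stderr so the
    // caller of createSnapshot() can be identified (demangle names with c++filt).
    static void logCreateSnapshotCaller()
    {
        void *frames[16];
        int depth = backtrace(frames, 16);
        backtrace_symbols_fd(frames, depth, STDERR_FILENO);
    }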

Having said that, from your logs I don't think this is the cause of the high CPU usage (but do verify with a flamegraph), as only one process ends up being forked for the save to file...
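One quick way to sanity-check that, assuming KeyDB exposes the same persistence fields as Redis, is to watch the INFO output while the saves are running:

    # Should show rdb_bgsave_in_progress:1 only while a single background save is active
    keydb-cli INFO persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_status'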