dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/

Support huge values in RdbLoader #3760

Closed romange closed 3 weeks ago

romange commented 1 month ago

Currently RdbLoader::LoadKeyValPair loads all the data of a single entry, even if it's huge (set, map, zset, list, etc.). This creates the following problems:

  1. A huge spike in memory during the load process (or on the replica), because we need to hold the load traces before we apply them on a shard.
  2. The shard thread will be stuck for many seconds while processing these traces (inside LoadItemsBuffer()), which in turn blocks subsequent calls to FlushShardAsync and blocks replication.
  3. If full-sync replication stalls on the replica, it eventually propagates to the master and blocks its snapshotting flow. At this point we cannot distinguish between a deadlocked replication and a slow one.

This might trigger DflyCmd::BreakStalledFlowsInShard on the master side, effectively cancelling a possibly correct full sync. If the replica were more responsive, this whole chain of problems would disappear.

Suggestion: change the loading code to a streaming approach to support huge sets/lists, etc., allowing a single entry to be added in multiple parts. While the total CPU time for loading a huge entry stays the same, the replica won't stall for dozens of seconds if the LoadTrace size is capped.
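
A minimal sketch of the capped-batch idea; all names here (PartialEntry, kMaxBatchElements, StreamHugeEntry) are invented for illustration, and the real RdbLoader types and limits differ:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical cap on how many elements one in-memory batch may hold.
constexpr size_t kMaxBatchElements = 512;

// Hypothetical stand-in for "a part of an entry": a bounded slice of a huge
// set/list/hash plus flags telling the consumer how to stitch parts together.
struct PartialEntry {
  std::string key;
  std::vector<std::string> elements;
  bool is_first = false;  // first part creates the object on the shard
  bool is_last = false;   // last part completes it
};

// Reads `total` elements of a huge collection and hands them to `sink` in
// capped chunks, so peak memory is bounded by kMaxBatchElements instead of
// the full size of the entry.
void StreamHugeEntry(const std::string& key, size_t total,
                     const std::function<std::string(size_t)>& read_element,
                     const std::function<void(PartialEntry&&)>& sink) {
  PartialEntry part{key, {}, /*is_first=*/true, /*is_last=*/false};
  part.elements.reserve(kMaxBatchElements);
  for (size_t i = 0; i < total; ++i) {
    part.elements.push_back(read_element(i));
    bool last = (i + 1 == total);
    if (part.elements.size() == kMaxBatchElements || last) {
      part.is_last = last;
      sink(std::move(part));
      part = PartialEntry{key, {}, /*is_first=*/false, /*is_last=*/false};
      part.elements.reserve(kMaxBatchElements);
    }
  }
}
```

The shard thread then applies each bounded part on its own, so no single call has to materialize millions of fields at once.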

romange commented 1 month ago

Related to this PR: https://github.com/dragonflydb/dragonfly/pull/3759, and created based on the learnings from issue #3726.

romange commented 1 month ago

In more detail: LoadKeyValPair loads all the blobs into memory (OpaqueObj) but does not create Valkey objects (i.e. hashes, strings, etc.). Instead, it delegates object creation to the shard thread via the FlushShardAsync function, which takes batches of Items and creates an object from each of them inside LoadItemsBuffer.
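
A rough model of that flow, with invented stand-in names (OpaqueBlob, PendingItem, ShardStub); the real OpaqueObj, Item, FlushShardAsync and LoadItemsBuffer are of course more involved:

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in for OpaqueObj: raw blobs parsed from the RDB stream, not yet a
// real hash/set/list object.
struct OpaqueBlob {
  std::vector<std::string> fields;
};

// Stand-in for RdbLoader's Item: one key together with its opaque payload.
struct PendingItem {
  std::string key;
  OpaqueBlob blob;
};

// Stand-in for the shard side. LoadItemsBuffer materializes real objects from
// the opaque blobs; with a huge blob this single call can keep the shard busy
// for seconds, which is exactly the stall described above.
class ShardStub {
 public:
  void LoadItemsBuffer(std::vector<PendingItem> batch) {
    for (auto& item : batch) {
      for (auto& field : item.blob.fields) {
        // ... create/extend the object stored under item.key ...
        (void)field;
      }
    }
  }
};

// Stand-in for FlushShardAsync: in the real code the batch is handed to the
// shard thread asynchronously; here it is just a direct call.
void FlushShardAsync(ShardStub& shard, std::vector<PendingItem> batch) {
  shard.LoadItemsBuffer(std::move(batch));
}
```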

This works well, but with huge objects (sets, for example) we need to load all of their blobs into an OpaqueObj, which takes up a lot of memory, and then the shard thread stalls inside LoadItemsBuffer because its CPU is busy creating millions of set fields from the OpaqueObj.

The suggestion here is to break the creation flow into parts. Nothing in the high-level flow should change, I think, but RdbLoader::Item should be able to describe that it is not the whole object but only a part of it, and then LoadItemsBuffer will be able to "append" the additional parts to the already existing object.

Note that today's logic treats "duplicate" keys during load as a logical error; this should of course change when an Item is the continuation of an object that was broken into parts.
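
A sketch of how the consumer side could tell a genuine duplicate from a continuation; StreamedItem, Keyspace and ApplyItem are invented names, and the real check would live in the LoadItemsBuffer / OpaqueObjLoader path:

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical item shape: `append` marks a continuation of an object that was
// split into parts by the loader.
struct StreamedItem {
  std::string key;
  std::vector<std::string> fields;
  bool append = false;
};

// Toy keyspace (key -> set members), standing in for the shard's real tables.
using Keyspace = std::unordered_map<std::string, std::unordered_set<std::string>>;

// Returns false on a genuine duplicate key; appends when the item continues an
// object created by an earlier part.
bool ApplyItem(Keyspace& db, const StreamedItem& item) {
  bool exists = db.count(item.key) > 0;
  if (exists && !item.append) {
    return false;  // duplicate key during load remains a logical error
  }
  auto& members = db[item.key];  // created by the first part, reused afterwards
  members.insert(item.fields.begin(), item.fields.end());
  return true;
}
```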

To summarize: we have code that loads a series of blobs into memory (ReadObj, ReadSet, ReadHMap, etc.) and we have code that creates the Dragonfly data structures (RdbLoaderBase::OpaqueObjLoader).

The first milestone for this issue would be to convert ReadSet, for example, into a streaming variant that adds one or more Items, and to make sure that OpaqueObjLoader handles this correctly.
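
As a toy end-to-end illustration of that milestone (all names invented; only the shape of the change mirrors the proposal): a chunked ReadSet-like producer feeds an append-aware consumer, and we check that no part exceeds the cap while the final set stays complete.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical "part of a set": at most `cap` members, with an append flag
// that is false only for the first part of a key.
struct SetPart {
  std::vector<std::string> members;
  bool append = false;
};

// Streaming ReadSet sketch: instead of returning one huge object, it emits
// parts of at most `cap` members through `emit`.
void ReadSetStreaming(const std::vector<std::string>& raw_members, size_t cap,
                      const std::function<void(SetPart&&)>& emit) {
  SetPart part;
  part.members.reserve(cap);
  for (size_t i = 0; i < raw_members.size(); ++i) {
    part.members.push_back(raw_members[i]);
    if (part.members.size() == cap || i + 1 == raw_members.size()) {
      emit(std::move(part));
      part = SetPart{};
      part.append = true;  // every subsequent part extends the existing object
      part.members.reserve(cap);
    }
  }
}

int main() {
  std::vector<std::string> raw;
  for (int i = 0; i < 10000; ++i) raw.push_back("member-" + std::to_string(i));

  std::unordered_set<std::string> built;  // stands in for the shard-side object
  size_t max_part = 0;
  ReadSetStreaming(raw, /*cap=*/512, [&](SetPart&& part) {
    max_part = std::max(max_part, part.members.size());
    built.insert(part.members.begin(), part.members.end());
  });

  assert(max_part <= 512);              // memory per part is bounded
  assert(built.size() == raw.size());   // and nothing is lost
  return 0;
}
```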