Related to PR https://github.com/dragonflydb/dragonfly/pull/3759 and created based on the learnings from issue #3726.
In more detail: LoadKeyValPair loads all the blobs into memory (OpaqueObj) but does not create Valkey objects (i.e. hashes, strings etc). Instead, it delegates object creation to the shard thread via the FlushShardAsync function, which takes batches of Items and creates an object from each of them inside LoadItemsBuffer.
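For orientation, below is a minimal self-contained sketch of that flow. OpaqueObj, Item, ItemsBuf and the function bodies are simplified stand-ins for the real definitions in rdb_load.h, not the actual Dragonfly code:

```cpp
#include <memory>
#include <string>
#include <vector>

// Simplified stand-in: the raw RDB blobs of one entry, not yet a Valkey object.
struct OpaqueObj {
  std::vector<std::string> blobs;
};

struct Item {
  std::string key;
  OpaqueObj val;
};

using ItemsBuf = std::vector<std::unique_ptr<Item>>;

// Runs on the shard thread: materializes a real object from each Item.
// For a huge set this loop is where the CPU time (and the stall) goes.
void LoadItemsBuffer(const ItemsBuf& batch) {
  for (const auto& item : batch) {
    (void)item;  // CreateObject(item->val) in the real code.
  }
}

// Called by the loader: hands a full batch over to the shard thread.
// In Dragonfly this is an async dispatch; shown inline here for brevity.
void FlushShardAsync(ItemsBuf batch) {
  LoadItemsBuffer(batch);
}
```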
It works well, but with huge objects (sets, for example) we need to load all of their blobs into OpaqueObj, which takes up lots of memory, and later stalls the process inside LoadItemsBuffer simply because its CPU is busy creating millions of set fields from the OpaqueObj.
The suggestion here is to break the creation flow into parts. Nothing in the high-level flow should change, I think, but RdbLoader::Item should be able to describe that it is not the whole object but only a part of it, and then LoadItemsBuffer will be able to "append" the additional parts to the already existing object.
Note that the logic today treats "duplicate" keys during load as a logical error, which of course should change when an Item is the continuation of an object that was broken down.
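A minimal sketch of what that could look like, assuming a hypothetical is_continuation field and a plain map as a stand-in for the shard's keyspace; none of this is the actual Dragonfly code:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct OpaqueObj {
  std::vector<std::string> blobs;
};

struct Item {
  std::string key;
  OpaqueObj val;
  // Hypothetical: set when earlier parts of this key were already flushed.
  bool is_continuation = false;
};

// Stand-in for the shard's keyspace; values are just blob lists here.
std::unordered_map<std::string, std::vector<std::string>> db;

void LoadItemsBuffer(const std::vector<Item>& batch) {
  for (const Item& item : batch) {
    auto [it, inserted] = db.try_emplace(item.key);
    if (!inserted && !item.is_continuation) {
      // Today's rule stays for genuine duplicates: a repeated key is a load error.
      continue;  // the real code would report the error
    }
    // The first part creates the object; continuations append to it.
    for (const std::string& blob : item.val.blobs) it->second.push_back(blob);
  }
}
```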
To summarize: we have code that loads a series of blobs into memory (ReadObj, ReadSet, ReadHMap etc) and we have code that creates Dragonfly data structures: RdbLoaderBase::OpaqueObjLoader.
The first milestone for this issue would be to convert ReadSet, for example, into a streaming variant that adds one or more Items, and to make sure that OpaqueObjLoader can handle this correctly.
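A sketch of what a streaming ReadSet might look like; kMaxBlobsPerPart, ReadNextBlob, and EmitItem are made-up names for illustration, and the real code would plug into the existing Item/FlushShardAsync path:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr size_t kMaxBlobsPerPart = 1024;  // caps the memory held per part

// Stub: reads the next set member from the RDB stream.
std::string ReadNextBlob() { return "member"; }

// Stub: wraps the part in an Item (marked as a continuation after the first
// part) and queues it for FlushShardAsync.
void EmitItem(std::vector<std::string> part, bool is_continuation) {
  (void)part;
  (void)is_continuation;
}

void ReadSetStreaming(size_t set_len) {
  std::vector<std::string> part;
  bool first = true;
  for (size_t i = 0; i < set_len; ++i) {
    part.push_back(ReadNextBlob());
    if (part.size() == kMaxBlobsPerPart) {
      // Ship this chunk now: memory stays bounded and the shard thread gets
      // many short jobs instead of one multi-second one.
      EmitItem(std::move(part), /*is_continuation=*/!first);
      part.clear();
      first = false;
    }
  }
  if (!part.empty()) EmitItem(std::move(part), /*is_continuation=*/!first);
}
```

With the cap in place the total work is unchanged, but each flushed batch is bounded, which is the property the summary below relies on.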
Currently RdbLoader::LoadKeyValPair loads all the data from a single entry, even if it's huge (set, map, zset, list etc). This creates the following problems:

- A huge entry keeps the shard thread busy inside FlushShardAsync and will block replication.
- This might trigger DflyCmd::BreakStalledFlowsInShard on the master side, effectively cancelling a possibly correct full sync. If the replica were more responsive, the whole chain would disappear.

Suggestion: change the loading code to streaming in order to support huge sets/lists etc. Allow adding a single entry in multiple parts. While the total CPU time for loading a huge entry will stay the same, the replica won't stall for dozens of seconds if the LoadTrace size is capped.