estraier / tkrzw

a set of implementations of DBM
Apache License 2.0

Feature Request #44

Open miloszskalecki opened 1 year ago

miloszskalecki commented 1 year ago

Hi Mikio,

I really, really like your database and would like to ask if there is a remote chance of supporting a scenario where the append operation on HashDBM, instead of doing read-modify-write, would just do what update_mode=append does internally: insert a new linked-list entry and update the HashRecord's head, with Rebuild not erasing/reclaiming the linked-list items. HashDBM Get() operations in a new "update_mode=merge" mode would return the collection of all items in the linked list for a given key. This would be equivalent to the RocksDB merge operation, but on a hash database.

Currently we have to use a combination of two databases, TreeDBM and HashDBM, to implement this scenario, but using TreeDBM has its drawbacks (log(n) access, the need for range iterations, tree rebalancing, etc.) and we are keen to stay close to O(1) (well, in the proposed mode this would be O(1) plus enumeration of the linked list). I'm happy to provide more details and look forward to hearing your opinion.
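
For reference, the scenario today with the existing generic append looks roughly like the sketch below, which assumes the standard tkrzw C++ API (HashDBM Open/Append/Get); Append concatenates via read-modify-write internally, which is exactly the cost being discussed here.

#include <iostream>
#include <string>
#include "tkrzw_dbm_hash.h"

int main() {
  tkrzw::HashDBM dbm;
  dbm.Open("events.tkh", true).OrDie();
  // Each Append re-reads the stored value, concatenates, and writes it back.
  dbm.Append("group:42", "event01", "|").OrDie();
  dbm.Append("group:42", "event02", "|").OrDie();
  std::string joined;
  if (dbm.Get("group:42", &joined).IsOK()) {
    std::cout << joined << std::endl;  // prints "event01|event02"
  }
  dbm.Close().OrDie();
  return 0;
}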

Best Regards

estraier commented 1 year ago

Hi,

It's an interesting idea. However, adding a new update mode seems like too much to me. If you'd like to get the whole update history of a record, a new method like the following could satisfy your goal.

Status ProcessRecordHistory(std::string_view key, RecordProcessor* proc);

An example usage would be like this:

class HistoryPrinter : public DBM::RecordProcessor {
 public:
  // To be called for an update to set a value.
  virtual std::string_view ProcessFull(std::string_view key, std::string_view value) {
    std::cout << key << " was set to " << value << std::endl;
    return NOOP;
  }
  // To be called for an update to remove the record.
  virtual std::string_view ProcessEmpty(std::string_view key) {
    std::cout << key << " was removed" << std::endl;
    return NOOP;
  }
};

HistoryPrinter printer;
dbm.ProcessRecordHistory("foobarbaz", &printer);

Methods of the given processor are called in the reverse chronological order of update operations. This could be implemented only for HashDBM (not TreeDBM).
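
For illustration, the same callback interface could collect the history into a list instead of printing it. This is only a sketch: HistoryCollector is a hypothetical name, and ProcessRecordHistory is the proposed method above, which does not exist in tkrzw today.

#include <string>
#include <string_view>
#include <vector>
#include "tkrzw_dbm.h"

class HistoryCollector : public tkrzw::DBM::RecordProcessor {
 public:
  // Called once per past "set" of the key, newest value first.
  std::string_view ProcessFull(std::string_view key, std::string_view value) override {
    values_.emplace_back(value);
    return NOOP;
  }
  // Called when a past operation removed the key.
  std::string_view ProcessEmpty(std::string_view key) override {
    return NOOP;
  }
  const std::vector<std::string>& values() const { return values_; }

 private:
  std::vector<std::string> values_;
};

// Hypothetical usage, if ProcessRecordHistory were added:
//   HistoryCollector collector;
//   dbm.ProcessRecordHistory("foobarbaz", &collector);
//   // collector.values() now holds every past value, newest first.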

I'm not sure how many people would use this kind of niche feature, though. Could you tell me more about your use case?

miloszskalecki commented 1 year ago

Hi, thank you for your prompt reply, and I apologise for my delayed and long response. My use case is a classic time-series problem of organising events into groups. Let's say I have a collection of events/items (e.g. financial chart candlesticks, price ticks, or sensor readings) that, due to the sheer number of entries (billions) and other reasons, cannot be stored as a single KV entry per event/item but must be grouped into a collection of items under a single key. Say we group events into time buckets, where each bucket may contain at most a maximum number of events within the same time period (e.g. 5 minutes, 6 hours, or even 10 years). Logically, there are two types of entries: open groups and closed groups. An open group is the current time window, into which you append events until the window closes; after that the time window becomes a closed group that cannot be modified.

There are two performance goals. First, we need ~O(1) for random closed-group reads (ideally for the open group as well). Second, appending to the open group should not do RMW, ideally just writes. Please also note that the time history is fixed, which means the number of closed groups is constant (e.g. at most 50 closed groups per financial instrument, market, or sensor), so the oldest closed groups will be dropped from the database: a moving time window. The complete group database size is between 1 and 1.5 TB (up to 4 or more shards).

I can see three approaches to implementing the above without modifying tkrzw's code:

A) A single HashDBM database/table with update_mode=in-place, and an append operation on open groups.

B) Two databases/tables: TreeDBM for open groups, HashDBM for closed groups. Events are inserted (via Set or SetMulti) into TreeDBM one per event, not grouped, with each entry's key equal to the event id. Once the time window closes, an iteration over TreeDBM collects all events in that window, which are combined into a single collection/entry, serialized, and inserted into the HashDBM table as a single KV, and the moved TreeDBM entries are removed.

C) A variation of B, but instead of TreeDBM for open groups, HashDBM is used (e.g. as a circular buffer with a position counter kept in a dedicated additional control KV entry, or a simple modulo hash to translate the event id into an index). This is to avoid the new-key inserts, removals, and rebalancing that B requires.

Implementation A is far from ideal as it requires RMW, with a growing number of items and a constantly increasing value block size to accommodate an extra item/event on each RMW/append (I think the value block size would eventually saturate and stay at the longest produced event-list size). It would be OK-ish if the DB were small and didn't require direct I/O (so it would be cached in memory and occasionally sync()-ed), but unfortunately in our case the DB size can exceed available memory.

B is better, as events are just inserted into TreeDBM, but as with any tree there is the log(n) cost, the tree needs to be rebalanced, and multiple pages are updated to insert or remove a key. There is also the additional code we have to maintain to iterate, combine, copy to another table, and then remove from TreeDBM.

C is close to what we need performance-wise, but again it requires this additional code to assemble events from the open group and copy them into another table, plus handling some interesting side effects.
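
To make approach B concrete, here is a minimal sketch of the roll-up step, assuming open-group event keys of the form "<group_id>:<event_id>" so that a lexical Jump() lands on the group's first event; the key scheme and the RollUpClosedGroup name are illustrative, and serialization is reduced to plain concatenation.

#include <memory>
#include <string>
#include <vector>
#include "tkrzw_dbm_hash.h"
#include "tkrzw_dbm_tree.h"

void RollUpClosedGroup(tkrzw::TreeDBM* open_groups, tkrzw::HashDBM* closed_groups,
                       const std::string& group_id) {
  const std::string prefix = group_id + ":";
  std::string combined;
  std::vector<std::string> moved_keys;
  auto iter = open_groups->MakeIterator();
  iter->Jump(prefix);
  std::string key, value;
  // Collect every event of the closed window into one serialized block.
  while (iter->Get(&key, &value).IsOK() && key.rfind(prefix, 0) == 0) {
    combined.append(value);
    moved_keys.push_back(key);
    iter->Next();
  }
  // One O(1) record per closed group in the hash table.
  closed_groups->Set(group_id, combined);
  // Drop the per-event entries from the tree.
  for (const auto& k : moved_keys) {
    open_groups->Remove(k);
  }
}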

What I'm proposing is a new HashDBM merge operation that would append without RMW and employ what update_mode=append does internally: put the new value at the head of the linked list. Get(key) would return all appended values as a list. A bonus feature would be a full merge to materialize the linked list into a single block of data. I will take a closer look at the tkrzw code to see whether there is a way to implement what we're looking for as is (e.g. with RecordProcessor) or whether extending the tkrzw code would be a better solution.
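
Sketched as an interface, the proposal might look something like the following; none of these methods exist in tkrzw today, and the names are purely illustrative.

#include <string>
#include <string_view>
#include <vector>
#include "tkrzw_dbm.h"

// Hypothetical method signatures for the proposed HashDBM "merge" mode;
// nothing here exists in tkrzw today.
class ProposedMergeApi {
 public:
  // Append a value as a new linked-list entry without read-modify-write,
  // reusing what the appending update mode already does internally.
  virtual tkrzw::Status Merge(std::string_view key, std::string_view value) = 0;

  // Return every value appended under the key, newest first.
  virtual tkrzw::Status GetMerged(std::string_view key,
                                  std::vector<std::string>* values) = 0;

  // "Full merge": materialize the linked list into a single record.
  virtual tkrzw::Status CompactMerged(std::string_view key) = 0;

  virtual ~ProposedMergeApi() = default;
};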

I appreciate your feedback

estraier commented 1 year ago

Hi,

Let me clarify my idea in my first reply. As you know, HashDBM in appending mode records all update operations in the file as linked lists. So, the following code stores ("foo", "event01"), ("foo", "event02") in the file.

HashDBM dbm;
dbm.Open(...);
dbm.Set("foo", "event01");
dbm.Set("foo", "event02");

Then, if a method like "ProcessRecordHistory" accepts arbitrary callback functions, you can get all past events of the same key in the reverse order. The following code would print "foo was set to event02" and "foo was set to event01".

class HistoryPrinter : public DBM::RecordProcessor {
 public:
  // To be called for an update to set a value.
  virtual std::string_view ProcessFull(std::string_view key, std::string_view value) {
    std::cout << key << " was set to " << value << std::endl;
    return NOOP;
  }
  // To be called for an update to remove the record.
  virtual std::string_view ProcessEmpty(std::string_view key) {
    std::cout << key << " was removed" << std::endl;
    return NOOP;
  }
};

HistoryPrinter printer;
dbm.ProcessRecordHistory("foo", &printer);

I guess this could satisfy your requirement. However, please note that the above ProcessRecordHistory method causes random read accesses over the whole file, which is slow if the file is huge.

miloszskalecki commented 7 months ago

Hi Mikio

Apologies for going quiet. What we're planning to do is modify the HashDBM code to implement an efficient append, in a similar fashion to how it is implemented in TinyDBM: instead of GET/SET, append at the value offset if the appended value fits in the current record space, or otherwise create a new, resized record and copy+append. I also created C#/.NET Tkrzw bindings and would be happy to share them if you are interested.
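
An illustrative sketch of that append strategy (not tkrzw's or TinyDBM's actual internals; RecordSlot and AppendToSlot are made-up names): write into the record's existing padding when the new data fits, otherwise reallocate a larger record and copy.

#include <cstring>
#include <string>

struct RecordSlot {
  std::string buffer;  // serialized key+value region, including alignment padding
  size_t used = 0;     // bytes of buffer actually occupied
};

// Appends 'data' to the slot; returns true if the fast in-place path was taken.
bool AppendToSlot(RecordSlot* slot, const std::string& data) {
  if (slot->used + data.size() <= slot->buffer.size()) {
    // Fast path: the padded record already has room, so this is a single
    // write at the end of the existing value data.
    std::memcpy(&slot->buffer[slot->used], data.data(), data.size());
    slot->used += data.size();
    return true;
  }
  // Slow path: allocate a larger record (with fresh padding) and copy+append,
  // analogous to re-serializing into a resized record.
  std::string grown((slot->used + data.size()) * 2, '\0');
  std::memcpy(&grown[0], slot->buffer.data(), slot->used);
  std::memcpy(&grown[slot->used], data.data(), data.size());
  slot->buffer = std::move(grown);
  slot->used += data.size();
  return false;
}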

If I haven't said it already, I'm really impressed by how Tkrzw library is designed and implemented

Regards,

Milosz

estraier commented 7 months ago

Did you read my previous response?

I guess that retrieving past values is the most efficient way to realize an appendable list on HashDBM's appending mode.

dbm.Set("foo", "abc"); dbm.Set("foo", "def"); auto& trace = dbm.Trace("foo"); string value; while (trace.Get(&value)) { cout << value << endl; }

If this code prints "abc" and "def", you can regard the record as a list, can't you?

miloszskalecki commented 7 months ago

Hi Mikio,

Yes, I did. I also created a small prototype to test it and there are two side effects to this approach:

Unfortunately, frequent rebuilds are not an option for us. I went back to the drawing board, spent quite some time going through the Tkrzw code, and realized the current implementation of HashDBM is already close to what we need; perhaps the only thing missing is a small modification so that the in-place Append works the same way the TinyDBM append works now (TinyRecord::ReserializeAppend() to be precise, which appends at the end of the key+value data).

To recap, this is essentially what we do (I'm going to use a non-financial example; in our case the measurements are price candles). Let's say we have 100 threads persisting measurements for 500 sensors each, giving us a total of 50k sensors, each producing a measurement once per second. A single measurement payload is 100 B. The last 24h of measurement history is kept per sensor.

Requirements:

  1. Organize measurements into 5-minute blocks per sensor for efficient delivery; no one is interested in reading a single measurement, it is always 5 minutes' worth of measurements per sensor.
  2. A block read should be O(1), via a single GET call, with no TreeDBM iteration nor multiple random HashDBM reads to assemble the block; a single block's data should be read sequentially from disk.
  3. No database rebuilds.
  4. Appending a measurement to the current open 5-minute block should be a single write at the end of the block; the current in-place append is implemented as GET, append data at the end, then SET, which gives N×N (quadratic) cost as the block grows.
  5. No hash record resizes should be done on append; ideally we want to pre-allocate space for a full block when the record is created (this is already achievable with align_pow).
  6. Use the free block pool: once the history saturates, for each new measurement block created, the oldest block (24h history) will be dropped and its space should be returned to the free block pool.

Capacity stats based on above:

I created a prototype; the hash database params (off the top of my head) were num_buckets=30M, align_pow=15, cache_buckets=1, fbp_count=100. The free block pool count is the same as the number of threads, and all free blocks are serialized on db close (the serialized slot count is, according to the code and docs, equal to 1008 / (OffsetWidth + 4)).
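
As a sanity check on those parameters, here is a back-of-the-envelope sizing derived purely from the numbers stated above (one 100 B measurement per second, 5-minute blocks, 50k sensors, 24h of history, align_pow=15), ignoring per-record header overhead.

#include <cstdint>
#include <iostream>

int main() {
  constexpr int64_t kMeasurementBytes = 100;          // payload per measurement
  constexpr int64_t kMeasurementsPerBlock = 5 * 60;   // one per second, 5-minute block
  constexpr int64_t kBlockBytes = kMeasurementBytes * kMeasurementsPerBlock;  // 30,000 B
  constexpr int64_t kAlignedSlotBytes = 1 << 15;      // align_pow=15 -> 32,768 B slots
  constexpr int64_t kSensors = 50000;
  constexpr int64_t kBlocksPerSensorPerDay = 24 * 60 / 5;  // 288 blocks per sensor per 24h
  constexpr int64_t kLiveRecords = kSensors * kBlocksPerSensorPerDay;  // 14.4M records
  std::cout << "block payload " << kBlockBytes << " B fits one "
            << kAlignedSlotBytes << " B aligned slot\n";
  std::cout << "live records: " << kLiveRecords
            << " (num_buckets=30M leaves headroom)\n";
  std::cout << "data size for 24h: " << kLiveRecords * kAlignedSlotBytes / (1 << 30)
            << " GiB\n";
  return 0;
}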

It works as expected: once the history saturates (each sensor has reported measurements for at least 24h; the prototype simulates this by generating dummy data), the free block pool does its job and the database behaves as a ring buffer and does not grow.

The only spanner in the works is that doing an in-place append for each measurement results in N×N (quadratic) cost. I can amortize it by batching, i.e. throttling appends so that one append covers, say, 10 measurements. One of our C++ devs will take a look to see whether we can modify the HashDBM code ourselves to add efficient append functionality and, perhaps, raise a PR as a feature suggestion for you to review.
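
A minimal sketch of that amortization, assuming the generic DBM::Append and fixed-size measurement payloads; BatchedAppender and its batch size are illustrative, not anything in tkrzw.

#include <string>
#include <unordered_map>
#include <vector>
#include "tkrzw_dbm_hash.h"

class BatchedAppender {
 public:
  explicit BatchedAppender(tkrzw::HashDBM* dbm, size_t batch_size = 10)
      : dbm_(dbm), batch_size_(batch_size) {}

  // Buffer a measurement for its block key; flush when the batch is full.
  void Add(const std::string& block_key, const std::string& measurement) {
    auto& pending = pending_[block_key];
    pending.push_back(measurement);
    if (pending.size() >= batch_size_) {
      Flush(block_key);
    }
  }

  // One Append (one read-modify-write) per batch instead of per measurement.
  void Flush(const std::string& block_key) {
    auto it = pending_.find(block_key);
    if (it == pending_.end() || it->second.empty()) {
      return;
    }
    std::string chunk;
    for (const auto& m : it->second) {
      chunk.append(m);  // fixed-size payloads, so plain concatenation suffices
    }
    dbm_->Append(block_key, chunk);
    it->second.clear();
  }

 private:
  tkrzw::HashDBM* dbm_;
  size_t batch_size_;
  std::unordered_map<std::string, std::vector<std::string>> pending_;
};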

Let us investigate that possibility and come back to you once we have something concrete.

Kind Regards