CoreyKaylor / Lightning.NET

.NET library for LMDB key-value store
Other
397 stars 82 forks source link

Unclear if DupSort is functional. #128

Closed danielcrenna closed 3 years ago

danielcrenna commented 3 years ago

Hello,

I've checked the unit tests for a working example of using dupsort to insert multiple records under the same key. So far, the only behaviour I can see is that all subsequent Put operations, even if using DupSort and the Append option, are ignored, and only the first entry created under the key is found.

My use case is inserting multiple "master" ID keys under the same "type" key, for a count-by-type or fetch-by-type pattern, where the cursor should enumerate multiple values and each value is a unique key back to the detail record.

I suppose a sorted incrementing value could be achieved, for unique keys, but then I'm stuck enumerating all prior values or trusting a persistent value, which both have other drawbacks.

AlgorithmsAreCool commented 3 years ago

@danielcrenna I can take a look, but probably not until Friday (work has been very very worky).

danielcrenna commented 3 years ago

@AlgorithmsAreCool no rush, I already switched to unique keys for this use case, and it works fine. When work isn't worky on my end I will add a unit test to this issue to demonstrate the actual behaviour vs. expected.

AlgorithmsAreCool commented 3 years ago

Taking a look at this now

AlgorithmsAreCool commented 3 years ago

@danielcrenna

Dupsort is a confusing feature, but it is working.

You are on the right track creating the database with the correct flags. The part where I think you are tripping up is on insertion and retrieval.

When a lmdb database is in DupSort mode, the Put operation appends data records by default. That is to say, calling Put twice with the same key will create two records.

See docs here : http://www.lmdb.tech/doc/group__mdb.html#ga4fa8573d9236d54687c61827ebf8cac0

The PutOptions.AppendData flag is only used when you are bulk loading keys into the database. If you have a bunch of already sorted keys, you can use this flag to insert items faster. It causes the engine to jump to the end of the database instead of performance a normal seek, which would typically need to traverse multiple nodes of the B-Tree.

GetMultiple is another special feature, but it may only be used when a database in DUPFIXED mode. DUPFIXED means that the values in a database are all the same length. If your data is all the same length, DUPFIXED + GetMultiple can help you grab multiple values in single call, which can improve performance is you have lots of values per key.

If you are not in DUPFIXED mode, the only way to retrieve multiple values for a key, is to use the cursor api. First you call LightningCursor.Set() to move the cursor to the key you want, then use a do...while loop to move to the next values by calling LightningCursor.NextDuplicate().

Here is a little example

var cursor = tx.CreateCursor(db);

cursor.Set(Encoding.UTF8.GetBytes("Key1"));

do {
    var data = cursor.GetCurrent();

    var currentKey = Encoding.UTF8.GetString(data.key.CopyToNewArray());
    var currentValue = Encoding.UTF8.GetString(data.value.CopyToNewArray());

    Console.WriteLine($"Current Key = {currentKey}, Value = {currentValue}");

} while (cursor.NextDuplicate() == MDBResultCode.Success);

Let me know if you have more trouble or questions.

danielcrenna commented 3 years ago

@AlgorithmsAreCool

Thanks for taking the time to provide a detailed answer.

Yes, I misunderstood the cursor API. I took NextDuplicate to mean something else entirely, as in, there are flags to prevent duplicates, elsewhere, so I assumed that the duplicates feature was specifically about disallowing appending data that has already appeared in the value for the key, literally duplication, as opposed to "duplication" in the sense of more-than-one.

Multiple vs. Duplicate in this case is a bit hard to follow, but I get it now.

Might be worth throwing your answer's code into the Wiki, as it's a useful detail that I couldn't mentally parse from reading the unit tests.

danielcrenna commented 3 years ago

@AlgorithmsAreCool Anecdotally, using separate keys that are prefixed, plus a counter turns out to be far more desirable than multi-valued keys, because LMDB has value size limitations when built in DUP_SORT mode.

I like the idea of never having to track a count and just taking the appended values as the canonical source of counting, but neither scenario avoids enumeration, so it's simpler for maintainers to avoid DUP_SORT in my project.

Thanks again for your help -- feel free to close the issue.

danielcrenna commented 3 years ago

@AlgorithmsAreCool Actually, I just noticed in more detail your comment about DUPFIXED, that seems to be the best performing option since I'm doing a Type->IDs lookup and then traversing the ID list to pull detail records, so your idea to use GetMultiple would be way more efficient in a counting scenario, since all the values are GUIDs, and therefore, of uniform length.

Very useful idea, thank you for sharing.

AlgorithmsAreCool commented 3 years ago

This library mostly copies LMDB names for concepts. This has the advantage of making it easier for people to read lmdb docs directly. But lmdb has some confusing and very fiddly apis, so maybe it is wash.

Your mention of the dup_sort value size limitation made me curious, and i found this comment in the lmdb source.

     *  Data items in an #MDB_DUPSORT database are also limited to
     *  this size, since they're actually keys of a sub-DB.  Keys and
     *  #MDB_DUPSORT data items must fit on a node in a regular page.

There is a way we can recompile lmdb to increase this beyond 511 bytes, but there are warnings in the code about back-compat so we probably won't touch this.