Spreads / Spreads.LMDB

Low-level zero-overhead and the fastest LMDB .NET wrapper with some additional native methods useful for Spreads
http://docs.dataspreads.io/spreads/libs/lmdb/api/README.html
Mozilla Public License 2.0
80 stars 9 forks

DUPSORT and DUPFIXED do not work #19

Closed aliostad closed 5 years ago

aliostad commented 5 years ago

Hi,

I am unable to store multiple values against a key. This is a basic operation, am I missing something?

var env = LMDBEnvironment.Create("../../../../../lmdb7");
env.Open();
var stat = env.GetStat();
var key = 10000;
Console.WriteLine("start");
var t = Environment.TickCount;
using (var db = env.OpenDatabase("first_db2", new DatabaseConfig(DbFlags.DuplicatesFixed)))
{
    db.Truncate();
    for (var i = 1; i < 10000; i++)
    {
        db.Put(0, Interlocked.Increment(ref key), TransactionPutOptions.AppendData);
    }
}
Console.WriteLine(Environment.TickCount -t);

I am getting MDB_KEYEXIST: Key/data pair already exists all the time. I also tried it with DuplicatesSort to no avail.

buybackoff commented 5 years ago

It definitely works. How about TransactionPutOptions.AppendDuplicateData instead of TransactionPutOptions.AppendData?
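Applied to the repro above, that is a one-flag change. A sketch, not tested here; it assumes the same API as the original snippet, and adds DbFlags.Create in case the database does not exist yet:

```csharp
var env = LMDBEnvironment.Create("../../../../../lmdb7");
env.Open();
var key = 10000;
using (var db = env.OpenDatabase("first_db2",
    new DatabaseConfig(DbFlags.Create | DbFlags.DuplicatesFixed)))
{
    db.Truncate();
    for (var i = 1; i < 10000; i++)
    {
        // AppendDuplicateData (MDB_APPENDDUP) appends sorted *duplicate values*
        // under one key; AppendData (MDB_APPEND) appends new *keys*, so reusing
        // key 0 with it fails with MDB_KEYEXIST.
        db.Put(0, Interlocked.Increment(ref key), TransactionPutOptions.AppendDuplicateData);
    }
}
```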

buybackoff commented 5 years ago

This is core functionality for my use case, and this lib even offers some advanced features out of the box. E.g. you could compare dupsorted values by their first 16/24/32/64/96/128 bits as uints, which effectively gives you multi-key support. This is done very efficiently on the C side, without marshalling from C to a custom comparer function in C#.

See https://github.com/Spreads/Spreads.LMDB/blob/4f3b4ba78ee171ae260c659110c42df56d990ee9/src/Spreads.LMDB/Database.cs#L47-L85

aliostad commented 5 years ago

Well I have sent a repro which does not work. 🤷‍♂️ So should I use transaction instead?

PS:

I understand this has evolved and I can see a lot of commented-out code and tests. I am grateful that you have shared it, and I appreciate all the work that has gone into going down to the metal to make it super performant.

buybackoff commented 5 years ago

You use TransactionPutOptions.AppendData but for dupsorted you need TransactionPutOptions.AppendDuplicateData

buybackoff commented 5 years ago

Well I have sent a repro which does not work.

Before answering, I double-checked in my code that it does work, both for db and cursor operations.

Whenever you have unexpected behavior, keep this open:

http://www.lmdb.tech/doc/group__mdb.html

and read every tiny detail very carefully, as if it were your employment or marriage contract ;)

But even the docs are not complete. E.g. some cursor operations for dupsorted data depend on the cursor already being positioned at the key via CursorGetOption.Set. Etc...

buybackoff commented 5 years ago

So should I use transaction instead?

You are using an overload without a transaction. If it doesn't work with TransactionPutOptions.AppendDuplicateData then it is a bug, but that case should be covered by the tests.

Transactionless overloads are often handy, but they are actually less performant (for reads at least) because we cache and reset/renew read-only transactions.

Savings for transactionless db.Put are very minimal (two P/Invoke call overheads), so in general prefer the overloads with transactions. The .NET objects for transactions are pooled, so there will be no allocations from that. Also, transactionless overloads obviously cannot be used inside transactions.
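The preferred shape with an explicit transaction could look like this. A sketch, assuming a Put overload that takes the transaction first, as in the other snippets in this thread; not verified against the current API:

```csharp
// One write transaction around the whole loop instead of a
// transactionless Put per item.
using (var txn = env.BeginTransaction())
{
    for (var i = 1; i < 10000; i++)
    {
        // Same key, increasing duplicate values, appended in sorted order.
        db.Put(txn, 0, i, TransactionPutOptions.AppendDuplicateData);
    }
    txn.Commit();
}
```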

buybackoff commented 5 years ago

See https://github.com/Spreads/Spreads.LMDB/blob/4f3b4ba78ee171ae260c659110c42df56d990ee9/test/Spreads.LMDB.Tests/LMDBTests.cs#L677

It is a part of CI and the test passes with the transactionless overload.

buybackoff commented 5 years ago

I understand this has evolved and I can see a lot of commented codes and tests. I am grateful that you have shared it as I appreciate all the work that has gone into going down to the metal to make it super performant.

Thanks!

Cleanup is definitely needed, but most parts are quite stable. Important missing part is only #18.

aliostad commented 5 years ago

OK, I found out. Change the count in that test from 10 to 1000... boom! It seems the limit is 256 items. Not sure if this is by design.

My guess is that only the first byte is stored or checked for duplication.

buybackoff commented 5 years ago

Change the count in that test from 10 to 1000... boom!

It works on my machine even with 10000.

aliostad commented 5 years ago

You are using IntegerDuplicates. I am using DuplicatesSort or DuplicatesFixed. It does not work with them. (It seems only the first byte is stored for those, or something around that.)

        public async Task CouldDeleteDupSorted()
        {
            var path = TestUtils.GetPath();
            var env = LMDBEnvironment.Create(path);

            env.MapSize = 100 * 1024 * 1024;
            env.Open();

            var db = env.OpenDatabase("dupfixed_db",
                new DatabaseConfig(DbFlags.Create | DbFlags.DuplicatesFixed));
            db.Truncate();

            var count = 1000;

            for (var i = 1; i <= count; i++)
            {
                try
                {
                    db.Put(0, i, TransactionPutOptions.AppendDuplicateData);
                }
                catch (Exception e)
                {
                    Console.WriteLine(e.ToString());
                }
            }

            using (var txn = env.BeginReadOnlyTransaction())
            {
                Assert.AreEqual(1, db.AsEnumerable<int, int>(txn).Count());
                foreach (var kvp in db.AsEnumerable<int, int>(txn))
                {
                    Console.WriteLine($"kvp: {kvp.Key} - {kvp.Value}");
                }

                Assert.AreEqual(10, db.AsEnumerable<int, int>(txn, 0).Count());
                foreach (var value in db.AsEnumerable<int, int>(txn, 0))
                {
                    Console.WriteLine("Key0 value: " + value);
                }
            }

            env.Write(txn =>
            {
                db.Delete(txn, 0, 5);
                txn.Commit();
            });
            Console.WriteLine("AFTER DELETE SINGLE DUPSORT");
            using (var txn = env.BeginReadOnlyTransaction())
            {
                Assert.AreEqual(1, db.AsEnumerable<int, int>(txn).Count());
                foreach (var kvp in db.AsEnumerable<int, int>(txn))
                {
                    Console.WriteLine($"kvp: {kvp.Key} - {kvp.Value}");
                }

                Assert.AreEqual(9, db.AsEnumerable<int, int>(txn, 0).Count());
                foreach (var value in db.AsEnumerable<int, int>(txn, 0))
                {
                    Console.WriteLine("Key0 value: " + value);
                }
            }

            env.Write(txn =>
            {
                db.Delete(txn, 0);
                txn.Commit();
            });

            Console.WriteLine("AFTER DELETE ALL DUPSORT");
            using (var txn = env.BeginReadOnlyTransaction())
            {
                Assert.AreEqual(0, db.AsEnumerable<int, int>(txn).Count());
                foreach (var kvp in db.AsEnumerable<int, int>(txn))
                {
                    Console.WriteLine($"kvp: {kvp.Key} - {kvp.Value}");
                }

                Assert.AreEqual(0, db.AsEnumerable<int, int>(txn, 0).Count());
                foreach (var value in db.AsEnumerable<int, int>(txn, 0))
                {
                    Console.WriteLine("Key0 value: " + value);
                }
            }
            db.Dispose();
            await env.Close();
        }
buybackoff commented 5 years ago

Value 255 in little endian: 255 0 0 0
Value 256 in little endian: 0 1 0 0

LMDB compares keys as byte strings. If you do not set IntegerDuplicates then the comparison happens byte by byte, and value 256 is smaller than 255 under that comparison.

Then read the docs:

MDB_APPEND - append the given key/data pair to the end of the database. No key comparisons are performed. This option allows fast bulk loading when keys are already known to be in the correct order. Loading unsorted keys with this flag will cause a MDB_KEYEXIST error.

MDB_APPENDDUP - as above, but for sorted dup data.

The IntegerDuplicates flag exists for a reason. It automatically detects endianness if you happen to be on a big-endian machine.
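The byte-order point is easy to demonstrate in plain C# (runnable on its own; assumes a little-endian machine, which covers all common .NET targets):

```csharp
using System;

// LMDB compares values as raw byte strings. Little-endian ints are not
// byte-lexicographically ordered, which is why MDB_APPENDDUP starts
// failing at 256 without IntegerDuplicates.
var b255 = BitConverter.GetBytes(255); // bytes: 255 0 0 0
var b256 = BitConverter.GetBytes(256); // bytes: 0 1 0 0
Console.WriteLine(string.Join(" ", b255)); // prints "255 0 0 0"
Console.WriteLine(string.Join(" ", b256)); // prints "0 1 0 0"
// Byte-wise, 256 (first byte 0) sorts *before* 255 (first byte 255),
// so appending 256 after 255 is "out of order" and raises MDB_KEYEXIST.
```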

aliostad commented 5 years ago

I see!

Thanks a lot, I see that this has an extra bit flag that makes it work.

Great work and thanks again.

buybackoff commented 5 years ago

You could use DirectBuffer, which supports arbitrary byte lengths.

The example uses UTF8. If you are OK with UTF16, then just cast keyPtr to byte* and use keyString.Length * 2 as the byte length; there is no need to stackalloc a new buffer.

[Test]
        public unsafe void CouldWriteString()
        {
            var path = TestUtils.GetPath();
            var env = LMDBEnvironment.Create(path,
                DbEnvironmentFlags.WriteMap | DbEnvironmentFlags.NoSync);
            env.Open();
            var stat = env.GetStat();

            var db = env.OpenDatabase("first_db", new DatabaseConfig(DbFlags.Create));

            var keyString = "my_string_key";
            var values = new byte[] { 1, 2, 3, 4 };

            env.Write(txn =>
            {
                fixed (char* keyPtr = keyString)
                {
                    var keyUtf8Length = Encoding.UTF8.GetByteCount(keyString);
                    var keyBytes = stackalloc byte[keyUtf8Length];
                    Encoding.UTF8.GetBytes(keyPtr, keyString.Length, keyBytes, keyUtf8Length);
                    var key = new DirectBuffer(keyUtf8Length, keyBytes);

                    var value = new DirectBuffer(values);
                    DirectBuffer value2 = default;

                    using (var cursor = db.OpenCursor(txn))
                    {
                        Assert.IsTrue(cursor.TryPut(ref key, ref value, CursorPutOptions.None));
                    }

                    using (var cursor = db.OpenCursor(txn))
                    {
                        Assert.IsTrue(cursor.TryGet(ref key, ref value2, CursorGetOption.SetKey));
                    }

                    Assert.IsTrue(value2.Span.SequenceEqual(value.Span));

                    txn.Commit();
                }
            });
            db.Dispose();
            env.Close().Wait();
        }
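The UTF16 variant suggested above would shrink the fixed block to something like this. A sketch under the same assumptions as the test: the DirectBuffer(length, byte*) constructor used there, and a .NET string whose chars are UTF16 code units (2 bytes each):

```csharp
// Reinterpret the string's chars as raw bytes instead of transcoding to UTF8.
fixed (char* keyPtr = keyString)
{
    var key = new DirectBuffer(keyString.Length * 2, (byte*)keyPtr);
    // ... then use key exactly as in the UTF8 example:
    // cursor.TryPut / cursor.TryGet, followed by txn.Commit().
}
```

Note the two encodings produce different stored keys, so pick one and stick with it for a given database.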
aliostad commented 5 years ago

Thank you! Best to copy this into the other issue I created, so that it remains for others too.

buybackoff commented 5 years ago

BTW, note that using(var txn = env.BeginTransaction()){...} is faster than env.Write(txn => {...}). In the tests I used the lambda for async support, but if you need the last bits of performance then async support should be disabled and using... is the way to go.

aliostad commented 5 years ago

Thanks! I was about to open a new issue on that... I found that a single write takes 2-3ms, which is really high. OK, I am gonna look at that.

buybackoff commented 5 years ago

Disable async completely if you do not need it. Usually it is not needed unless you make async calls from within transactions. From the readme:

Async support is enabled by default, but could be switched off via LMDBEnvironment.Create(..., disableAsync: true); if not used.

buybackoff commented 5 years ago

Also, by default all writes do 2 fsyncs; even on SSDs that could be several msecs. You need to play with the environment flags. Async env.Write(txn => {...}) cannot take 2-3ms; you are probably using the default flags.

There is no silver bullet, and default fully-sync writes are slow in LMDB as well. It is the reads and the async/no_sync modes that make it so cool.

Read this doc on the mdb_env_open flags very, very carefully; every word there is important: http://www.lmdb.tech/doc/group__mdb.html#ga32a193c6bf4d7d5c5d579e71f22e9340
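Relaxing durability for speed might look roughly like this. A sketch only: the flag names follow the earlier test snippet in this thread, and the disableAsync parameter follows the readme quote above; with these flags a crash can lose the most recent transactions, though committed reads stay consistent:

```csharp
// WriteMap + NoSync skip the per-commit fsyncs that dominate small writes.
var env = LMDBEnvironment.Create(path,
    DbEnvironmentFlags.WriteMap | DbEnvironmentFlags.NoSync,
    disableAsync: true); // async support off if unused, per the readme
env.MapSize = 100 * 1024 * 1024;
env.Open();
```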

aliostad commented 5 years ago

Thanks. Will do. I had some other problems as well: opening transaction in the loop 🤦‍♂️

buybackoff commented 5 years ago

opening transaction in the loop

Always commit or abort a txn before disposing it. Also, if you are using cursors, dispose them before committing and disposing the txn.

Usually it goes like:

using(var txn = env.BeginTransaction())
{
    try
    {
          ....
          txn.Commit();
    }
    catch
    {
          txn.Abort();
          throw;
    }
}

Note that this library has a native binary that does not allocate the full database file on disk on Windows (master branch). The used size increases on demand. You could create a 1TB database while having just several MB of free disk space. Other .NET libraries usually come with the release branch, which allocates the entire file; that is faster but less convenient.

I could bet money that this library is the fastest from the .NET-overheads point of view, but the native binary must be the same to make a fair comparison. I got an email with a comment that is now deleted; there you had this line:

 tx.Put(db, BitConverter.GetBytes(0), BitConverter.GetBytes(i), LightningDB.PutOptions.AppendDuplicateData);

It allocates 2 byte arrays and does conversion work, but that is still many orders of magnitude faster than IO. With the master branch, every IO operation must increase the file size on disk. Try updating the same value instead of appending; then the comparison will be fairer, since no new space will be needed on every operation.
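A fairer benchmark shape along those lines could be sketched as below. Purely illustrative: it reuses the Put-with-transaction style from earlier in this thread, and TransactionPutOptions.None is an assumed default member, not verified against the library:

```csharp
// Overwrite one key/value pair repeatedly so no new disk space is
// needed per operation, isolating per-call overhead from file growth.
using (var txn = env.BeginTransaction())
{
    for (var i = 0; i < 10000; i++)
    {
        db.Put(txn, 0, 42, TransactionPutOptions.None); // assumed default option
    }
    txn.Commit();
}
```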

aliostad commented 5 years ago

Thanks a lot. This has been a lot of help. Hopefully I am now on my way to building my stuff.

buybackoff commented 5 years ago

@aliostad Is your production machine Windows? See #22

aliostad commented 5 years ago

Thanks. It could be either Linux or Windows. I am building a Raft implementation and need LMDB for the log and state.