CoreyKaylor / Lightning.NET

.NET library for LMDB key-value store
Other
397 stars 82 forks source link

Process crash when iteratively writing 'batches' of values via cursor in transaction #155

Closed 4deeptech closed 1 year ago

4deeptech commented 1 year ago

This code enumerates and PUTS via cursor


        public void Append(IEnumerable<EventEnvelope> envList)
        {
            long version = 0;
            using (var tx = _env.BeginTransaction())
            using (var db = tx.OpenDatabase(_dbName, new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }))
            {
                var (resultCode, versionKey, versionValue) = tx.Get(db, Encoding.UTF8.GetBytes("masterStreamVersion"));
                if (resultCode == MDBResultCode.NotFound)
                {
                    version = 0;
                }
                else if (resultCode == MDBResultCode.Success)
                {
                    version = BitConverter.ToInt64(versionValue.CopyToNewArray(), 0);
                }
                using var cursor = tx.CreateCursor(db);
                {
                    foreach (EventEnvelope env in envList)
                    {
                        version += 1;
                        var eventResult = cursor.Put(Encoding.UTF8.GetBytes("masterStream_" + version), 
                            Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(env, _serializxerSettings)),
                            CursorPutOptions.None);
                        if (eventResult != MDBResultCode.Success)
                        {
                            tx.Abort();
                            throw new Exception("LightningDB cursor put exception");
                        }
                    }
                    var versionResult = cursor.Put(Encoding.UTF8.GetBytes("masterStreamVersion"), BitConverter.GetBytes(version), CursorPutOptions.None);
                    if (versionResult != MDBResultCode.Success)
                    {
                        tx.Abort();
                        throw new Exception("LightningDB write version exception");
                    }
                    var trxResult = tx.Commit();
                    if (trxResult != MDBResultCode.Success)
                    {
                        throw new Exception("LightningDB append commit exception");
                    }
                }
            }
        }

If I call this Append method in a loop 3K times with about 15 messages in the enumeration it will crash the process. However, If I call it once with say 50K items it works fine. As a sanity check I ran the loop 2K times and that succeeded.

This is my first time using LightningDB/lmdb so I may be doing something wrong. There is a key/value containing a version number that is maintained/set after each 'batch' written in the same transaction. Note that the size of the DB is the same in both scenarios and the looping is all done in the same thread like this:

for(int i=0;i<3000;i++)
{
    Append(arrayWith15OrSoItems); //crashes around iteration 2040 in a unit test
}

vs

Append(arrayWith50000Items); //works fine with same type payload in a unit test
AlgorithmsAreCool commented 1 year ago

Interesting. From a quick glance this code looks correct. I'll take a deeper look at this today or tomorrow.

What kind of exception is being thrown?

4deeptech commented 1 year ago

No dotnet exceptions thrown, just crashes the process. When running as an xunit test it shows the following:

"The active test run was aborted. Reason: Test host process crashed."

AlgorithmsAreCool commented 1 year ago

Lemme try to repro.

Btw, what OS is this happening on?

AlgorithmsAreCool commented 1 year ago

Confirmed a repro. My test console application dies after 2039 iterations of calling append() with no observable exception.

4deeptech commented 1 year ago

Wow, that's exactly where mine was happening. I apologize for not putting the OS details and version. Windows 11, .NET 6, LightningDB 0.14.1. Thanks for spinning up a repro sample. I guess this is a good test case to include into the test suite. I guess I'm not crazy and doing something wrong ;) Wonder if this happens on other OS's. What I was working on may end up running in a linux VM at some point but way too early in my prototype for that.

AlgorithmsAreCool commented 1 year ago

No, it's definitely not just you. There is a crash happening deep in the call stack internal to LMDB. In fact it is faulting in ntdll, probably due to some type of resource exhaustion.

I can see the following error in windows event viewer

Faulting application name: lightningnet-155.exe, version: 1.0.0.0, time stamp: 0x634eda42
Faulting module name: ntdll.dll, version: 10.0.19041.2130, time stamp: 0xb5ced1c6
Exception code: 0xc0000374
Fault offset: 0x00000000000ff6a9
Faulting process id: 0x5e40
Faulting application start time: 0x01d926da7e3c35a2
Faulting application path: D:\dev\lightningnet-155\bin\Debug\net7.0\lightningnet-155.exe
Faulting module path: C:\windows\SYSTEM32\ntdll.dll
Report Id: 4b2b1262-b76b-4923-bf05-f47c1164b6ae
Faulting package full name: 
Faulting package-relative application ID: 

But what is very weird is the program crashes, but I can't observe the exception in VS2022. But when i use WinDBG, the program does not crash and concludes 5000 iteractions successfully. I am still trying to determined what the issue is exactly.

Historically, LMDB is extremely reliable as long as it is used correctly, so i would bet this is a problem with a invalid config. But I still trying to figure out what exactly...

AlgorithmsAreCool commented 1 year ago

Observations

4deeptech commented 1 year ago

My first attempt at using the LightningDB.net I had it writing one item at a time and didn't get good throughput. So I looked in the unit tests to see if there was a bulk way to get data in and saw the cursor approach so I tried that and got better throughput in my unit tests and thought I was good to go. Then when I wired this into my prototype project it went to hell once I ramped up the number of items in my simulation. I have an Akka actor sending the chunks of items to insert over to the writer actor. Poof, went up in smoke.

How does the transaction work if you don't commit?

Is there a better way to batch stuff without using cursors?

AlgorithmsAreCool commented 1 year ago

@4deeptech Yep, you only need to use cursors for some of the fancier features like assigning multiple values to a single key (which can be pretty useful for advanced scenarios).

To perform a batch insert simply create a transaction and then call tx.Put(db, ...) as many times as needed and then call tx.Commit(). You can get pretty great write performance this way, millions of records per second depending on disk performance. You can use some documented but dangerous hacks to get faster, but they introduce the possibility of corruption if there is a crash, so they are not recommended.

On the up side, avoiding the use of cursors fixes this issue and In can make unbounded writes as long as the map size is adequate.

I am still trying to pin down the current issue, it is related to disposal of the database handle. I'm wondering if there is a leak somewhere.

AlgorithmsAreCool commented 1 year ago

How does the transaction work if you don't commit?

If do not commit, the data written during the transaction is voided and released when the transaction is disposed. You MUST dispose the transaction in all cases. It will be obvious if you fail to do so because there is only 1 write transaction allowed at a time. This means writes to LMDB are effectively single threaded (guarded by LMDB internal locking). In practice it isn't a big deal for most scenarios because the write latency so we don't care too much, but some very write intensive workloads would benefit from use of a LSM-tree based database or an append-only style database like FASTER.

AlgorithmsAreCool commented 1 year ago

After some fiddling, I managed to get WinDBG to hook the process mid crash. It looks like a genuine native LMDB bug involving a corrupt memory heap. I'll work on minimizing a repro and reporting this upstream.

PROCESS_NAME:  lightningnet-155.exe

ERROR_CODE: (NTSTATUS) 0xc0000374 - A heap has been corrupted.

SYMBOL_NAME:  ntdll!RtlpFreeHeapInternal+0

MODULE_NAME: ntdll

IMAGE_NAME:  ntdll.dll

FAILURE_BUCKET_ID:  HEAP_CORRUPTION_ACTIONABLE_BlockNotBusy_DOUBLE_FREE_c0000374_ntdll.dll!RtlpFreeHeapInternal

FAILURE_ID_HASH:  {f9e860eb-b03f-7415-804c-7e671e26c730}

@4deeptech In the meantime, the recommended workaround is to not use cursors. It sounds like you just need to do some bulk writes. As I mentioned above, LMDB transactions support an unbounded number of Put operations per-transaction. So you don't need to use a cursor to fulfill this use case.

Another side note, you don't actually need to re-open the database handle each iteration. You can safely open a handle once per application and then reuse it for all transactions. This has a positive impact on performance. Each time you open a DB handle in LMDB, it performs a linear scan of all the databases in the environment. If your environment has lots of databases (dozens or more...) the time taken to open a DB starts to add up.

There are some other optimization opportunities i see in your example code pertaining to string operations and memory allocations, but idk how much of a performance win they would be compared to the IO operation.

Let me know if you'd like some example code, I'd be happy to help.

AlgorithmsAreCool commented 1 year ago

@4deeptech Here is a minimal repro program of this bug. When you get some time could you run this in your environment to ensure it still crashes?

using System.Text;
using LightningDB;

//this is a single file formatted C# program

//executions begins here
Console.WriteLine("Hello, World!");

using var environment = CreateFreshEnvironment("C:/dev/temp");
environment.Open();

using var db = OpenDatabase(environment, "testDb");

//this loop fails after 2039 iterations
for (int i = 0; i < 5000; i++) {
    ReproCoreIteration(environment, db);
    Console.WriteLine($"Completed iteration {i}");
}

static void ReproCoreIteration(LightningEnvironment environment, LightningDatabase db)
{
    using var tx = environment.BeginTransaction();//auto-disposed at end of scope
    using var cursor = tx.CreateCursor(db); //auto-disposed at end of scope

    var guid = Guid.NewGuid().ToString();
    var guidBytes = Encoding.UTF8.GetBytes(guid);

    _ = cursor.Put(
        guidBytes,
        guidBytes,
        CursorPutOptions.None
    );

    var trxResult = tx.Commit();

    if (trxResult != MDBResultCode.Success)
        throw new Exception("LightningDB append commit exception");

    //program crashes here during transaction dispose after exactly 2039 iterations
    //Note : Commit() call is successful, failure happens after commit when transaction is disposed
}

static LightningDatabase OpenDatabase(LightningEnvironment environment, string dbName)
{
    using var tx = environment.BeginTransaction();
    var db = tx.OpenDatabase(dbName, new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create });

    tx.Commit(); //ensure database is created

    return db;
}

static LightningEnvironment CreateFreshEnvironment(string path)
{
    if (Directory.Exists(path))
    {
        Directory.Delete(path, true);
        Console.WriteLine("Cleaned up previous directory");
    }

    return new LightningEnvironment(path, new EnvironmentConfiguration {
        MapSize = 100_000_000,
        MaxDatabases = 2
    });
}

Project

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework> <!-- other runtimes should also repro -->
    <RootNamespace>lightningnet_155_minimal</RootNamespace>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>

<ItemGroup>
  <PackageReference Include="lightningdb" Version="0.14.1" />
</ItemGroup>

</Project>
4deeptech commented 1 year ago

... Completed iteration 2034 Completed iteration 2035 Completed iteration 2036 Completed iteration 2037 Completed iteration 2038

D:\bootstrap_projects\LightningDBErrorRepro\LightningDBErrorRepro\bin\Debug\net6.0\LightningDBErrorRepro.exe (process 3652) exited with code -1073740940.

AlgorithmsAreCool commented 1 year ago

I'm going to try updating the version of LMDB we depend on, but I am not hopeful about there being a fix already made since I have tried searching their bug tracker and haven't seen one that matches this description.

So this might take a while to resolve since it is an upstream fix.

4deeptech commented 1 year ago

@AlgorithmsAreCool Sounds good. Thanks for verifying the issue and suggesting a workable alternative!

pepelev commented 1 year ago

Hi!

I have done research on this issue and I think I know the root of the problem.

First of all, I set up Windows error reporting (the documentation says this doesn't work for .net applications, but it's not true) to get a crash dump. When the dump was collected, after fiddling with the symbols, I got the following picture of the stacktrace of the main thread: stack-trace

This looks like cursor freed twice.

Then i altered LightningCursor.Dispose(bool) to count free calls like this:

Code ```csharp private long freed; private void Dispose(bool disposing) { if (_handle == IntPtr.Zero) return; if (!disposing) throw new InvalidOperationException("The LightningCursor was not disposed and cannot be reliably dealt with from the finalizer"); Interlocked.Increment(ref freed); mdb_cursor_close(_handle); _handle = IntPtr.Zero; Transaction.Disposing -= Dispose; GC.SuppressFinalize(this); } ```

and collect dump again. The dump showed that the fatal call mdb_cursor_close(_handle); is the first in this instance of LightningCursor.

Another interesting thing is that if i dispose the cursor before committing the transaction, the application will not crash.

Code ```csharp Console.WriteLine("Hello, World!"); using var environment = CreateFreshEnvironment("C:/dev/temp"); environment.Open(); using var db = OpenDatabase(environment, "testDb"); //this loop fails after 2039 iterations for (int i = 0; i < 5000; i++) { ReproCoreIteration(environment, db); Console.WriteLine($"Completed iteration {i}"); } static void ReproCoreIteration(LightningEnvironment environment, LightningDatabase db) { using var tx = environment.BeginTransaction();//auto-disposed at end of scope using (var cursor = tx.CreateCursor(db)) // <------------------------------------------------------- { // <---------------------------------------------------------------------------------------------- var guid = Guid.NewGuid().ToString(); var guidBytes = Encoding.UTF8.GetBytes(guid); _ = cursor.Put( guidBytes, guidBytes, CursorPutOptions.None ); } // <---------------------------------------------------------------------------------------------- var trxResult = tx.Commit(); if (trxResult != MDBResultCode.Success) throw new Exception("LightningDB append commit exception"); //program crashes here during transaction dispose after exactly 2039 iterations //Note : Commit() call is successful, failure happens after commit when transaction is disposed } static LightningDatabase OpenDatabase(LightningEnvironment environment, string dbName) { using var tx = environment.BeginTransaction(); var db = tx.OpenDatabase(dbName, new DatabaseConfiguration { Flags = DatabaseOpenFlags.Create }); tx.Commit(); //ensure database is created return db; } static LightningEnvironment CreateFreshEnvironment(string path) { if (Directory.Exists(path)) { Directory.Delete(path, true); Console.WriteLine("Cleaned up previous directory"); } return new LightningEnvironment(path, new EnvironmentConfiguration { MapSize = 100_000_000, MaxDatabases = 2 }); } ```

Finally, I found that the lmdb documentation on mdb_cursor_open states "A cursor in a write-transaction can be closed before its transaction ends, and will otherwise be closed when its transaction ends.". I am not a C developer, so i can get something wrong here. But it seems mdb_txn_commit calls mdb_cursors_close, which in turn calls free for transaction cursors. mdb_cursor_close also calls free for a cursor.

Therefore, when we use LightningTransaction.Commit() and then LightningCursor.Dispose() we get cursor double freed.

I do not know where the magic number 2039 (suspiciously close to 2048 ж) ) comes from. Perhaps this is the internal specifics of the Windows allocator.

Could you guys try to dispose the cursor before committing and check if the problem is reproducing?

sebastienros commented 1 year ago

To confirm your hypothesis:

1- I disposed the cursor before committing the transaction. It worked for the 5000 iterations. 2- I then changed the Dispose method of the Cursor to not call mdb_cursor_close if the transaction has been committed. And committed the transaction before disposing the cursor (like your initial repro), and it also fixed the problem.

        private void Dispose(bool disposing)
        {
            if (_handle == IntPtr.Zero)
                return;

            if (!disposing)
                throw new InvalidOperationException("The LightningCursor was not disposed and cannot be reliably dealt with from the finalizer");

            // If the transaction is committed the cursor is already closed
            if (Transaction.State != LightningTransactionState.Commited)
            {
                mdb_cursor_close(_handle);
            }

            _handle = IntPtr.Zero;

            Transaction.Disposing -= Dispose;

            GC.SuppressFinalize(this);
        }

So I agree with your hypothesis, and this fix should solve the problem. Looking at the c code committing the transaction will already close the cursor, so the c# wrapper should not do that again.

sebastienros commented 1 year ago

NB: I noticed a huge performance degradation when disposing the cursor before committing the transaction, in case that matters in your use case.

AlgorithmsAreCool commented 1 year ago

This is interesting, I'll re-check my test code. I thought I tested the cursor disposal when I was chasing this last week, Perhaps I didn't.

CoreyKaylor commented 1 year ago

Anybody up for submitting a PR? I think @sebastienros 's approach seems reasonable. Although, I think the intention was Transaction.State == LightningTransactionState.Commited no?

BTW, love all the activity.

sebastienros commented 1 year ago

@CoreyKaylor I think my code is correct, mdb_txn_commit is invoked by Commit() which sets State = LightningTransactionState.Commited.

https://github.com/CoreyKaylor/Lightning.NET/blob/4db175a9df1db628a4b7c369959478e550ec4b1a/src/LightningDB/LightningTransaction.cs#L299-L301

So we need to call mdb_cursor_close(_handle) when Transaction.State != LightningTransactionState.Commited

I checked other places and I think this could also happen when the Database is dropped and then calls mdb_drop

https://github.com/CoreyKaylor/Lightning.NET/blob/4db175a9df1db628a4b7c369959478e550ec4b1a/src/LightningDB/LightningDatabase.cs#L88

Which will then close the cursor for a connection

https://github.com/LMDB/lmdb/blob/mdb.master/libraries/liblmdb/mdb.c#L11113

And I don't think the State is changed so it would also invoke mdb_cursor_close when the transaction is disposed. Maybe the Transaction object needs a flag to record if the associated cursor needs to be closed.

From what I see in other libraries, they call mdb_cursor_close as soon as abort is invoked. That would be possible if a transaction could reference its cursors.

CoreyKaylor commented 1 year ago

Yes, you're right. I read the comment on the line above wrong in relation to the code right below.

That's probably a reasonable suggestion for this scenario to use a flag on the transaction.

CoreyKaylor commented 1 year ago

Please take a look at PR #159. I believe this should solve the problem using @sebastienros suggestion of conditional check. It also coincidentally stabilized one of the flakey tests that I couldn't figure out.

CoreyKaylor commented 1 year ago

Published version 0.15.0 that includes this change.