curiosity-ai / rocksdb-sharp

.net bindings for the rocksdb by facebook

SEHException thrown after disposal and reopen #5

Closed leesei closed 2 years ago

leesei commented 3 years ago

Example code is here: https://github.com/leesei/RocksDBExample

It writes data to RocksDB and reads it back after disposing the handle.
The code works on *nix, but on Windows we get this exception:

Fatal error. System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception.
   at RocksDbSharp.Native.rocksdb_get(IntPtr, IntPtr, Byte[], Int64, IntPtr ByRef, RocksDbSharp.ColumnFamilyHandle)
   at RocksDbSharp.Native.rocksdb_get(IntPtr, IntPtr, Byte[], Int64, RocksDbSharp.ColumnFamilyHandle)
   at RocksDbSharp.RocksDb.Get(Byte[], Int64, RocksDbSharp.ColumnFamilyHandle, RocksDbSharp.ReadOptions)
   at RocksDbSharp.RocksDb.Get(Byte[], RocksDbSharp.ColumnFamilyHandle, RocksDbSharp.ReadOptions)
   at RocksDBExample.Program.Main(System.String[])

fassadlr commented 3 years ago

Hi, we are also seeing this exact problem. Any comments from the developers?

fassadlr commented 3 years ago

@warrenfalk @theolivenbaum

theolivenbaum commented 3 years ago

@leesei apologies, I missed that you had some code to reproduce the issue! @fassadlr which environment are you using? Windows also? I just ran the example provided and it works on my machine 😥

Hello World!
key[1], value:[a]
key[2], value:[b]
key[3], value:[c]
key[1], value:[abc]
key[2], value:[efd]
key[3], value:[xyz]
key[4], value:[85A2494430A84C6F636174696F6E06A84C656973696F6E739187A158CD040CA159CD01E8A157CC91A14848A54C6162656CA56E706D6C79A553636F7265CB3FE0000000000000A4506174689482A158CD0584A159CD015782A158CD0121A159CD01DA82A158CCBCA159CD032482A158CD04B6A159CCBDAE526570726573656E746174697665C2A7536B6970706564C2]
key[5], value:[85A2494430A84C6F636174696F6E06A84C656973696F6E739187A158CD040CA159CD01E8A157CC91A14848A54C6162656CA56E706D6C79A553636F7265CB3FE0000000000000A4506174689482A158CD0584A159CD015782A158CD0121A159CD01DA82A158CCBCA159CD032482A158CD04B6A159CCBDAE526570726573656E746174697665C2A7536B6970706564C2]
key[6], value:[85A2494430A84C6F636174696F6E06A84C656973696F6E739187A158CD040CA159CD01E8A157CC91A14848A54C6162656CA56E706D6C79A553636F7265CB3FE0000000000000A4506174689482A158CD0584A159CD015782A158CD0121A159CD01DA82A158CCBCA159CD032482A158CD04B6A159CCBDAE526570726573656E746174697665C2A7536B6970706564C2]
key[4], length[143], buffer value
key[5], length[143], buffer value
key[6], length[143], buffer value

Close and Reopen Database.

key[4], value:[85A2494430A84C6F636174696F6E06A84C656973696F6E739187A158CD040CA159CD01E8A157CC91A14848A54C6162656CA56E706D6C79A553636F7265CB3FE0000000000000A4506174689482A158CD0584A159CD015782A158CD0121A159CD01DA82A158CCBCA159CD032482A158CD04B6A159CCBDAE526570726573656E746174697665C2A7536B6970706564C2]
key[5], value:[85A2494430A84C6F636174696F6E06A84C656973696F6E739187A158CD040CA159CD01E8A157CC91A14848A54C6162656CA56E706D6C79A553636F7265CB3FE0000000000000A4506174689482A158CD0584A159CD015782A158CD0121A159CD01DA82A158CCBCA159CD032482A158CD04B6A159CCBDAE526570726573656E746174697665C2A7536B6970706564C2]
key[6], value:[85A2494430A84C6F636174696F6E06A84C656973696F6E739187A158CD040CA159CD01E8A157CC91A14848A54C6162656CA56E706D6C79A553636F7265CB3FE0000000000000A4506174689482A158CD0584A159CD015782A158CD0121A159CD01DA82A158CCBCA159CD032482A158CD04B6A159CCBDAE526570726573656E746174697665C2A7536B6970706564C2]
key[4], length[143], buffer value
key[5], length[143], buffer value
key[6], length[143], buffer value

fassadlr commented 3 years ago

Thanks for getting back to me @theolivenbaum. Yes, all of our issues come from Windows environments. We are running a cryptocurrency wallet which uses your datastore in a multithreaded environment (www.stratisplatform.com).

I don't have any text logs, but the issue reported by @leesei is exactly what we see on some Windows environments. I have not experienced this issue personally, and @leesei's code to reproduce also works fine for me.

Here is an example screenshot:

[screenshot: unknown (1)]

dangershony commented 3 years ago

We are experiencing this as well on Windows (@checho1989 @turcol @sondreb): https://github.com/block-core/blockcore/issues/341

leesei commented 3 years ago

From our experience, it may be related to the length of the value. We had to go down one level and use LevelDB (pun intended) in our project due to time constraints. Sorry for not being able to invest more time in reproducing the issue, but I'll keep an eye on it and look forward to a fix.

khalluudi commented 3 years ago

I am experiencing the same issue as well. Loading 10 million rows into rocksdb on an Azure VM with Standard SSD LRS storage. On my local physical machine with normal SSD drive, all 10 million rows are loaded fine. However, on the VM, the exception above is thrown.

It would be great to know what the issue is and if there is a workaround.

khalluudi commented 3 years ago

So I found out that this issue is caused by the Azure VM running out of memory. I am using WriteBatch for the bulk load and was flushing it every 10 million rows. When I reduced the flush batch size to one million rows, it ran without any issues.
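The smaller-batch workaround described above could look roughly like the sketch below. It is an illustration only: `BulkLoader`, `Load`, and `rowsPerBatch` are hypothetical names that do not appear in the thread, and it assumes the `WriteBatch.Put` and `RocksDb.Write` calls from RocksDbSharp. The default of one million rows is simply the value reported to work here, not a recommendation.

```csharp
using System;
using System.Collections.Generic;
using RocksDbSharp;

// Hypothetical helper, not part of RocksDbSharp.
public static class BulkLoader
{
    // rowsPerBatch is a tuning knob: the thread only reports that 1 million
    // rows per flush worked where 10 million ran out of memory.
    public static void Load(
        RocksDb db,
        IEnumerable<(byte[] Key, byte[] Value)> rows,
        int rowsPerBatch = 1_000_000)
    {
        if (rowsPerBatch <= 0)
            throw new ArgumentOutOfRangeException(nameof(rowsPerBatch));

        var batch = new WriteBatch();
        try
        {
            var pending = 0;
            foreach (var (key, value) in rows)
            {
                batch.Put(key, value);
                if (++pending < rowsPerBatch) continue;

                // Flush, then start a fresh batch so the old one can be freed.
                db.Write(batch);
                batch.Dispose();
                batch = new WriteBatch();
                pending = 0;
            }

            // Write whatever is left over after the last full batch.
            if (pending > 0) db.Write(batch);
        }
        finally
        {
            batch.Dispose();
        }
    }
}
```

Counting rows manually, rather than materializing chunks first, keeps only one batch worth of pending data in memory at a time.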

theolivenbaum commented 3 years ago

@leesei I finally have a way to reproduce this - hit it when testing our software on my wife's laptop! Will investigate and come back to you asap!

@khalluudi I think you might have hit a separate issue with the same Interop error code. But nice to hear you found a solution!

theolivenbaum commented 3 years ago

@leesei quick update: we found a few more machines that hit the same issue on our side. I tested it against old versions, and it seems like the issue we have was introduced in version 6.20.3.

6.11.4.12240 works
6.14.5.13874 works
6.15.5.15759 works
6.17.3.16253 works
6.20.3.19177 doesn't work

But I'm not sure this is the same issue that you have, as we hit it on DB.Open already.

In any case, could you do the same test changing the version and see if you can find out which version breaks?

leesei commented 3 years ago

This issue is weirder than I thought.

On the machine that I used to report the issue (A, Windows 10 Home, i7-4930K):

6.11.4.12240 works
6.14.5.13874 works (!!! different behavior: it previously crashed with 6.14.5.13874, the version in https://github.com/leesei/RocksDBExample)
6.15.5.15759 doesn't work
6.17.3.16253 doesn't work
6.20.3.19177 doesn't work

EDIT: shortly afterwards I was unable to reproduce the issue:

6.11.4.12240 works
6.14.5.13874 works
6.15.5.15759 works
6.17.3.16253 works
6.20.3.19177 works

On another machine (B, Windows 10 Home, i7-7800X):

6.11.4.12240 works
6.14.5.13874 works
6.15.5.15759 works
6.17.3.16253 works
6.20.3.19177 works

On yet another machine (C, Windows 10 Home, i9-10900K):

6.11.4.12240 works
6.14.5.13874 works
6.15.5.15759 works
6.17.3.16253 works
6.20.3.19177 works

All machines are up to date in terms of Windows Update. Not sure if this is related, but B has Windows 21H1 installed, and A and C do not. What else can I dump from the environment?

oleg2k commented 2 years ago

It crashes when executing the code below. Setting Compression.No eliminates the crashes (remember to delete the corrupted DB).

using System;
using System.IO;
using RocksDbSharp;

public static class Program
{
    private static readonly byte[] Data = { 0, 0, 0, 1 };

    public static void Main()
    {
        for (var i = 1; i <= 1000; i++)
        {
            Console.Write($"iteration {i}...");
            RocksCrashTest();
            Console.WriteLine(" passed.");
        }
        Console.WriteLine("done");
    }

    // Opens the database, writes a single key, and disposes the handle again.
    private static void RocksCrashTest()
    {
        var dbo = new DbOptions()
            .SetCreateIfMissing()
            .SetCreateMissingColumnFamilies();

        var cfo = new ColumnFamilyOptions();

        // prevent crashes and database corruptions by setting "Compression.No"
        // cfo.SetCompression(Compression.No);

        var cf = new ColumnFamilies()
        {
            { "default", cfo },
        };

        var root = Path.GetTempPath(); // @"E:\RocksData";
        var path = Path.Combine(root, "CrashTest", "test.db");
        Directory.CreateDirectory(path);

        using var db = RocksDb.Open(dbo, path, cf);

        var defaultCf = db.GetColumnFamily("default");

        db.Put(Data, Data, defaultCf);
    }
}

oleg2k commented 2 years ago

Moreover, it crashes in a similar way in native code, using 6.27.3 lib files built on Windows.

#include <iostream>
#include <vector>

#include "rocksdb/db.h"

int main()
{
    constexpr char data[4] = { 0, 0, 0, 1 };

    for (auto i = 0;; i++)
    {
        rocksdb::Options dbOptions;
        dbOptions.create_if_missing = true;
        dbOptions.create_missing_column_families = true;

        auto cfOptions = rocksdb::ColumnFamilyOptions();
        cfOptions.OptimizeForPointLookup(16);
        //cfOptions.compression = rocksdb::kNoCompression; // no crash if uncommented

        std::vector<rocksdb::ColumnFamilyDescriptor> cfDescriptors;
        cfDescriptors.emplace_back("default", cfOptions);

        std::vector<rocksdb::ColumnFamilyHandle*> cfHandles;

        rocksdb::DB* db;

        const auto status = rocksdb::DB::Open(dbOptions, R"(E:\RocksData\ConsoleApplication31\test.db)", cfDescriptors, &cfHandles, &db);

        if (!status.ok())
        {
            std::cout << status.ToString() << "\n";
            break;
        }

        const auto stateCf = cfHandles[0];

        // Note: a char array converts to rocksdb::Slice via strlen, so the
        // leading zero byte makes both key and value empty here.
        db->Put(rocksdb::WriteOptions(), stateCf, data, data);

        delete db;

        std::cout << i << "\n";
    }
}

AllegraMcdonnell123 commented 2 years ago

Hi, does anyone know if this reproduces in the newest version of rocksdb? :) @leesei @oleg2k @theolivenbaum

fmarkor commented 2 years ago

This is a strange one indeed.

We've tried downgrading versions and we've tried upgrading versions. In every case, your wrapper runs the first time on one of our employees' Windows machines. However, when he tries to re-open the daemon (i.e. after data has been written to disk), an SEH exception is thrown, and all we know is that rocksdb_open_column_families threw an error somehow.

The machine where it fails to re-open the database is a Windows 10, 64-bit install running on an Intel Core i5 2320.

It should be mentioned that the code runs flawlessly on the 5 other machines it's been tested on.

Here's the full stacktrace:

   at RocksDbSharp.Native.rocksdb_open_column_families(IntPtr, System.String, Int32, System.String[], IntPtr[], IntPtr[])
   at RocksDbSharp.RocksDb.Open(RocksDbSharp.DbOptions, System.String, RocksDbSharp.ColumnFamilies)
   at Discreet.DB.DisDB..ctor(System.String)
   at Discreet.DB.DisDB.Initialize()
   at Discreet.DB.DisDB.GetDB()
   at Discreet.Daemon.TXPool..ctor()
   at Discreet.Daemon.TXPool.Initialize()
   at Discreet.Daemon.TXPool.GetTXPool()
   at Discreet.Daemon.Daemon..ctor()
   at Discreet.Program+<Main>d__3.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.__Canon, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](System.__Canon ByRef)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[System.__Canon, System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](System.__Canon ByRef)
   at Discreet.Program.Main(System.String[])
   at Discreet.Program.<Main>(System.String[])

We're currently looking into whether this is something that can be fixed or whether we'll have to abandon the wrapper entirely until it's in a more stable state. The unfortunate thing is that we cannot get any more information than that the error occurred at open_column_families in the native DLL. We get a similar error on another developer's Arch Linux machine.

Hopefully @warrenfalk @theolivenbaum or someone else might have an idea of what's wrong? :)

Edit: Thanks to @oleg2k for the research into compression fixing this issue. None of the previously mentioned systems had any issue with this wrapper once ColumnFamilyOptions was initialized with new ColumnFamilyOptions().SetCompression(Compression.No). Of course, compression would be nice, but for now it's an acceptable solution. When our schedule is less tight, we'll start some research into why the native library fails on some systems when compression is enabled and report back.

theolivenbaum commented 2 years ago

Hi @fmarkor! We're seeing the same issues on some machines, and in our case this can be easily tracked down to CPUs lacking AVX2. Are you seeing the same thing on your side? I was wondering recently if this is due to the compression-related imports when compiling RocksDB - seems like @oleg2k was spot on! I'll follow up from here to see if there are any missing flags for the imports that disable their use of AVX2; otherwise I'll just do the same and set compression to No when running on AVX2-less CPUs. Luckily this can be easily detected in C#.

theolivenbaum commented 2 years ago

Let's see what happens: https://github.com/curiosity-ai/rocksdb-sharp/commit/2626ca16200253e38c976de3309092d016c83d9c

fmarkor commented 2 years ago

We checked it again on the machines that were failing, and they all lacked AVX2, so that did indeed seem to be the determining factor. Thank you so much @theolivenbaum! However, purely from C#, I'm having trouble finding out whether AVX2 support can be detected from .NET 6 with a generalized multiplatform check, rather than a Windows-specific one.

Do you happen to know if there's any namespace that can do the equivalent of IsProcessorFeaturePresent without Windows-specific P/Invoke?

theolivenbaum commented 2 years ago

There is, and it's actually quite simple. You can use something like this:

System.Runtime.Intrinsics.X86.Avx2.IsSupported

It's actually quite nice when writing optimized C# code, as the checks are optimized away by the JIT, so there's no performance impact at runtime.

You can check for other instruction sets too using the Intrinsics namespace.
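Combining that check with the Compression.No workaround mentioned earlier in the thread might look like the following sketch. `CompressionWorkaround` and its members are hypothetical names used for illustration; only `Avx2.IsSupported`, `ColumnFamilyOptions`, `SetCompression`, and `Compression.No` come from the thread. Note that `Avx2.IsSupported` is also false on non-x86 platforms such as ARM, so this sketch would disable compression there as well, which may or may not be what you want.

```csharp
using System.Runtime.Intrinsics.X86;
using RocksDbSharp;

// Hypothetical helper, not part of RocksDbSharp.
public static class CompressionWorkaround
{
    // True when the current CPU or platform does not report AVX2 support.
    public static bool IsNeeded => !Avx2.IsSupported;

    // Returns column family options with compression turned off only on
    // machines where the AVX2-dependent native code path would crash.
    public static ColumnFamilyOptions CreateColumnFamilyOptions()
    {
        var options = new ColumnFamilyOptions();
        if (IsNeeded)
        {
            options.SetCompression(Compression.No);
        }
        return options;
    }
}
```

The returned options can be passed into `ColumnFamilies` exactly as in the reproduction code above, for example `{ "default", CompressionWorkaround.CreateColumnFamilyOptions() }`.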

theolivenbaum commented 2 years ago

Just found something interesting! https://github.com/microsoft/vcpkg/issues/15794

theolivenbaum commented 2 years ago

Let's see - I'm using my own vcpkg registry now to see if this works. If so, I can send the port upstream to vcpkg: https://github.com/curiosity-ai/vcpkg-registry

theolivenbaum commented 2 years ago

This might finally be fixed in the latest release! It seems like our usage of vcpkg for Windows builds was introducing the unwanted dependency on AVX2.

fmarkor commented 2 years ago

I believe @theolivenbaum was spot on when he said the AVX2 dependency was the problem. He had me test the new build on the machine of one of my employees, who uses an older CPU and previously had the problems described above, and it worked with compression enabled. Thank you so much @oleg2k and @theolivenbaum for working with me on this issue. I believe this can be considered a closed matter now.

theolivenbaum commented 2 years ago

Closing this for now. If anyone still hits it with the latest versions, please let us know!

theolivenbaum commented 2 years ago

Reported again by some of our users - so reopening :(

fmarkor commented 2 years ago

@theolivenbaum Did you get the specifications (OS, hardware, library version) of the users running into this problem? We have gotten it to run on both Linux and all versions of Windows (back to 7) and have had no issues using version 7.1.1.2841. I'd be happy for me or our team to look into this issue with you.

Also, was the exception thrown when opening the column families (i.e. compression would be the issue) or when calling rocksdb_get?

To get it to run on Unix-based systems you need to fetch the libz dependencies for compression not to error.

theolivenbaum commented 2 years ago

I think the issue they're seeing now is actually related to non-ASCII characters in the DB path. I'm trying to fix this with https://github.com/curiosity-ai/rocksdb-sharp/commit/539e7f95fd82c4fe288380a04d82ba75f7e43f3a, let's see!
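For anyone trying to tell whether they are hitting this path-related variant, a small diagnostic along these lines may help. It is only a sketch: `DbPathCheck` and `ContainsNonAscii` are hypothetical names, and the check simply flags any character outside the 7-bit ASCII range, on the assumption that such paths are the ones affected.

```csharp
using System;

// Hypothetical diagnostic helper, not part of RocksDbSharp.
public static class DbPathCheck
{
    // Returns true if the path contains any character outside 7-bit ASCII.
    public static bool ContainsNonAscii(string path)
    {
        if (path is null) throw new ArgumentNullException(nameof(path));

        foreach (var c in path)
        {
            if (c > 0x7F) return true;
        }
        return false;
    }
}
```

Logging the result of `DbPathCheck.ContainsNonAscii(path)` just before opening the database would make it easier to correlate crash reports with the path rather than with the CPU.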