apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0

csharp: BindStream/Bind statements fail with 'Fatal error. System.AccessViolationException: Attempted to read or write protected memory.' #2265

Open schweers-qb opened 1 month ago

schweers-qb commented 1 month ago

What would you like help with?

Hi there! I've tried and failed to correctly execute the BindStream method from the C# C wrapper API with multiple types of input streams, in order to bulk ingest from an Arrow stream into DuckDB. If I use the mock stream implementation found here, my code works. Otherwise, 'real' file and memory stream reads fail every time with the following error:

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Apache.Arrow.C.CArrowArrayExporter.ReleaseArray(Apache.Arrow.C.CArrowArray*)

I can read that stack trace just fine and understand the problem it describes, but I can't easily debug the native code's failure to free memory from my current IDE (Rider). I realize the C# code is less documented and less mature than some of the other runtimes supported by ADBC (and DuckDB), so I suspect I may be calling the API or creating my stream without following some required pattern. Here's an example of how I'm attempting to use the API that leads to this error. My code is calling the C API from C# running on .NET 8 on macOS:

    public static async Task Main()
    {
        // Load the DuckDB driver following the pattern from the test directory
        using var driver = CAdbcDriverImporter.Load("/local/path/libduckdb-osx-universal/libduckdb.dylib",
            "duckdb_adbc_init");

        // Open a local persistent DB
        using var duckDb =
            driver.Open(new Dictionary<string, string> { { "path", "/local/path/test.db" } });
        using var connection = duckDb.Connect(null);

        // Create a new table on ingest
        using var statement = connection.BulkIngest("temp_table", BulkIngestMode.Create);

        // Create a file stream that reads in some sample Arrow stream data from the arrow-experiments repo
        // Note: if I mock this stream as mentioned above, I can get this code to work. In that scenario, no native memory access is involved in the stream binding
        await using var fileStream = new FileStream("/local/path/arrow-commits.arrows", FileMode.Open,
            FileAccess.ReadWrite);

        // Read the stream, bind it, execute the ingest
        using var reader = new ArrowStreamReader(fileStream);
        statement.BindStream(reader);
        await statement.ExecuteUpdateAsync();
    }
CurtHagenlocher commented 1 month ago

Thanks for the problem report. This looks like a bug -- possibly in the Arrow C interop code -- and I can reproduce it.

schweers-qb commented 1 month ago

@CurtHagenlocher you're welcome! It's the same problem for Bind, I might add. I'd swap the issue over to a bug myself, but I don't think I can do that; if there's a preferred way to signal that this is a bug, I'm happy to do it!

CurtHagenlocher commented 1 month ago

The exception is being thrown when DuckDB releases a pointer that gets passed to it. In my quick repro, the array in question claims to have three children, but only the first of the child pointers seems to be valid. (A small inspection sketch follows the repro below.)

        [Fact]
        public async Task InsertFromFile()
        {
            // Write temporary file
            Schema schema = new Schema([new Field("key", Int32Type.Default, false), new Field("value", StringType.Default, false)], null);
            RecordBatch recordBatch = new RecordBatch(schema, [
                new Int32Array.Builder().AppendRange([1, 2, 3]).Build(),
                new StringArray.Builder().AppendRange(["foo", "bar", "baz"]).Build()
                ], 3);
            string tempFile = _duckDb.CreateTempPath();
            using (var writeFile = new FileStream(tempFile, FileMode.Create, FileAccess.Write))
            {
                using var writer = new ArrowStreamWriter(writeFile, schema);
                writer.WriteRecordBatch(recordBatch);
            }

            // Create database
            using var database = _duckDb.OpenDatabase("insert_from_file.db");
            using var connection = database.Connect(null);
            using var statement = connection.BulkIngest("temp_table", BulkIngestMode.Create);

            // Read temporary file into new table
            using var readStream = new FileStream(tempFile, FileMode.Open, FileAccess.ReadWrite);
            using var reader = new ArrowStreamReader(readStream);
            statement.BindStream(reader);
            await statement.ExecuteUpdateAsync();
        }
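
One way to see the corruption directly, for anyone who wants to poke at it: export a batch through the C Data Interface yourself and walk the child pointers before any consumer releases them. This is a hypothetical diagnostic sketch, not part of the test above; it assumes the CArrowArray.Create/Free helpers and ExportRecordBatch from Apache.Arrow.C, and it requires an unsafe context:

        // Hypothetical diagnostic: export a RecordBatch through the C Data
        // Interface and dump its child pointers. A record batch exports as a
        // struct array whose children are the columns.
        static unsafe void InspectExportedBatch(RecordBatch batch)
        {
            CArrowArray* cArray = CArrowArray.Create();
            CArrowArrayExporter.ExportRecordBatch(batch, cArray);

            Console.WriteLine($"n_children = {cArray->n_children}");
            for (long i = 0; i < cArray->n_children; i++)
            {
                // A null or dangling child pointer here is what later makes
                // ReleaseArray throw an AccessViolationException.
                CArrowArray* child = cArray->children[i];
                Console.WriteLine($"child[{i}] = 0x{(ulong)child:x}");
            }

            // Release through the exporter's callback, as a consumer would,
            // then free the struct allocated by Create().
            cArray->release(cArray);
            CArrowArray.Free(cArray);
        }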
schweers-qb commented 1 month ago

@CurtHagenlocher appreciate the confirming code. How would you recommend proceeding here -- opening a bug against DuckDB's ADBC implementation?

CurtHagenlocher commented 1 month ago

I think it's premature to assume that the bug is in DuckDB. I need to debug the repro to figure out what exactly is going wrong. That's not likely to happen during my working day, as I have no shortage of commitments, but I'll probably be able to do it this evening.

schweers-qb commented 1 month ago

Ah, I see. I read that as meaning DuckDB was the problem, but I get that may not actually be the case. If you have a recommended way to debug from the C# side through the native ADBC and DuckDB code, please let me know, i.e., however you figured this out:

> the array in question claims to have three children but only the first of the child pointers seems to be valid.

Thanks for the support on whatever timeline you can manage!

CurtHagenlocher commented 1 month ago

Ah, I should have been able to figure this out without debugging. There's a functionality gap described by https://github.com/apache/arrow/issues/36057 which prevents this from working. Unfortunately, the failure mode is nearly impossible to figure out without a debugger. There's a draft PR at https://github.com/apache/arrow/pull/40992 which should resolve the problem (and which I apparently need to prioritize).

You can work around the problem by cloning the data, which is of course a pretty high price to pay. I did that with

        // Wraps an IArrowArrayStream and deep-copies each batch so the
        // consumer never takes ownership of the original native buffers.
        class ClonedArrayStream : IArrowArrayStream
        {
            readonly IArrowArrayStream _arrowArrayStream;

            public ClonedArrayStream(IArrowArrayStream arrowArrayStream)
            {
                _arrowArrayStream = arrowArrayStream;
            }

            public Schema Schema => _arrowArrayStream.Schema;

            public async ValueTask<RecordBatch?> ReadNextRecordBatchAsync(CancellationToken cancellationToken = default)
            {
                var next = await _arrowArrayStream.ReadNextRecordBatchAsync(cancellationToken);
                return next?.Clone();
            }

            public void Dispose() => _arrowArrayStream.Dispose();
        }

and then binding the cloned stream instead of the original stream.
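
In the repro above, that just means wrapping the reader before binding it, roughly:

        // Bind the cloning wrapper instead of the raw reader.
        // ClonedArrayStream.Dispose() also disposes the underlying reader.
        var reader = new ArrowStreamReader(readStream);
        using var cloned = new ClonedArrayStream(reader);
        statement.BindStream(cloned);
        await statement.ExecuteUpdateAsync();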

schweers-qb commented 1 month ago

@CurtHagenlocher ok, thank you for the explanation and the working sample. That's similar to what I had working myself when I copied the data/stream around, which, like you said, isn't ideal 😄 but is doable. It's fine for my needs at the moment, and I'm hoping the other changes land as well to make this zero-copy. Thank you again for the support!