Open schweers-qb opened 1 month ago
Thanks for the problem report. This looks like a bug -- possibly in the Arrow C interop code -- and I can reproduce it.
@CurtHagenlocher you're welcome! It's the same problem for Bind, I might add. I was going to swap this to a bug, but I don't think I can do that. If there's a preferred way to signal this is a bug, happy to do it!
The exception is being thrown when DuckDB releases a pointer that gets passed to it. In my quick repro, the array in question claims to have three children but only the first of the child pointers seems to be valid.
[Fact]
public async Task InsertFromFile()
{
    // Write temporary file
    Schema schema = new Schema([new Field("key", Int32Type.Default, false), new Field("value", StringType.Default, false)], null);
    RecordBatch recordBatch = new RecordBatch(schema, [
        new Int32Array.Builder().AppendRange([1, 2, 3]).Build(),
        new StringArray.Builder().AppendRange(["foo", "bar", "baz"]).Build()
    ], 3);
    string tempFile = _duckDb.CreateTempPath();
    using (var writeFile = new FileStream(tempFile, FileMode.Create, FileAccess.Write))
    {
        using var writer = new ArrowStreamWriter(writeFile, schema);
        writer.WriteRecordBatch(recordBatch);
    }

    // Create database
    using var database = _duckDb.OpenDatabase("insert_from_file.db");
    using var connection = database.Connect(null);
    using var statement = connection.BulkIngest("temp_table", BulkIngestMode.Create);

    // Read temporary file into new table
    using var readStream = new FileStream(tempFile, FileMode.Open, FileAccess.ReadWrite);
    using var reader = new ArrowStreamReader(readStream);
    statement.BindStream(reader);
    await statement.ExecuteUpdateAsync();
}
@CurtHagenlocher appreciate the confirming code. How would you recommend proceeding here? Should I open a bug at DuckDB against their ADBC implementation?
I think it's premature to assume that the bug is in DuckDB. I need to debug the repro to figure out what exactly is going wrong. That's not likely to happen during my working day, as I have no shortage of commitments, but I'll probably be able to do it this evening.
Ah I see, I read that as meaning DuckDB was the problem, but I get that may not actually be the case. If you have a recommended way to debug from the C# side through the native ADBC and DuckDB code, please let me know, i.e. however you figured this out:
"the array in question claims to have three children but only the first of the child pointers seems to be valid."
Thanks for the support on whatever timeline you can manage!
Ah, I should have been able to figure this out without debugging. There's a functionality gap described by https://github.com/apache/arrow/issues/36057 which prevents this from working. Unfortunately, the failure mode is nearly impossible to figure out without a debugger. There's a draft PR at https://github.com/apache/arrow/pull/40992 which should resolve the problem (and which I apparently need to prioritize).
You can work around the problem by cloning the data, which is of course a pretty high price to pay. I did that with
class ClonedArrayStream : IArrowArrayStream
{
    readonly IArrowArrayStream _arrowArrayStream;

    public ClonedArrayStream(IArrowArrayStream arrowArrayStream)
    {
        _arrowArrayStream = arrowArrayStream;
    }

    public Schema Schema => _arrowArrayStream.Schema;

    public async ValueTask<RecordBatch?> ReadNextRecordBatchAsync(CancellationToken cancellationToken = default)
    {
        // Clone each batch so the exported data no longer refers to the reader's buffers
        var next = await _arrowArrayStream.ReadNextRecordBatchAsync(cancellationToken);
        return next?.Clone();
    }

    public void Dispose() => _arrowArrayStream.Dispose();
}
and then binding the cloned stream instead of the original stream.
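For illustration, here is a minimal sketch of that binding step. It assumes the `ClonedArrayStream` class above and the same Apache.Arrow and ADBC types used in the repro; the `ClonedIngest` helper and its method name are hypothetical, made up only for this example.

```csharp
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow.Adbc;
using Apache.Arrow.Ipc;

// Hypothetical helper, not part of any library: binds a cloned stream
// instead of the ArrowStreamReader itself.
static class ClonedIngest
{
    public static async Task IngestFromFileAsync(AdbcStatement statement, string path)
    {
        using var readStream = new FileStream(path, FileMode.Open, FileAccess.Read);
        var reader = new ArrowStreamReader(readStream);

        // Disposing the wrapper also disposes the wrapped reader.
        using var cloned = new ClonedArrayStream(reader);

        // Same calls as in the repro, with the wrapper in place of the reader.
        statement.BindStream(cloned);
        await statement.ExecuteUpdateAsync();
    }
}
```

In the repro this would replace the last four statements, e.g. `await ClonedIngest.IngestFromFileAsync(statement, tempFile);`. Every batch is copied once, so this trades memory and time for avoiding the failure.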
@CurtHagenlocher ok thank you for the explanation and the working sample. That's similar to what I had working myself if I copied the data / stream around, which like you said, isn't ideal 😄 but doable. This is fine for my needs at the moment though and hoping the other changes land as well to make this zero copy. Thank you again for the support!
What would you like help with?
Hi there! I've tried and failed to correctly execute the BindStream method from the C# C wrapper API with multiple types of input streams in order to bulk ingest from an Arrow stream into DuckDB. If I use the mock stream implementation found here, my code will work. Otherwise, 'real' file and memory stream reads fail every time with the following error:
I can read that stack trace just fine and understand the problem it describes, but I can't easily debug the failure of the native code with regard to freeing memory from my current IDE (Rider). I realize the C# code is less documented and less mature than some of the other runtimes supported by ADBC (and DuckDB), so I suspect I might be calling the API or creating my stream without following a specific pattern that may be required?
Here's an example of how I'm attempting to use the API that leads to this error. My code is calling the C API from C# running on .NET 8 on macOS: