Giorgi / DuckDB.NET

Bindings and ADO.NET Provider for DuckDB
https://duckdb.net
MIT License
338 stars 61 forks

Feature Request: Apache Arrow support #26

Open charlie430 opened 2 years ago

charlie430 commented 2 years ago

I'm very interested in Apache Arrow being supported for the in-memory scenario.

Is there any information you can provide on when that might be supported?

Thanks!

Giorgi commented 2 years ago

It's not implemented at the moment, but feel free to send a PR that covers DuckDB's Arrow Interface API.

nazar554 commented 3 months ago

Would it be OK to use the Apache.Arrow package as a dependency? It already has data types for some of these objects, such as CArrowSchema for duckdb_arrow_schema.

Giorgi commented 3 months ago

Would that be added to the Binding project or the Data project?

nazar554 commented 3 months ago

Probably the Binding project, but the package might be too heavy for it. I guess for now I can create a DuckDBArrowSchema struct with private fields and a layout compatible with CArrowSchema, so consumers can just cast the pointer to CArrowSchema*.
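A minimal sketch of what such a layout-compatible struct could look like. DuckDBArrowSchema is the name proposed above and is hypothetical here, as is the LayoutCheck helper; neither is part of DuckDB.NET. The field order and widths follow `struct ArrowSchema` from the Arrow C data interface, and plain IntPtr fields are an assumption made so that the sketch compiles without unsafe code.

```csharp
using System;
using System.Runtime.InteropServices;

#pragma warning disable CS0169 // the fields exist only to define the memory layout

// Hypothetical type, not part of DuckDB.NET. Field order and widths follow
// `struct ArrowSchema` from the Arrow C data interface.
[StructLayout(LayoutKind.Sequential)]
public struct DuckDBArrowSchema
{
    private IntPtr format;       // const char*
    private IntPtr name;         // const char*
    private IntPtr metadata;     // const char*
    private long flags;          // int64_t
    private long nChildren;      // int64_t
    private IntPtr children;     // struct ArrowSchema**
    private IntPtr dictionary;   // struct ArrowSchema*
    private IntPtr release;      // void (*)(struct ArrowSchema*)
    private IntPtr privateData;  // void*
}

public static class LayoutCheck
{
    // Size in bytes as native code would see the struct.
    public static int SchemaSize() => Marshal.SizeOf<DuckDBArrowSchema>();
}
```

In a 64-bit process all nine fields are 8 bytes wide, so LayoutCheck.SchemaSize() should report 72, the same as sizeof(struct ArrowSchema) in C. A consumer that does reference Apache.Arrow could then reinterpret a DuckDBArrowSchema* as a CArrowSchema* inside an unsafe block.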

Giorgi commented 3 months ago

Honestly, I haven't looked much into Arrow and can't say for sure right now. Feel free to join the DuckDB Discord; we can discuss it in more detail in the dotnet channel.

CurtHagenlocher commented 2 months ago

An alternative for getting data as Arrow could be to use the C# ADBC implementation with the generic driver importer. This code is not very mature yet, but you can run queries and get the result back as Arrow. Example:

    using System.Collections.Generic;
    using Apache.Arrow.Adbc;
    using Apache.Arrow.Adbc.C;

    // Load the DuckDB native library and resolve its ADBC entry point.
    using AdbcDriver duckdb = CAdbcDriverImporter.Load("D:\\testdata\\duckdb.dll", "duckdb_adbc_init");
    // Open (or create) the database file, then connect to it.
    using AdbcDatabase db = duckdb.Open(new Dictionary<string, string> { { "path", "d:/testdata/ddbt.db" } });
    using AdbcConnection cn = db.Connect(null);
    using AdbcStatement stmt = cn.CreateStatement();

    stmt.SqlQuery = "CREATE TABLE integers(foo INTEGER, bar INTEGER);";
    stmt.ExecuteUpdate();

    stmt.SqlQuery = "INSERT INTO integers VALUES (3, 4), (5, 6), (7, 8);";
    stmt.ExecuteUpdate();

    stmt.SqlQuery = "SELECT * from integers";
    var results = stmt.ExecuteQuery();

    // results.Stream is an IArrowArrayStream, which lets you get the schema
    // and a set of record batches

Note that this code is still immature, and we haven't yet reached a 1.0 release.
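For readers who want to consume the stream, here is a minimal sketch of reading the schema and the record batches from an IArrowArrayStream with the Apache.Arrow package. ArrowStreamDump and PrintAsync are hypothetical names used for illustration, and the sketch assumes that a DuckDB INTEGER column arrives as an Int32Array.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;

public static class ArrowStreamDump
{
    // Hypothetical helper: drains an IArrowArrayStream and returns the row count.
    public static async Task<long> PrintAsync(IArrowArrayStream stream)
    {
        // The schema is available before any batch has been read.
        foreach (Field field in stream.Schema.FieldsList)
            Console.WriteLine($"{field.Name}: {field.DataType.Name}");

        long rows = 0;
        while (true)
        {
            // A null batch marks the end of the stream.
            using RecordBatch batch = await stream.ReadNextRecordBatchAsync(CancellationToken.None);
            if (batch == null) break;

            // Assumption: the first column is an INTEGER column surfaced as Int32Array.
            if (batch.Column(0) is Int32Array first)
            {
                for (int i = 0; i < first.Length; i++)
                    Console.WriteLine(first.GetValue(i)?.ToString() ?? "NULL");
            }
            rows += batch.Length;
        }
        return rows;
    }
}
```

With the query above it could be called as `long rows = await ArrowStreamDump.PrintAsync(results.Stream);`.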

Giorgi commented 2 months ago

Nice! I think it would be great if there were a way to go from a DuckDBConnection (provided by this library) to an AdbcConnection. I can expose the underlying pointer to the database (obtained from duckdb_open and duckdb_connect), but it looks like there is no way to convert such a pointer to an AdbcConnection object.

CurtHagenlocher commented 2 months ago

I know next to nothing about DuckDB internals, so I have no idea how plausible something like this is. For ADBC, we need an array of function pointers that defines the ADBC driver API -- this is what duckdb_adbc_init is initializing -- and the connection is then roughly an indirected opaque pointer that gets passed to some of these function pointers.
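The shape described here can be modelled in a few lines. The sketch below is a deliberately simplified, hypothetical model and uses none of the real ADBC or DuckDB types: DriverTable stands in for the table of function pointers, FakeDriver.Init plays the role of an entry point such as duckdb_adbc_init, and the IntPtr handle is the opaque connection that only the driver can interpret.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the driver's table of function pointers.
public sealed class DriverTable
{
    public DriverTable(Func<IntPtr> connectionNew, Func<IntPtr, string, int> executeUpdate)
    {
        ConnectionNew = connectionNew;
        ExecuteUpdate = executeUpdate;
    }

    public Func<IntPtr> ConnectionNew { get; }              // driver creates its own state
    public Func<IntPtr, string, int> ExecuteUpdate { get; } // caller hands the handle back
}

// Hypothetical driver. Only it knows what a connection handle refers to.
public static class FakeDriver
{
    private static readonly Dictionary<IntPtr, List<string>> State =
        new Dictionary<IntPtr, List<string>>();
    private static long nextHandle = 1;

    // Fills in the function table, as an init entry point would.
    public static DriverTable Init()
    {
        return new DriverTable(
            () =>
            {
                var handle = new IntPtr(nextHandle++);
                State[handle] = new List<string>();
                return handle;
            },
            (handle, sql) =>
            {
                // Throws KeyNotFoundException for handles this driver did not create.
                State[handle].Add(sql);
                return State[handle].Count;
            });
    }
}
```

In this model a handle that did not come from ConnectionNew is rejected, because the driver has no state registered for it. That is only an analogy for why a raw pointer obtained elsewhere cannot simply be wrapped as a connection; it says nothing about what DuckDB itself does internally.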