apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0

csharp: Redesign the C# API to allow more asynchronous operations #1843

Open CurtHagenlocher opened 1 month ago

CurtHagenlocher commented 1 month ago

What feature or improvement would you like to see?

The C# API should allow more asynchronous operations. Currently, the only asynchronous methods it offers are ExecuteQueryAsync and ExecuteUpdateAsync.

My mental model for the four ADBC "object" types is as follows:

The driver is analogous to the ODBC or JDBC driver, or the ADO.NET provider. In JDBC, this concept is represented by the java.sql.Driver interface. In ADO.NET, it's represented by the DbProviderFactory class. Because this object is strictly about code, it isn't expected to do any IO beyond what loading the code itself may require, e.g. paging a binary image in from disk.

The database is analogous to the JDBC DataSource, the ODBC "DSN" or the ADO.NET connection string. It represents the information and capability required to create a database connection but does not itself do IO until it tries to create a connection. (This would imply that parameter validation which requires network access -- e.g. to validate a host name -- is deferred until the connection is opened. Perhaps that's too limiting?)

Because neither the driver nor the database does IO, neither needs async methods, and that includes async cleanup. The one exception is Connect.

The connection represents an actual session with a database. This matches an ODBC connection, a JDBC java.sql.Connection or an ADO.NET DbConnection. Opening a connection, closing it or using it to fetch information about the data source are all operations likely to require IO so these all require async implementations.

The statement is a unit of bookkeeping related to certain types of database operations. In some cases, a connection can only have a single active statement running against it, but it can be useful to have multiple statements even then if, for instance, each one is a prepared statement that represents both client-side and server-side resources. The statement is analogous to an ODBC statement, a JDBC java.sql.Statement or an ADO.NET DbCommand. Due to the need to clean up an in-progress operation or to release server-side resources, the cleanup of a statement might do IO and should therefore support asynchrony. But when the statement is first created, it only represents the potential for future work and so creation is always synchronous.
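To make the sync/async split above concrete, here is a minimal sketch of what the four object types might look like. Every name in it (ISketchDriver, ISketchDatabase, ISketchConnection, ISketchStatement, the Fake* classes and SketchDemo) is hypothetical and made up for illustration; none of this is the existing Apache.Arrow.Adbc API or a proposed final shape. It assumes C# 8 and a target with IAsyncDisposable (.NET Core 3.0 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical shapes only; these are NOT the real Apache.Arrow.Adbc types.

// Driver: strictly code, no IO, so everything on it is synchronous.
public interface ISketchDriver
{
    ISketchDatabase Open(string connectionInfo);
}

// Database: holds connection information; only Connect does IO,
// so cleanup can stay synchronous (IDisposable).
public interface ISketchDatabase : IDisposable
{
    Task<ISketchConnection> ConnectAsync();
}

// Connection: a live session; metadata calls and closing may do IO.
public interface ISketchConnection : IAsyncDisposable
{
    ISketchStatement CreateStatement();      // sync: only potential work
    Task<string> GetInfoAsync(string key);
}

// Statement: created synchronously, executed and cleaned up asynchronously.
public interface ISketchStatement : IAsyncDisposable
{
    Task<long> ExecuteUpdateAsync(string sql);
}

// In-memory stand-ins so the sketch runs; Task.Yield stands in for real IO.
public sealed class FakeDriver : ISketchDriver
{
    public readonly List<string> Log = new List<string>();
    public ISketchDatabase Open(string connectionInfo) => new FakeDatabase(Log);
}

sealed class FakeDatabase : ISketchDatabase
{
    private readonly List<string> _log;
    public FakeDatabase(List<string> log) { _log = log; }
    public async Task<ISketchConnection> ConnectAsync()
    {
        await Task.Yield();
        _log.Add("connect");
        return new FakeConnection(_log);
    }
    public void Dispose() => _log.Add("database disposed (sync)");
}

sealed class FakeConnection : ISketchConnection
{
    private readonly List<string> _log;
    public FakeConnection(List<string> log) { _log = log; }
    public ISketchStatement CreateStatement() => new FakeStatement(_log);
    public async Task<string> GetInfoAsync(string key)
    {
        await Task.Yield();
        return "fake:" + key;
    }
    public async ValueTask DisposeAsync()
    {
        await Task.Yield();
        _log.Add("connection closed (async)");
    }
}

sealed class FakeStatement : ISketchStatement
{
    private readonly List<string> _log;
    public FakeStatement(List<string> log) { _log = log; }
    public async Task<long> ExecuteUpdateAsync(string sql)
    {
        await Task.Yield();
        _log.Add("execute: " + sql);
        return 0;
    }
    public async ValueTask DisposeAsync()
    {
        await Task.Yield();
        _log.Add("statement released (async)");
    }
}

public static class SketchDemo
{
    // Walks the four objects once and returns the order of operations.
    public static async Task<List<string>> RunAsync()
    {
        var driver = new FakeDriver();
        using (var database = driver.Open("host=example"))
        {
            await using (var connection = await database.ConnectAsync())
            await using (var statement = connection.CreateStatement())
            {
                await statement.ExecuteUpdateAsync("DELETE FROM t");
            }
        }
        return driver.Log;
    }
}
```

In this sketch the only awaits are on connecting, executing, and disposing the connection and statement. Opening the driver's database, creating the statement, and disposing the database stay synchronous, which is the division the description above argues for.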

I'd be curious to hear how well this aligns with others' points of view.

CurtHagenlocher commented 1 month ago

More broadly, there should also be a clear theory of sync/async including preferences for which to implement in the "pure C#" case and how. (The import/export case is going to be governed by the limitations of the C API.)