apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0
360 stars 86 forks source link

[Format] Formalize thread safety and concurrency guarantees #64

Closed lidavidm closed 2 years ago

lidavidm commented 2 years ago

Things to consider

Example: libpq disallows concurrent queries through a single PGconn, so multiple AdbcStatements can't be used if they share a connection (and the semantics of that get murky anyways) - but what should the behavior be?

zeroshade commented 2 years ago

I'll chime in from the Go database/sql package:

One thing to keep in mind though is that the semantics of Goroutines do not require execution on the same OS thread and are subject to the Go scheduler. This is only really relevant in the case where we'll use CGO to communicate with the actual C libraries. So while there are promises from the database/sql package in terms of concurrent usage, it can still be used my multiple different OS threads and multiple different Goroutines, just not concurrently.

Go does provide a method to lock a particular goroutine to an OSthread if needed, so our thread safety guarantees will need to also address if we expect a given object to be allowed to be used by multiple threads (making TLS infeasible for example) or not.

lidavidm commented 2 years ago

Thanks, I think that aligns with what I am basically assuming so far: no concurrent access (though maybe a particular driver can relax this, e.g. Flight SQL), but also no guarantees on thread identity (I think some APIs call this "serialized" access).

I think the main thing will be to either define that concurrent statements cannot be created from the same connection (may be overly restrictive, especially for Flight SQL), or that they cannot necessarily be used concurrently.

paleolimbot commented 2 years ago

no guarantees on thread identity

This is something I've noticed about calls to the struct ArrowArrayStream methods from both Arrow and DuckDB (sometimes they happen on different threads although properly serialized). It precludes a clean separation between producer and consumer if the producer wants/needs to evaluate R code, which can only happen on exactly one thread. We can, of course, error if we detect a call from the wrong thread, but consumers doing that kind of threading may already have the facilities to coordinate that kind of access (e.g., Arrow C++ has the RunInSerialExecutor). A concrete example of a producer wanting to evaluate R code would be a generic R DBI driver for ADBC or an Acero driver whose SQL calls user-defined functions registered in the R package.

lidavidm commented 2 years ago

@paleolimbot in that case, wouldn't it be the responsibility of the R/ADBC bridge to manage the threading? Requiring that all calls to ADBC happen from the same thread is rather restrictive, and it sounds like even that wouldn't be sufficient (it would have to be from a single main thread, right?)

pitrou commented 2 years ago

@paleolimbot in that case, wouldn't it be the responsibility of the R/ADBC bridge to manage the threading?

That sounds reasonable to me (if at all possible). I presume it is possible to queue R function calls from another thread?

paleolimbot commented 2 years ago

I presume it is possible to queue R function calls from another thread?

It's an absolute nightmare to do safely without the consumer evaluating the function call in a very special way that both the producer and consumer know about (e.g., how we do it in the Arrow R package); however, the fact that R never considered this shouldn't impose restrictions on ADBC that make it painful to use.

I just bring it up in case limiting method calls to the thread that created the connection is not in fact painful, and with the vague feeling that the sorry sod who has to work around this is going to end up being me.

lidavidm commented 2 years ago

So digging around, I would probably vote that for ADBC, we roughly follow the JDBC/ODBC concurrency/thread safety guarantees.

Here, a statement is just a handle/object to configure query state, and not necessarily a prepared statement (so, using JDBC/ODBC terminology, or a Cursor in Python DBAPI terms).

Connections

Flight SQL - connections are safe for concurrent access (Send + Sync) Golang database/sql - it's not stated whether Conn is safe for concurrent use libpq - a single connection cannot be used by multiple threads at once, but the library is thread-safe (Send) JDBC - serial access is safe (Send) (e.g. see Oracle driver docs) but concurrent access is driver-dependent ODBC - all handles must be thread safe, but the driver may serialize accesses internally (Send + Sync)

Statements

The question here is whether you should be able to create two AdbcStatements and use both of them, either from the same thread or different threads. (This question will sound a little funny to users of some APIs; I'm thinking about JDBC/ODBC/DBAPI here)

Flight SQL - no explicit concept of a statement outside of a prepared statement. It is not defined whether a FlightInfo can be reused or not, but the individual endpoints within can be accessed in parallel Golang database/sql - no explicit concept of a statement outside of a prepared statement libpq - the protocol only allows for a single query at a time (pipelining can get around this) JDBC - "Each Connection object can create multiple Statement objects that may be used concurrently by the program." (JDBC 4.1 spec). And "only one ResultSet object per Statement object can be open at the same time". ODBC - multiple "active" statements are allowed on a single connection, but the driver may limit this number (Send + Sync)

Prepared statements

Flight SQL - it is not defined whether prepared statements are thread safe or not. However, parameter binding is stateful so it doesn't really make sense to try to use them concurrently Golang database/sql - allows concurrent use from multiple goroutines (this is unusual!) libpq - since connections are not thread safe, neither are prepared statements JDBC - prepared statements are stateful (e.g. parameters) so it does not make sense to use them concurrently ODBC - same as statements

lidavidm commented 2 years ago

I just bring it up in case limiting method calls to the thread that created the connection is not in fact painful, and with the vague feeling that the sorry sod who has to work around this is going to end up being me.

Connection pooling is quite common, so this limitation would be extraordinarily painful

paleolimbot commented 2 years ago

No worries then - ignore me!

lidavidm commented 2 years ago

It's still worth thinking about, even if the answer is no :)

zeroshade commented 2 years ago

An individual statement can be used multiple times, but result sets cannot be read concurrently (that is: executing a statement invalidates prior result sets)

Are we referring to result sets from different queries or only putting a limitation on result sets from a single prepared statement? Go's Query returns a Rows object which is then used to iterate over the results, but there's no requirement on when that iteration can happen. To guarantee safety we'd have to buffer the entire result set into memory to ensure that they don't get invalidated by a subsequent query, which would be extremely undesireable. This would also technically be true for the proposed ADBC API which just populates the ArrowArrayStream* and let's the consumer decide how/when to iterate that stream and release it. This seems like an extraordinarily painful requirement.

Prepared statements follow regular statements in behavior. (This makes Golang support a little weird; Golang is exceptional in allowing concurrent use of prepared statements)

So I looked at the docs again, and as per my comment here I was wrong. While the prepared statement can be used by multiple goroutines, it will not be used concurrently by those goroutines. Basically the database/sql package's sql.Stmt object is safe for concurrent use, but it internally ensures serialization to it's calls to the underlying Driver's driver.Stmt object. It is, however, bound to a single Connection object.

lidavidm commented 2 years ago

Are we referring to result sets from different queries or only putting a limitation on result sets from a single prepared statement?

So to be clear, "statement" here refers to any query, since there is a separate structure (AdbcStatement) to hold query state. (This is analogous to JDBC/ODBC Statement, or DBAPI Cursor.)

I am only talking about a single query, so this is sort of irrelevant to Go. For Go, I would expect each call to Query to initialize and use a new AdbcStatement, at which point it would be up to the driver to manage concurrency. Does that sound reasonable?

If we add the one-shot Query to ADBC itself, it would have the same semantics as creating an independent AdbcStatement for each call.

While the prepared statement can be used by multiple goroutines, it will not be used concurrently by those goroutines. Basically the database/sql package's sql.Stmt object is safe for concurrent use, but it internally ensures serialization to it's calls to the underlying Driver's driver.Stmt object. It is, however, bound to a single Connection object.

Thanks - so, basically consistent with every other API then.

zeroshade commented 2 years ago

I am only talking about a single query, so this is sort of irrelevant to Go. For Go, I would expect each call to Query to initialize and use a new AdbcStatement, at which point it would be up to the driver to manage concurrency. Does that sound reasonable?

So, if a consumer hasn't picked up all of the results from a call to a prepared statement before executing it again, they will lose any remaining results? That seems difficult to enforce or otherwise prevent a user from doing. Is there a reason we can't keep the ArrowArrayStream* result set as independent? For example, using a prepared statement to query, then handing the result stream off to another thread to consume while exeucting it again with different arguments. (FlightSQL would support this easily since each execution would be a separate FlightInfo ticket)

lidavidm commented 2 years ago

Flight SQL would be able to do this easily, but libpq can't do this at all, for instance - that's the balance I'm trying to strike here. (Also even for Flight SQL, I'm not certain whether any implementations actually support concurrent execution of a prepared statement, because as noted, binding parameters is stateful - so it's unclear whether you are meant to be able to bind parameters and execute a prepared statement again while still consuming the results of a prior execution)

lidavidm commented 2 years ago

I suppose we could keep the ArrowArrayStream as independent, and that would require Execute to block and accumulate all results in the case of libpq (which is also probably reasonable - actually I think that's what it already does unless you explicitly go for single-row mode)

lidavidm commented 2 years ago

Hmm, but that might impact something like SQLite or DuckDB - I'd have to test how DuckDB behaves here, though.

In general, my understanding of most APIs is that prepared statements aren't really set up to benefit concurrent execution, only repeated execution from a single logical chain of execution...so Go still stands out here as the odd one out

lidavidm commented 2 years ago

Draft of some clarifying text in #66

lidavidm commented 2 years ago

So DuckDB's API fully materializes the query result. I guess we can do that with libpq as well. Then we can support Go's use case (execute statement, hand off result set to something else, execute statement again). I suppose worst case, a driver that absolutely can't support this natively, but which also would prefer not to eagerly materialize results, could either error or lazily materialize the results when forced to (though I worry that leads to potential performance pitfalls).

lidavidm commented 2 years ago

Long story short @zeroshade I'll tweak #66 to clarify that result sets are not invalidated except by closing the statement, but to caution that some drivers may have to materialize result sets to support this?

I think the current SQLite driver will be the hard one to support (it will need to be heavily refactored since it currently iterates over the results) but also SQlite probably isn't the most important target anyways

lidavidm commented 2 years ago

It will still create difficulties for drivers that wish to wrap JDBC/ODBC, which explicitly disallow this, though. Those will have to materialize all results in memory.

pitrou commented 2 years ago

Having to materialize all results instead of streaming them sounds like a rather major pitfall.

lidavidm commented 2 years ago

JDBC/ODBC and Flight SQL/Go are unfortunately at odds here, though. I'm not sure if there's a great way to express that Flight SQL lets you do this without penalty without also penalizing the JDBC/ODBC case.

JDBC/ODBC wrappers could also potentially lazily materialize; a second call to Execute() would force the original result set to be materialized before executing the new query. This would mean the driver has to manage that synchronization itself.

pitrou commented 2 years ago

I'm not sure if there's a great way to express that Flight SQL lets you do this [...]

Is it important to express it?

lidavidm commented 2 years ago

Possibly not, given that most APIs don't (or, they materialize up front).

I'm also still a little skeptical whether Flight SQL truly allows this, or whether it's just not been clearly defined. Maybe @jduo can chime in on the intent (can a client execute a prepared statement twice, then read the result sets in concurrently?)

lidavidm commented 2 years ago

Hmm. Go's docs state this for Stmt:

When the Stmt needs to execute on a new underlying connection, it will prepare itself on the new connection automatically.

@zeroshade does this mean the execution is truly concurrent on a single connection? It seems Go is cheating and using a connection pool internally?

zeroshade commented 2 years ago

In general, my understanding of most APIs is that prepared statements aren't really set up to benefit concurrent execution, only repeated execution from a single logical chain of execution...so Go still stands out here as the odd one out

To be fair, nothing in the Go docs guarantees one way or the other as far as viability in this regard. So I went and dug into the actual database/sql package code...

It turns out that under the hood, the database/sql package locks the connection until you call close on the Rows object, indicating you don't need the result set anymore. If you attempt to execute a prepared statement a second time before you close the previous result set, it internally grabs a new connection from the pool and re-prepares itself on the new connection and executes there, allowing the existing result set to continue just fine.

zeroshade commented 2 years ago

@lidavidm I was writing that response and hten stepped away for a meeting and came back to all your comments haha. but TL;DR: looks like under the hood the database/sql package in Go expects similar semantics that executing the same prepared statement on the same connection invalidates the previous results, and actively ensures that doesn't happen for consumers.

lidavidm commented 2 years ago

I should've read the Go docs a little more closely :sweat_smile: in that case I'll keep the linked PR as-is, if it looks reasonable. Thanks for the help!

zeroshade commented 2 years ago

@lidavidm Don't beat yourself up, it wasn't in the docs haha. I had to read the actual source code to figure that out :smile: