apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0

go/adbc/driver/flightsql: long delay between createPreparedStatement and getFlightInfoPreparedStatement #2040

Open aiguofer opened 1 month ago

aiguofer commented 1 month ago

What happened?

We have a Python API that uses the ADBC driver to execute queries against our Java Arrow Flight SQL server. The Python server uses FastAPI + Strawberry, and when we receive a request to execute a query, we spin up a background thread to handle the execution against the AFS server. Multiple threads on the same pod could be executing queries against the AFS server at any given moment.
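A minimal sketch of the concurrency pattern described above (names and pool size are hypothetical; the real service uses FastAPI + Strawberry):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: each incoming request hands its query off to a
# worker thread, so several threads may be driving ADBC cursors against
# the Flight SQL server at the same time.
executor = ThreadPoolExecutor(max_workers=8)

def run_query(conn, query):
    # conn is assumed to be a DBAPI connection from adbc_driver_flightsql
    with conn.cursor() as cur:
        cur.execute(query)
        return cur.fetchall()

def handle_request(conn, query):
    # Returns a Future; the caller can poll or block on the result.
    return executor.submit(run_query, conn, query)
```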

We recently noticed some issues with hanging queries, and when looking at our DataDog traces, we noticed that there is almost a 30-minute gap between the createPreparedStatement request and the getFlightInfoPreparedStatement request.

My initial guess is that this could be related to having multiple requests at the same time through the ADBC driver, but I don't have enough context about how the bindings between Go and Python work.

Is there anything that jumps out at you? Is there anything we could do to help debug this? Here are pictures of the traces:

(Screenshots of the DataDog traces: 2024-07-25 at 2:53:31 PM and 2:53:09 PM)

Stack Trace

No response

How can we reproduce the bug?

No response

Environment/Setup

Python 3.12
ADBC FlightSQL driver 1.0.0
ADBC driver manager 1.0.0

lidavidm commented 1 month ago

....hmm. 30 minutes is rather concerning. Would it be possible to get a thread dump (e.g. with gdb) if you notice that happening?
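If attaching gdb in production is awkward, one low-cost complement on the Python side (assuming a Unix host) is the standard-library faulthandler module, which can dump all Python thread stacks on a signal without killing the process. Note that this only shows Python frames, not the goroutines inside the Go driver:

```python
import faulthandler
import signal
import sys

# Dump every Python thread's stack to stderr when the process receives
# SIGUSR1, without terminating it (Unix only). Go goroutines inside the
# Flight SQL driver will not appear here.
faulthandler.register(signal.SIGUSR1, file=sys.stderr)
```

After this runs at startup, `kill -USR1 <pid>` prints the tracebacks of all Python threads to stderr while the process keeps running.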

There is a log-level environment variable that can be set (ADBC_DRIVER_FLIGHTSQL_LOG_LEVEL=debug), but I guess the first question is whether this is in any way reproducible, or something that was noticed after the fact via telemetry.
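For reference, the variable just needs to be in the environment before the driver is loaded; a minimal sketch:

```python
import os

# Must be set before the Flight SQL driver is initialized;
# "debug" enables verbose driver-side logging.
os.environ["ADBC_DRIVER_FLIGHTSQL_LOG_LEVEL"] = "debug"
```

Setting it via the shell (or the pod spec) before launching the Python process works equally well.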

We should be totally fine with concurrency. There are some locks around internal state, but they are meant to be held for extremely short periods.

aiguofer commented 1 month ago

Unfortunately it's not reproducible; we noticed it after the fact when users reported queries hanging. I only saw this happen with 4 queries, all around the same time, 3 on the same pod and 1 on a separate pod.

We might try setting ADBC_DRIVER_FLIGHTSQL_LOG_LEVEL=debug for some time in case it happens again.

lidavidm commented 1 month ago

@zeroshade or @joellubi any ideas here what could've happened?

lidavidm commented 1 month ago

Or, is there any way to get Go to dump its stacks without killing the process? (Maybe that's another debug feature we should consider: optionally binding to SIGUSR or something.)

zeroshade commented 1 month ago

There are a few ways we could do that, though if this is driven from Python threads it'll be a bit difficult to track down which stack is the problematic one. We could also have one signal start pprof profiling to a known filename, and a second signal stop it. The resulting profile would then be easy to pull up, read, and check out.

aiguofer commented 4 weeks ago

Well, this hasn't happened again, and we've since noticed some memory issues. It's possible that this pod was under memory strain and that caused the slowness. I'm going to close this and report back if we see it again.

aiguofer commented 3 weeks ago

Well, sadly we've continued to see this even after we fixed the memory issues, so I'm reopening. Any thoughts on how we might try to debug this?

lidavidm commented 3 weeks ago

We're about to do a release. We could add a bit more logging and hope that helps narrow things down?

lidavidm commented 3 weeks ago

That said, those are 2 separate calls in Python, so maybe the Python side could also be logged. Also, if you don't strictly need a prepared statement, that could be disabled as well.

lidavidm commented 3 weeks ago

> That said, those are 2 separate calls in Python. So maybe the Python side could also be logged.

So specifically, you can explicitly call stmt.adbc_prepare(query) before calling stmt.execute. That will let you log/time those calls separately, and we can try to narrow down where the extra 30 minutes is coming from.
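A sketch of splitting the two phases for timing, assuming a DBAPI cursor from the ADBC bindings (where `adbc_prepare` issues CreatePreparedStatement and a subsequent `execute` of the same query reuses the prepared statement):

```python
import time

def timed_execute(cursor, query):
    """Time the prepare and execute phases separately so a trace span
    or log line can show on which side the 30-minute gap occurs."""
    t0 = time.monotonic()
    cursor.adbc_prepare(query)   # CreatePreparedStatement round trip
    t1 = time.monotonic()
    cursor.execute(query)        # GetFlightInfoPreparedStatement + execution
    t2 = time.monotonic()
    print(f"prepare took {t1 - t0:.3f}s, execute took {t2 - t1:.3f}s")
    return cursor.fetchall()
```

Wrapping each phase in its own DataDog span (instead of the print above) would make the gap visible directly in the existing traces.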

I think that to avoid the prepared statement entirely, you'd have to drop down to the raw API. We could add a config option or something to control this instead...
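A sketch of what dropping to the raw API might look like, assuming the low-level `adbc_driver_manager` statement object with its `set_sql_query` / `execute_query` methods (check the current API docs before relying on these names):

```python
def execute_without_prepare(statement, query):
    # With the low-level statement API, CreatePreparedStatement is only
    # sent if you explicitly call prepare(); setting the query and
    # executing directly skips the prepared-statement round trip.
    statement.set_sql_query(query)
    stream, rowcount = statement.execute_query()
    return stream, rowcount
```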

aiguofer commented 3 weeks ago

ohhh interesting, I noticed the adbc_prepare in the docs recently and was wondering what it was for. We'll try that!

Unfortunately, using regular Statements is not an option for us right now: since neither ADBC nor JDBC issue those requests, we never implemented them on our server.

Side note/question: is there any plan to have (A|J)DBC use regular Statements instead of PreparedStatements? We did notice some recent attempted requests from a Go ADBC driver hitting the regular Statement flow on our server.

zeroshade commented 3 weeks ago

The Python ADBC bindings use the prepared-statement workflow rather than regular statements for performance and safety reasons: it ensures compatibility with DBAPI and parameter binding, while also giving users an easier workflow and reducing code duplication in the bindings.

The native implementations (the actual ADBC spec, the Go implementation, the C# implementation, etc.) expose the lower-level constructs, which let users decide for themselves whether to call Prepare. So those implementations won't use the Prepare workflow unless a user explicitly calls the StatementPrepare / Statement.Prepare functions.

lidavidm commented 3 weeks ago

> ohhh interesting, I noticed the adbc_prepare in the docs recently and was wondering what it was for. We'll try that!

I'll try to add an example.

> So those implementations won't use the Prepare workflow unless a user explicitly calls the StatementPrepare / Statement.Prepare functions

I'll try to add examples here too, and/or think of a way to let you opt out of this in the DBAPI bindings.

lidavidm commented 3 weeks ago

See https://github.com/apache/arrow-adbc/pull/2097