apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.56k stars 3.54k forks source link

[Python] Add client CookieMiddleware to pyarrow #34016

Open ravjotbrar opened 1 year ago

ravjotbrar commented 1 year ago

Describe the enhancement requested

We currently implement CookieMiddleware in our own Python Arrow Flight client to connect to a Dremio Flight endpoint. I plan to move this middleware into pyarrow, similar to the client cookie middleware for java .

Component(s)

Python

lidavidm commented 1 year ago

Ideally we would just provide Python bindings to the C++ implementation instead of having the work duplicated 3x

ravjotbrar commented 1 year ago

That's a fair point. Will look into it.

indigophox commented 1 year ago

@lidavidm I've been working with @ravjotbrar and I can't see that implementing the middleware callback contract across an FFI is a good option here. My thinking here:

Perhaps I am missing something further about why it would in fact be for the best to use the C++ impl anyways?

lidavidm commented 1 year ago

Just a quick response, but the Python middleware is the one that incurs FFI overhead on every call, not the C++ one.

On Wed, Feb 8, 2023, at 17:09, Paul Nienaber wrote:

@lidavidm https://github.com/lidavidm I've been working with @ravjotbrar https://github.com/ravjotbrar and I can't see that implementing the middleware callback contract across an FFI is a good option here. My thinking here:

• It's an application state management helper, and managing the state behind an FFI other than performance reasons doesn't make a lot of sense—and crossing the FFI extra times is more likely a performance hit than a benefit. • It would save duplication of the core implementation code, however the FFI definitions would increase complexity of maintenance. • The middleware callback interface is already in Python (if there's ever a performance argument it might make sense to allow two types of middleware objects where the Flight implementation detects whether there's a pure-C++ one and dynamically wires in the callbacks that way), so using C++ code for this small component would make extension or modification by Python app authors much more difficult. • Even without the goal of modifying the included middleware implementation, Python app authors could easily want to simply look at the code to understand its behaviour in more detail, and reusing the C++ implementation would again present a significant barrier to users here. • Arrow is a multi-language set of bindings/SDK. Normally all of the bindings would be written in native languages, which would mean code duplication, but as it stands some interpreted languages are wrapped around the C++ implementation for performance reasons alone. But expecting to reduce code/implementation duplication across multiple supported languages as a goal in and of itself is not usually something to expect in such a multi-language library implementation. Perhaps I am missing something further about why it would in fact be for the best to use the C++ impl anyways?

— Reply to this email directly, view it on GitHub https://github.com/apache/arrow/issues/34016#issuecomment-1423303637, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACQB33FSP7D34VMWT2QFG3WWQKRBANCNFSM6AAAAAAUPWGM2I. You are receiving this because you were mentioned.Message ID: @.***>

lidavidm commented 1 year ago

Also: if Dremio is looking for a Python client, I thought 1) Dremio is using Flight SQL and 2) there is a Flight SQL client for Python now: https://github.com/apache/arrow-adbc/tree/main/python/adbc_driver_flightsql

lidavidm commented 1 year ago

I've tested this against dremio (OSS), would that solve your higher-level goal?

lidavidm commented 1 year ago

The middleware callback interface is already in Python (if there's ever a performance argument it might make sense to allow two types of middleware objects where the Flight implementation detects whether there's a pure-C++ one and dynamically wires in the callbacks that way), so using C++ code for this small component would make extension or modification by Python app authors much more difficult.

Basically, I'm confused here. The middleware interface is in C++. Python middleware is actually a special C++ middleware that makes FFI calls back into Python. What I'm saying is you can mostly avoid that dance: just allow directly adding the C++ cookie middleware to the client from Python.

indigophox commented 1 year ago

The middleware is part of the application-level connection state logic (albeit a helper). Arrow is explicitly the transport below this, so the middleware logic is part of the application. Accordingly a Python app developer should have ownership of this data and behaviour, but if we wrap the C++ black box and save the FFI hop, then: How will a Python developer understand (reading the code), debug, extend, learn from, etc. the provided middleware class?

This does not seem to me to provide a good developer experience for Python folks, while also creating more complicated maintenance around a dummy Python/Cython object that purports to implement the Python middleware class (which is already a defined interface) but then simply injects a native C++ class in a fairly nonstandard way.

As before I can see providing this type of native-code middleware binding as an option if it's seen to be a performance bottleneck for some high TPS Arrow users' use cases, but in any other circumstance (and as a default implementation) it seems to be a detriment to the Python app developer's experience.

lidavidm commented 1 year ago

If it's completely application layer: then it probably doesn't belong in the core library at all.

Most/all of the Flight implementation is already in C++, so there's not really a difference there.

There's not really much maintainer time to take on multiple implementations of everything. That is my primary worry. If Dremio is willing to help with maintenance here, then I think having the Python implementation is fine. If it goes unmaintained and unresolved issues arise, then it may get deprecated instead. (Though I suppose the middleware should be fairly trivial, since you can take advantage of the standard library, so that may not be a worry in the end.)

aiguofer commented 1 year ago

I've been playing around with the Python ADBC driver and noticed the cookies the the server was trying to set weren't being set by the client. I kind of assumed all Flight clients respected cookies, and would love to have that be the case.

The Python ADBC driver is interesting because it's not actually using the Python FlightClient, but instead it's using bindings into the Go ADBC driver, which uses the Go FlightClient. It seems neither of those currently have a Client Cookie Middleware. What's a good path forward to track this work? It seems like the Client cookie middleware implementations should live in this repo (like the java one does), and the Flight Clients created by the ADBC drivers should make sure to use them.