apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Create Python examples of indirect HTTP GET Arrow client and server #40596

Open ianmcook opened 7 months ago

ianmcook commented 7 months ago

Describe the enhancement requested

Contribute Python client and server examples to the indirect HTTP GET examples in the arrow-experiments repo. The examples should demonstrate how to use a two-step sequence to retrieve Arrow data:

  1. The client sends a GET request to a server and receives a JSON response from the server containing one or more server URIs.
  2. The client sends GET requests to each of those URIs (in parallel) and receives a response from each server containing an Arrow IPC stream of record batches (exactly as in the simple GET examples).
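The client side of the two steps above might look roughly like the following. This is only a minimal sketch, not the example that ends up in arrow-experiments: the function names `fetch_bytes` and `fetch_indirect` and the `"uris"` key in the JSON response are assumptions made for illustration, and the real JSON layout may differ. It uses only the standard library and returns the raw bytes of each IPC stream unless a `decode` callable is supplied.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_bytes(uri):
    """GET a URI and return the whole response body as bytes."""
    with urlopen(uri) as resp:
        return resp.read()


def fetch_indirect(index_uri, fetch=fetch_bytes, decode=lambda body: body):
    """Two-step retrieval: fetch a JSON listing, then fetch every listed URI.

    `fetch` and `decode` are injectable so the flow can be exercised
    without a network or without pyarrow installed.
    """
    # Step 1: the first GET returns JSON naming one or more server URIs.
    # The "uris" key is a hypothetical layout chosen for this sketch.
    listing = json.loads(fetch(index_uri))
    uris = listing["uris"]

    # Step 2: GET each URI in parallel. pool.map yields results in the
    # same order as `uris`, regardless of which request finishes first.
    with ThreadPoolExecutor(max_workers=max(1, len(uris))) as pool:
        bodies = list(pool.map(fetch, uris))

    # Each body is expected to be one Arrow IPC stream of record batches.
    return [decode(body) for body in bodies]
```

If pyarrow is available, passing `decode=lambda body: pyarrow.ipc.open_stream(body).read_all()` should turn each response into a `pyarrow.Table`. Reading the full body before decoding keeps the sketch short; a real client would probably read batches from the response incrementally.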

Component(s)

Python

ianmcook commented 7 months ago

@CurtHagenlocher asked:

"Any thoughts on the "best" canonical way to return multiple record batches? Would that also be multipart/mixed or would it be better to avoid the delimiter problem and e.g. use an alternate Content-Type indicating that the response contains multiple streams?"

By "multiple record batches" do you mean record batches with different schemas? (Or maybe they have the same schema but it's important to keep them logically separated in separate IPC streams?) If that's what you mean, then I think the two-step / indirect approach described here is probably what we should generally recommend.

CurtHagenlocher commented 7 months ago

Yes, sorry, separate result sets with potentially-different schemas. I think the scenario here is visualization in the browser where the JS-based UI sends a single request for multiple results, each of which is relatively small. Having to do this as described here would mean having to maintain state on the server across multiple requests.

ianmcook commented 7 months ago

Ah, I see.

IIUC we do not have any facilities in IPC, Flight, Flight SQL, or ADBC that encapsulate multiple different-schema streams into one logical unit. And I don't think we're super eager to create anything like that. So using whatever facilities HTTP provides seems like the way to go.

So I think a multipart/mixed response (as described in #40598) is probably the best way to do this if you can't maintain state on the server side. If you choose a sufficiently obscure delimiter, the delimiter problem is exceedingly unlikely to be a real problem in practice, but we should research this more to better understand the risks.
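To make the delimiter concern concrete, here is a rough standard-library sketch of packing several already-serialized IPC streams into one multipart/mixed body and splitting them apart again. It is an illustration only, not what #40598 specifies: the helper names are hypothetical, and re-drawing the random boundary until it does not occur in any part is just one assumed way to sidestep a collision.

```python
import secrets

# Registered media type for an Arrow IPC stream.
ARROW_STREAM = "application/vnd.apache.arrow.stream"


def build_multipart_mixed(parts):
    """Return (boundary, body) for a multipart/mixed payload.

    The response header would then be:
        Content-Type: multipart/mixed; boundary="<boundary>"
    """
    # Draw a random boundary and re-draw in the (very unlikely) case
    # that it already occurs inside one of the binary parts.
    while True:
        boundary = secrets.token_hex(16)
        delim = b"--" + boundary.encode("ascii")
        if not any(delim in part for part in parts):
            break

    header = f"Content-Type: {ARROW_STREAM}\r\n\r\n".encode("ascii")
    chunks = []
    for part in parts:
        chunks += [delim, b"\r\n", header, part, b"\r\n"]
    chunks += [delim, b"--\r\n"]  # closing delimiter
    return boundary, b"".join(chunks)


def split_multipart_mixed(boundary, body):
    """Recover the raw part payloads from a body built as above."""
    delim = b"--" + boundary.encode("ascii")
    # Prepend CRLF so the first delimiter looks like all the others.
    segments = (b"\r\n" + body).split(b"\r\n" + delim)
    payloads = []
    for seg in segments[1:]:
        if seg.startswith(b"--"):  # reached the closing delimiter
            break
        _headers, _, payload = seg.partition(b"\r\n\r\n")
        payloads.append(payload)
    return payloads
```

With a 128-bit random boundary the collision check will essentially never trigger, which matches the intuition that the problem is unlikely in practice. The check does require the server to hold every part in memory before sending, so it would not carry over directly to a streaming response.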

ianmcook commented 7 months ago

"we do not have any facilities in IPC, Flight, Flight SQL, or ADBC that encapsulate multiple different-schema streams into one logical unit"

but there are some issues requesting this in ADBC: https://github.com/apache/arrow-adbc/issues/1447, https://github.com/apache/arrow-adbc/issues/1358

felipecrv commented 2 months ago

@CurtHagenlocher you should take a look at https://github.com/apache/arrow-experiments/pull/33, which specifies how Arrow streams can be served in multipart/mixed responses.