apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Create Python examples of HTTP GET Arrow client/server supporting range requests #40597

Open ianmcook opened 4 months ago

ianmcook commented 4 months ago

Describe the enhancement requested

Contribute Python client and server examples to the HTTP GET range request examples in the arrow-experiments repo. The examples should demonstrate a client/server pair that sends and receives data of known size (Content-Length response header) in the Arrow IPC streaming format and supports range requests (Accept-Ranges: bytes and Content-Range: bytes response headers; Range: bytes request header).

The main purpose of this example should be to demonstrate that range requests can be used to resume interrupted GET requests. See the discussion in the comments below for other possible uses of range requests.
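
As a starting point, here is a minimal sketch of the server side using only Python's standard library. The names `parse_range` and `make_handler` are illustrative and do not come from the arrow-experiments repo. The sketch assumes the whole payload fits in memory and handles only a single byte range per request.

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Matches a single byte range such as "bytes=100-199", "bytes=500-" or "bytes=-10".
RANGE_RE = re.compile(r"bytes=(\d*)-(\d*)$")


def parse_range(header, size):
    """Return an inclusive (start, end) pair, or None if the range cannot be served."""
    m = RANGE_RE.match(header.strip())
    if not m or not (m.group(1) or m.group(2)):
        return None
    if not m.group(1):
        # Suffix form "bytes=-N": the last N bytes of the payload.
        start, end = max(size - int(m.group(2)), 0), size - 1
    else:
        start = int(m.group(1))
        # An open-ended or overlong range is clipped to the last byte.
        end = min(int(m.group(2)), size - 1) if m.group(2) else size - 1
    return (start, end) if start <= end else None


def make_handler(payload):
    """Build a request handler class that serves `payload` (bytes) on every GET."""

    class RangeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            size = len(payload)
            header = self.headers.get("Range")
            if header is None:
                status, start, end = 200, 0, size - 1
            else:
                rng = parse_range(header, size)
                if rng is None:
                    # Simplification: malformed and multi-range headers are also
                    # rejected here; a stricter server would ignore those and
                    # send the full body with status 200.
                    self.send_response(416)
                    self.send_header("Content-Range", f"bytes */{size}")
                    self.send_header("Content-Length", "0")
                    self.end_headers()
                    return
                status, (start, end) = 206, rng
            body = payload[start:end + 1]
            self.send_response(status)
            self.send_header("Content-Type", "application/vnd.apache.arrow.stream")
            self.send_header("Accept-Ranges", "bytes")
            self.send_header("Content-Length", str(len(body)))
            if status == 206:
                self.send_header("Content-Range", f"bytes {start}-{end}/{size}")
            self.end_headers()
            self.wfile.write(body)

    return RangeHandler
```

To serve a file, something like `HTTPServer(("localhost", 8008), make_handler(open("arrow-commits.arrows", "rb").read())).serve_forever()` should work. The server never parses the Arrow data; it slices bytes only.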

Component(s)

Python

ianmcook commented 4 months ago

For convenience, this server example could serve the data in the file contributed by @paleolimbot here: https://github.com/apache/arrow-experiments/blob/main/data/arrow-commits/arrow-commits.arrows

ianmcook commented 4 months ago

Range requests are how HTTP clients resume interrupted downloads.

To test whether a server supports range requests, you can use curl like this:

  1. Start the download, limiting the speed so it doesn't finish too quickly.
    curl --limit-rate 10K -o file.arrows http://localhost:8008
  2. Press ^C to interrupt the download.
  3. Resume the download.
    curl --limit-rate 10K -o file.arrows -C - http://localhost:8008
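
The resume step can also be sketched in Python with the standard library. This is a rough equivalent of `curl -C -`, not a drop-in replacement; `resume_request` and `resume_download` are hypothetical helper names. It assumes that a 416 response to a resume attempt means the local file is already complete.

```python
import os
import urllib.error
import urllib.request


def resume_request(url, path):
    """Build a GET request that continues after the bytes already stored in `path`."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    return urllib.request.Request(url, headers=headers), offset


def resume_download(url, path, chunk_size=64 * 1024):
    """Download `url` into `path`, appending to a partial file when the server allows it."""
    req, offset = resume_request(url, path)
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as err:
        if err.code == 416 and offset:
            # Assumption: the requested start is past the end, so nothing is left to fetch.
            return offset
        raise
    with resp:
        if resp.status == 206:
            mode = "ab"  # the server honored the range: append to the partial file
        else:
            mode, offset = "wb", 0  # the server sent the whole body: start over
        with open(path, mode) as out:
            while chunk := resp.read(chunk_size):
                out.write(chunk)
                offset += len(chunk)
    return offset
```

Calling `resume_download("http://localhost:8008", "file.arrows")` after an interrupted transfer should fetch only the missing tail if the server answers with 206 Partial Content.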

kou commented 4 months ago

If we want to add examples for range requests, we may want to use not only the stream format but also the file format. We can download only needed record batches (and footer) with the file format and range requests.

ianmcook commented 4 months ago

@kou Do you mean that a client could send a range request like Range: batches=x-y instead of Range: bytes=x-y? In that case: yes, the server could retrieve the requested batches more efficiently if the data on the server side were in the IPC file format, because the footer contains the byte offsets and sizes of each record batch.

But I am -1 on recommending the use of range requests with units that are not bytes. Although this is allowed by HTTP/1.1 (as described in RFC 2616 Section 3.12) and also by HTTP/2 (as described in RFC 7540 Section 8), HTTP clients and servers in general do not support this well. At best it would require overriding classes/behaviors of the HTTP server libraries that are rarely overridden. At worst it would be altogether incompatible with some HTTP clients and servers.

I think it is better if we recommend that HTTP APIs should handle requests for specific ranges of batches using whatever higher-level application-specific methods they choose (such as URL query parameters) and restrict the use of range requests to bytes units only.

paleolimbot commented 4 months ago

I think the idea with the file format might be more like: grab the last 10 bytes of the file (a 4-byte footer length followed by the 6-byte magic string ARROW1), then grab the footer (which contains offsets for various message locations), then grab specific batches, perhaps in parallel.
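
A small sketch of the first two steps, computing which byte ranges to request. In the Arrow IPC file format the file ends with a 4-byte little-endian footer length followed by the 6-byte magic string `ARROW1`, so the trailer is 10 bytes. The helper names `footer_range` and `range_header` are made up for illustration, and decoding the footer itself (a Flatbuffers message) is left to an Arrow library.

```python
import struct

MAGIC = b"ARROW1"
TRAILER_SIZE = 4 + len(MAGIC)  # int32 footer length + magic string = 10 bytes

# Suffix range that asks a server for just the trailer.
TRAILER_RANGE = f"bytes=-{TRAILER_SIZE}"


def footer_range(trailer, file_size):
    """Given the last 10 bytes of an IPC file, return the inclusive byte range of the footer."""
    if len(trailer) != TRAILER_SIZE or trailer[4:] != MAGIC:
        raise ValueError("not an Arrow IPC file trailer")
    (footer_len,) = struct.unpack("<i", trailer[:4])
    end = file_size - TRAILER_SIZE - 1  # last footer byte sits just before the trailer
    start = end - footer_len + 1
    if footer_len <= 0 or start < 0:
        raise ValueError("implausible footer length")
    return start, end


def range_header(start, end):
    """Format an inclusive byte range as a Range request header value."""
    return f"bytes={start}-{end}"
```

The total file size needed by `footer_range` can be read from the `Content-Range` header of the response to the trailer request (the number after the slash). Once the footer is decoded, each record batch block's offset and lengths can be turned into further `range_header` values and fetched independently.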

I don't know if it's worth documenting, but I wonder if APIs would want to serve something like the footer metadata (which includes the offsets) via whatever API the client is calling to get the URI to the data in the first place, then the client could use range requests to read specific batches (or split up the fetch in parallel using a thread pool) in the same way.

ianmcook commented 4 months ago

Ok, that makes sense in general, but I don't think it helps with handling byte range requests. For byte range requests, we can just treat IPC stream files as opaque bytes.

ianmcook commented 4 months ago

Re the idea of an API serving metadata similar to the IPC file format footer which the client could use to make range requests:

Perhaps, but I suspect that the real-world usefulness of that would be minimal. I can't envision a compelling case where a client user would want to retrieve a subset of data by looking up an ordinal byte range as opposed to passing more contextually meaningful query parameters.

paleolimbot commented 4 months ago

I can't comment on the utility bit; you are almost certainly correct!

I had envisioned the utility of range requests for the situation where you have a server set up to serve static files (because this is very easy to do) and you wanted to push the responsibility of issuing a partial (or partitioned) read on to the client. If that's not what you're trying to do here, ignore me!

ianmcook commented 4 months ago

Ah, ok, I see what you mean now. You have a "dumb" HTTP server that is just serving files from a directory structure and doesn't know anything about Arrow. In that case, if the files are in the Arrow IPC file format and the server supports range requests, then the client can take advantage of the footers to download specific record batches, schemas, or other data blocks from the files.

This is perhaps a bit of an obscure case; we don't recommend using the IPC file format for archival storage so there are not many large collections of Arrow IPC files as far as I know. Parquet files are vastly more common. But it might be worth describing this in a section of the Arrow-over-HTTP conventions document.

paleolimbot commented 4 months ago

Got it! Perhaps what should be documented for this issue is why one would bother supporting range requests (I'm sure there's a good reason, but it's not clear to me what it is if it's not partial reads aligned on batches).

ianmcook commented 4 months ago

Resuming interrupted downloads is the main case. And for that I think it boils down to "just treat the IPC stream data as opaque bytes."

ianmcook commented 4 months ago

I think a cool thing to create for this example would be a Python client that can recover from a network disconnection and resume downloading (using a range request) after the network reconnects.
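
One possible shape for such a client, as a hedged sketch rather than a finished example: a retry loop that tracks how many bytes have arrived and asks for the rest with a Range header after each failure. The names `http_fetch` and `download_with_resume` are hypothetical. The sketch assumes the server reports the total size (Content-Length on a full response, Content-Range on a partial one) and that the resource does not change between attempts.

```python
import http.client
import time
import urllib.request


def http_fetch(url, timeout=10):
    """Return fetch(offset) -> (total_size, chunk iterator) for `url`."""
    def fetch(offset):
        headers = {"Range": f"bytes={offset}-"} if offset else {}
        req = urllib.request.Request(url, headers=headers)
        resp = urllib.request.urlopen(req, timeout=timeout)
        if resp.status == 206:
            # Content-Range looks like "bytes 100-1023/1024"; take the total.
            total = int(resp.headers["Content-Range"].rsplit("/", 1)[1])
        elif offset == 0:
            total = int(resp.headers["Content-Length"])
        else:
            resp.close()
            raise RuntimeError("server ignored the Range header")

        def chunks():
            with resp:
                while chunk := resp.read(64 * 1024):
                    yield chunk

        return total, chunks()
    return fetch


def download_with_resume(fetch, sink, max_retries=5, delay=1.0):
    """Write the full body to `sink`, re-requesting from the current offset after errors."""
    received, failures, total = 0, 0, None
    while total is None or received < total:
        try:
            total, chunks = fetch(received)
            for chunk in chunks:
                sink.write(chunk)
                received += len(chunk)
                failures = 0  # progress was made, so reset the failure count
            if received < total:
                # A dropped connection can look like a normal end of stream,
                # so the byte count is checked against the expected size.
                raise ConnectionError("stream ended early")
        except (OSError, http.client.HTTPException):
            failures += 1
            if failures > max_retries:
                raise
            time.sleep(delay)
    return received
```

Usage might look like `download_with_resume(http_fetch("http://localhost:8008"), open("file.arrows", "wb"))`. Because the bytes are written in order, the sink could also be wrapped so that an Arrow stream reader consumes record batches as they arrive; that part is not shown here.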

kou commented 4 months ago

I should have explained more. Sorry. @paleolimbot explained everything I wanted to say. :-) Thanks!

I agree that resuming downloads is enough for this case.