API sketch (a code sketch of the first three methods follows the list):

- `get_table_schema(resource_path: str)`: `pa.ipc.read_schema(...)`.
- `get_table_total_bytes(resource_path: str)`: `Path('file.arrow').stat().st_size` or similar.
- `get_table_total_records(resource_path: str)`: open the `.arrow` file and read the count from the header.
- `get_table(resource_path: str)`
- `put_table(resource_path: str)`
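A minimal sketch of how the first three methods might be implemented over `.arrow` (IPC) files, assuming pyarrow. The function names follow the sketch above; the bodies are illustrative rather than a final implementation (`ipc.open_file(...)` reads only the file's metadata until record batches are requested):

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.ipc as ipc


def get_table_schema(resource_path: str) -> pa.Schema:
    # Opening an IPC file reads only its metadata; the schema comes
    # from the header, and no record batches are loaded.
    with pa.memory_map(resource_path, "r") as source:
        return ipc.open_file(source).schema


def get_table_total_bytes(resource_path: str) -> int:
    # On-disk size of the backing file.
    return Path(resource_path).stat().st_size


def get_table_total_records(resource_path: str) -> int:
    # Memory-map the file so counting rows does not copy the data
    # into process memory.
    with pa.memory_map(resource_path, "r") as source:
        reader = ipc.open_file(source)
        return sum(
            reader.get_batch(i).num_rows
            for i in range(reader.num_record_batches)
        )
```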
Notes:
The first three methods can potentially be removed if we assume we always write Arrow IPC to disk, but down the line it may be preferable to store output tables in Parquet format. We use Arrow IPC for input files since they will be processed by N workers at once, and memory-mapping a single IPC file avoids decoding N copies into memory. Outputs, on the other hand, are written once, so Parquet is more efficient in terms of disk usage and bandwidth when returning results to the client (see the sketch below).
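To make the trade-off concrete, a small sketch assuming pyarrow; the function names are made up for illustration:

```python
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq


def read_input(path: str) -> pa.Table:
    # N workers can memory-map the same IPC file; the OS shares the
    # pages between them, so only one physical copy of the data exists.
    with pa.memory_map(path, "r") as source:
        return ipc.open_file(source).read_all()


def write_output(table: pa.Table, path: str) -> None:
    # Outputs are written once and read back over the wire, so the
    # compressed, columnar Parquet encoding saves disk and bandwidth.
    pq.write_table(table, path, compression="zstd")
```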
Currently, the repository/store API has distinct `*_input_*` and `*_output_*` methods. These could be eliminated by having designated `input/...` and `output/...` paths, as illustrated below.
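For illustration only (the resource names here are hypothetical), a single method could then serve both sides:

```python
# Hypothetical usage: one get_table() replaces the *_input_*/*_output_* variants.
inputs = repo.get_table("input/cells.arrow")        # formerly a *_input_* method
results = repo.get_table("output/results.parquet")  # formerly a *_output_* method
```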
Some further thought is needed on the transformation between logical resource paths and filesystem paths; one possible shape is sketched below.
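Purely as an assumption-laden sketch (the `resolve` helper and its root-escape guard are not part of the current API):

```python
from pathlib import Path


def resolve(root: Path, resource_path: str) -> Path:
    # Map a logical resource path like "input/cells.arrow" onto the
    # filesystem under the repository root, refusing paths that would
    # escape the root via "..".
    resolved = (root / resource_path).resolve()
    if not resolved.is_relative_to(root.resolve()):
        raise ValueError(f"resource path escapes repository root: {resource_path}")
    return resolved
```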
`Storage` abstraction implemented in b4e5d5b34d9b35c2d7ef433bf84259a0d76bbc12 and tested in c38c10e7eb2648035478f8dcbbfef814b6192132. Covers most of the above, and supports both `.parquet` and `.arrow` files.
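A minimal sketch of what dual-format reading could look like; dispatching on the file extension is an assumption here, not necessarily what the committed `Storage` does:

```python
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq


def read_table(path: str) -> pa.Table:
    # Assumed dispatch on extension: Parquet files go through the
    # Parquet reader, Arrow IPC files are memory-mapped.
    if path.endswith(".parquet"):
        return pq.read_table(path)
    if path.endswith(".arrow"):
        with pa.memory_map(path, "r") as source:
            return ipc.open_file(source).read_all()
    raise ValueError(f"unsupported table format: {path}")
```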
The goal is for `Repository` to abstract away any details of storage, dealing only in pure paths (which may be on disk, remote, etc.). This is important because an input table may not actually be stored on disk at all, e.g. if the client (such as Jupyter Lab) hands off a handle to shared memory. Therefore, the `Repository` works strictly at the abstraction level of Arrow arrays in, or mapped to, memory.
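As an illustration of that abstraction level (the helper name is invented), the same IPC reading code works whether the bytes live in a file or in a buffer a client handed over:

```python
import pyarrow as pa
import pyarrow.ipc as ipc


def open_table(source) -> pa.Table:
    # `source` may be a memory-mapped file, a pa.BufferReader over
    # shared memory, or any other Arrow input stream; the Repository
    # never needs to know which.
    return ipc.open_file(source).read_all()


# From disk:           open_table(pa.memory_map("input/cells.arrow", "r"))
# From a client buffer: open_table(pa.BufferReader(buf))
```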
(Re)define an API for `renkon.repo.Repo` suitable for use as a Flight backend. Main changes include adding support for fetching schemas and metadata without loading the whole dataset (necessary for returning descriptors).
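A hedged sketch of the Flight side, assuming `pyarrow.flight`: `get_flight_info` answers a descriptor using only the schema/size/record-count methods above, so the dataset itself is never loaded. The class and its wiring are illustrative:

```python
import pyarrow.flight as flight


class RepoFlightServer(flight.FlightServerBase):
    def __init__(self, repo, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self.repo = repo  # assumed to expose the methods sketched above

    def get_flight_info(self, context, descriptor):
        # Build the FlightInfo from metadata alone: the schema and row
        # count come from the IPC file header, and the byte count from
        # stat(), so no record batches are read here.
        path = descriptor.path[0].decode("utf-8")
        endpoint = flight.FlightEndpoint(path.encode("utf-8"), [])
        return flight.FlightInfo(
            self.repo.get_table_schema(path),
            descriptor,
            [endpoint],
            self.repo.get_table_total_records(path),
            self.repo.get_table_total_bytes(path),
        )
```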