API sketch (a code sketch of the first three methods follows the list):

- `get_table_schema(resource_path: str)`: `pa.ipc.read_schema(...)`.
- `get_table_total_bytes(resource_path: str)`: `Path('file.arrow').stat().st_size` or similar.
- `get_table_total_records(resource_path: str)`: open the `.arrow` file and read the count from the header.
- `get_table(resource_path: str)`
- `put_table(resource_path: str)`
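A minimal sketch of how the first three methods might be implemented over `.arrow` (IPC) files, assuming pyarrow. The function names follow the sketch above; the bodies are illustrative rather than a final implementation (`ipc.open_file(...)` reads only the file's metadata until record batches are requested):

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.ipc as ipc


def get_table_schema(resource_path: str) -> pa.Schema:
    # Opening an IPC file reads only its metadata; the schema comes
    # from the header, and no record batches are loaded.
    with pa.memory_map(resource_path, "r") as source:
        return ipc.open_file(source).schema


def get_table_total_bytes(resource_path: str) -> int:
    # On-disk size of the backing file.
    return Path(resource_path).stat().st_size


def get_table_total_records(resource_path: str) -> int:
    # Memory-map the file so counting rows does not copy the data
    # into process memory.
    with pa.memory_map(resource_path, "r") as source:
        reader = ipc.open_file(source)
        return sum(
            reader.get_batch(i).num_rows
            for i in range(reader.num_record_batches)
        )
```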
Notes:
The first three methods can potentially be removed if we assume we always write Arrow IPC to disk, but down the line it may be preferable to store output tables in Parquet format. We use Arrow IPC for input files since they will be processed by N workers at once, and memory-mapping a single IPC file avoids decoding N copies into memory. Outputs, on the other hand, are written once, so Parquet is more efficient in terms of disk usage and bandwidth when returning results to the client (see the sketch below).
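To make the trade-off concrete, a small sketch assuming pyarrow; the function names are made up for illustration:

```python
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq


def read_input(path: str) -> pa.Table:
    # N workers can memory-map the same IPC file; the OS shares the
    # pages between them, so only one physical copy of the data exists.
    with pa.memory_map(path, "r") as source:
        return ipc.open_file(source).read_all()


def write_output(table: pa.Table, path: str) -> None:
    # Outputs are written once and read back over the wire, so the
    # compressed, columnar Parquet encoding saves disk and bandwidth.
    pq.write_table(table, path, compression="zstd")
```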
Currently, the repository/store API has distinct `*_input_*` and `*_output_*` methods. These could be eliminated by having designated `input/...` and `output/...` paths, as illustrated below.
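For illustration only (the resource names here are hypothetical), a single method could then serve both sides:

```python
# Hypothetical usage: one get_table() replaces the *_input_*/*_output_* variants.
inputs = repo.get_table("input/cells.arrow")        # formerly a *_input_* method
results = repo.get_table("output/results.parquet")  # formerly a *_output_* method
```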
Some further thought is needed on the transformation between logical resource paths and filesystem paths; one possible shape is sketched below.
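Purely as an assumption-laden sketch (the `resolve` helper and its root-escape guard are not part of the current API):

```python
from pathlib import Path


def resolve(root: Path, resource_path: str) -> Path:
    # Map a logical resource path like "input/cells.arrow" onto the
    # filesystem under the repository root, refusing paths that would
    # escape the root via "..".
    resolved = (root / resource_path).resolve()
    if not resolved.is_relative_to(root.resolve()):
        raise ValueError(f"resource path escapes repository root: {resource_path}")
    return resolved
```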
`Storage` abstraction implemented in b4e5d5b34d9b35c2d7ef433bf84259a0d76bbc12 and tested in c38c10e7eb2648035478f8dcbbfef814b6192132. Covers most of the above, and supports both `.parquet` and `.arrow` files.
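A minimal sketch of what dual-format reading could look like; dispatching on the file extension is an assumption here, not necessarily what the committed `Storage` does:

```python
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq


def read_table(path: str) -> pa.Table:
    # Assumed dispatch on extension: Parquet files go through the
    # Parquet reader, Arrow IPC files are memory-mapped.
    if path.endswith(".parquet"):
        return pq.read_table(path)
    if path.endswith(".arrow"):
        with pa.memory_map(path, "r") as source:
            return ipc.open_file(source).read_all()
    raise ValueError(f"unsupported table format: {path}")
```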
The goal is for `Repository` to abstract away any details of storage, dealing only in pure paths (which may be on disk, remote, etc.). This is important because an input table may not actually be stored on disk at all, e.g. if the client (such as Jupyter Lab) hands off a handle to shared memory. Therefore, the `Repository` works strictly at the abstraction level of Arrow arrays in, or mapped to, memory.
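As an illustration of that abstraction level (the helper name is invented), the same IPC reading code works whether the bytes live in a file or in a buffer a client handed over:

```python
import pyarrow as pa
import pyarrow.ipc as ipc


def open_table(source) -> pa.Table:
    # `source` may be a memory-mapped file, a pa.BufferReader over
    # shared memory, or any other Arrow input stream; the Repository
    # never needs to know which.
    return ipc.open_file(source).read_all()


# From disk:           open_table(pa.memory_map("input/cells.arrow", "r"))
# From a client buffer: open_table(pa.BufferReader(buf))
```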
(Re)define an API for `renkon.repo.Repo` suitable for use as a Flight backend. Main changes include adding support for fetching schemas and metadata without loading the whole dataset (necessary for returning descriptors).
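A hedged sketch of the Flight side, assuming `pyarrow.flight`: `get_flight_info` answers a descriptor using only the schema/size/record-count methods above, so the dataset itself is never loaded. The class and its wiring are illustrative:

```python
import pyarrow.flight as flight


class RepoFlightServer(flight.FlightServerBase):
    def __init__(self, repo, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self.repo = repo  # assumed to expose the methods sketched above

    def get_flight_info(self, context, descriptor):
        # Build the FlightInfo from metadata alone: the schema and row
        # count come from the IPC file header, and the byte count from
        # stat(), so no record batches are read here.
        path = descriptor.path[0].decode("utf-8")
        endpoint = flight.FlightEndpoint(path.encode("utf-8"), [])
        return flight.FlightInfo(
            self.repo.get_table_schema(path),
            descriptor,
            [endpoint],
            self.repo.get_table_total_records(path),
            self.repo.get_table_total_bytes(path),
        )
```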