rsignell-usgs opened this issue 3 years ago (Open)
Note that the current implementation in fsspec allows for a fallback default URL for any entry where a URL is not specified, i.e., "[null, start, size]" (or None in Python). That is not in the spec as reflected in this repo.
@martindurant , I did not know that. How does one specify the fallback default URL?
The target= argument to ReferenceFileSystem (this would go under storage_options as part of an Intake data source).
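A minimal sketch of how that looks, using the adcirc file from the example further down (the fo-as-dict form and the anonymous-access option are assumptions here, not something from this thread):
import fsspec

# entries with a None/null URL fall back to `target`
refs = {
    "zeta/0.0": [None, 441750960, 1951430],
    "zeta/0.1": [None, 443702390, 2912202],
}

fs = fsspec.filesystem(
    "reference",
    fo=refs,
    target="s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc",  # fallback default URL
    remote_protocol="s3",
    remote_options={"anon": True},  # assumption: the bucket allows anonymous access
)
data = fs.cat("zeta/0.0")  # reads bytes 441750960..(+1951430) of the target file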
In our example metadata file (s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d_offsets.json) replacing the explicit URLs below with "$U" reduces the size by 50%.
"zeta/0.0": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 441750960, 1951430],
"zeta/0.1": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 443702390, 2912202],
"zeta/0.2": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 446614592, 2512331],
"zeta/0.3": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 449126923, 3092137],
"zeta/0.4": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 452219060, 5634126],
"zeta/0.5": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 457853186, 6748304],
Indeed, we might not use JSON as the only storage format in the long run. A binary format in which we can store the columns separately would give very nice compression for the case of repeated values (in the case of parquet, there is a specific encoding for this, although it is rarely used and poorly supported). Note that the key names also have a lot of repetition.
All that being said, I think we always want to support JSON too, so it's still worthwhile to come up with a way to declare global variables for the purpose of brevity and compactness. I slightly prefer "{U}" to "$U", since "$" is more likely to be a genuine character in a path, and {..} represents formatting in both python and JS (as opposed to this thing called "bash"). It would also potentially allow for more complicated formatting, so long as it is generalisable beyond python.
@martindurant I agree with everything you say:
I'm curious if Arrow would be of interest / a suitable binary format. Mostly asking because I work on zarr-related things in the browser, where support for something like parquet is lacking.
I'm curious if Arrow would be of interest / a suitable binary format.
What do you mean? Arrow is not a storage format.
In the context of the usage case of this repo, zarr would be a nice option maybe (it does have a JS implementation).
I'm pretty sure @manzt means parquet. 😄
I don't actually!
What do you mean? Arrow is not a storage format.
Apologies, I should be more clear. It is my understanding that Arrow defines a "language-independent columnar memory format for flat and hierarchical data". In an ecosystem like python, this usually means you store your data in some compressed columnar format (like parquet) and then use pyarrow to serialize columns in-memory and use the arrow "format" as a transport layer. However, the Arrow columnar format is portable, and a pyarrow.Table can equally be mapped directly to disk or saved to an object store (and then memory mapped in all languages that have arrow support).
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})  # placeholder data
table = pa.Table.from_pandas(df)

with pa.RecordBatchFileWriter('dataset.arrow', table.schema) as writer:
    writer.write_table(table)  # writes the in-memory table to disk

# EDIT: as of Apache Arrow 0.17.0, the same data is written to disk using pyarrow.feather,
# since uncompressed Feather V2 is the Arrow representation on disk.
from pyarrow import feather
feather.write_feather(df, 'dataset.feather', compression='uncompressed')
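For completeness, reading the file back on the Python side might look like this (a sketch; the file name is taken from the snippet above):
import pyarrow as pa

# memory-map the IPC file written above and load it as a Table
with pa.memory_map('dataset.arrow', 'r') as source:
    table = pa.ipc.open_file(source).read_all()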
There is good support for reading arrow tables in the browser, whether dynamically from a server or as a static asset from an object store or web-server. If you look at the documentation for the JavaScript implementation of Arrow, you will see this exact use case: https://arrow.apache.org/docs/js/.
In JS we usually deal with our data in JSON, but Arrow is becoming increasingly popular because:
Additional reading:
TL;DR - Arrow outside of the Python ecosystem is used as a format and not just a transport layer. Its multi-language support makes it one of the most compelling binary targets for storing this type of metadata that is meant to be accessed from many clients. In addition, if you think about it, JSON is just a common transport layer and we should probably be looking for a binary equivalent. (What I was proposing was essentially an Arrow Table (probably compressed with something like gzip) for a binary reference format.)
I would refer to this as the arrow IPC representation, i.e., meant for on-the-wire data transfer, but I see what you mean. I don't know much about its performance characteristics - but "faster than JSON" isn't an amazing recommendation (JSON is for readability, not speed). The main draw would be to be able to get a tabular representation in memory, rather than objects. This sort of suits the current spec, but not that well...
Your actual comparisons in this space would be msgpack, thrift, avro, protobuf... which are all compact, binary, and support complex (but well-defined) structures like this. Indeed, they come with schema definitions directly, so our discussion on the spec could happen in code. Most of them are not normally considered for on-disk storage (except avro?), even though all have good multi-language support.
For the one example at "s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d_offsets.json", the JSON is 437,597 bytes and the equivalent msgpack 369,478, so not much difference (compresses to 68k, 60k with gzip, respectively) because of all of the repeated strings. msgpack loads significantly faster than json or ujson.
I did not try others, which would require me to craft a schema.
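Roughly, the comparison above was along these lines (a sketch, assuming a local copy of the offsets JSON and the msgpack package):
import gzip
import json

import msgpack

with open("adcirc_01d_offsets.json") as f:  # local copy of the file referenced above
    refs = json.load(f)

as_json = json.dumps(refs).encode()
as_msgpack = msgpack.packb(refs)

print(len(as_json), len(as_msgpack))                                  # raw sizes
print(len(gzip.compress(as_json)), len(gzip.compress(as_msgpack)))    # gzipped sizes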
Thanks for the clarification. I guess I'm fairly naive here, and speaking mostly from how I've observed the adoption of Arrow on the web for columnar data (which often feels like it's used as a storage format).
The main draw would be to be able to get a tabular representation in memory, rather than objects. This sort of suits the current spec, but not that well...
Agreed. I brought up Arrow because of the mention of parquet, and how the current spec could probably be expressed in columnar form.
Your actual comparisons in this space would be msgpack, thrift, avro, protobuf...
Fair. I primarily wanted to voice consideration for a binary target that has multi-language support. (Not to say it wouldn't be a consideration!) Context: I work on the js implementation of zarr.
To be sure, I have a bias against arrow, in that the installation in python is enormous, and in many zarr/xarray or other array-based workloads would only be doing this one job.
Context: I work on the js implementation of zarr.
So exactly, zarr could be the target, I suspect it's much smaller in code than arrow and/or parquet - but less well known, of course. You can also store arrow in zarr https://github.com/zarr-developers/zarr-python/issues/515
In that the installation in python is enormous, and in many zarr/xarray or other array-based workloads would only be doing this one job.
I think this is a really sensible concern and something I was not aware of.
Just for the sake of completeness of this thread, and because @kylebarron pointed it out to me, apparently Feather V2 storage format is the Arrow IPC representation serialized to disk: https://arrow.apache.org/docs/python/feather.html
"Version 2 (V2), the default version, which is exactly represented as the Arrow IPC file format on disk. V2 files support storing all Arrow data types as well as compression with LZ4 or ZSTD. V2 was first made available in Apache Arrow 0.17.0."
I think the concern with adding pyarrow as a dependency for array-based workloads makes total sense, but it's something to be aware of.
I think the potential for this ecosystem of fsspec / kerchunk / zarr is pretty great for handling large datasets in their original format. I have been using tiff_to_zarr on large datasets of thousands of tiff files, and the memory issue here is pretty significant. I have been circumventing it for now with a custom implementation of ReferenceFileSystem where the array chunk ranges are stored in memory in a pandas dataframe. This significantly reduces the memory footprint. It is sufficient for the POC I am doing, but a more thought-out solution would be welcome.
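Roughly, the idea looks like this (a simplified sketch; the column names, keys and paths are illustrative, not the actual code):
import pandas as pd

# keep the chunk references in a columnar table instead of a dict of lists
refs = pd.DataFrame(
    {
        "url": ["s3://bucket/image_0001.tif", "s3://bucket/image_0002.tif"],
        "offset": [8, 8],
        "size": [65536, 65536],
    },
    index=["x/0.0", "x/0.1"],  # zarr chunk keys
)

def reference_for(key):
    row = refs.loc[key]
    return row["url"], int(row["offset"]), int(row["size"])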
Some specific elements to consider:
ReferenceFileSystem accepts an optional argument target=, which will replace any URLs which are None. Also, you can pass templates, so that the URL might be "{{u}}" for many keys, and the templates dict contains {"u": "some://path"}. The latter mechanism is part of the spec: the templates should be stored in the input references set, but you can also pass template_overrides= if you need to have different ones at runtime.
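A sketch of what a references set using the templates mechanism might look like (the file and byte ranges are taken from the adcirc example earlier in the thread):
import fsspec

# version-1 reference set using a template for the repeated URL
refs = {
    "version": 1,
    "templates": {"u": "s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc"},
    "refs": {
        "zeta/0.0": ["{{u}}", 441750960, 1951430],
        "zeta/0.1": ["{{u}}", 443702390, 2912202],
    },
}

fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3")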
Finally, we have found that compressing the JSON with zstd does a really nice job of compacting the repeated strings on disk, so that the methods above become unnecessary.
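For instance (a sketch, assuming the zstandard package):
import json
import zstandard

payload = json.dumps(refs).encode()  # refs as in the examples above
compressed = zstandard.ZstdCompressor(level=9).compress(payload)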
I would be interested to know what your "tree" implementation is like.
In an upcoming version of kerchunk/referenceFS, we will be implementing parquet storage for references, which should help with both the on-disk and in-memory footprint, but might still have a lot of string duplication after load (this remains to be seen).
I woke up this morning wondering whether it would be possible to allow a variable to be defined in the reference file spec. I'm worried about long s3 (or other) URLs bloating the metadata.
If we could do something like:
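(an illustrative sketch based on the example file above; the exact syntax is the open question)
"$U": "s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc",
"zeta/0.0": ["$U", 441750960, 1951430],
"zeta/0.1": ["$U", 443702390, 2912202],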
we could make the bloat a lot smaller