rsignell-usgs opened this issue 3 years ago (Open)
Note that the current implementation in fsspec allows for a fallback default URL for any entry where a URL is not specified, i.e., "[null, start, size]" (or None in Python). That is not in the spec as reflected in this repo.
@martindurant , I did not know that. How does one specify the fallback default URL?
The target= argument to ReferenceFileSystem (this would go under storage_options as part of an Intake data source).
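A minimal sketch of how that looks, using the adcirc file from the example further down (the fo-as-dict form and the anonymous-access option are assumptions here, not something from this thread):
import fsspec

# entries with a None/null URL fall back to `target`
refs = {
    "zeta/0.0": [None, 441750960, 1951430],
    "zeta/0.1": [None, 443702390, 2912202],
}

fs = fsspec.filesystem(
    "reference",
    fo=refs,
    target="s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc",  # fallback default URL
    remote_protocol="s3",
    remote_options={"anon": True},  # assumption: the bucket allows anonymous access
)
data = fs.cat("zeta/0.0")  # reads bytes 441750960..(+1951430) of the target file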
In our example metadata file (s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d_offsets.json) replacing the explicit URLs below with "$U" reduces the size by 50%.
"zeta/0.0": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 441750960, 1951430],
"zeta/0.1": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 443702390, 2912202],
"zeta/0.2": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 446614592, 2512331],
"zeta/0.3": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 449126923, 3092137],
"zeta/0.4": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 452219060, 5634126],
"zeta/0.5": ["s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc", 457853186, 6748304],
Indeed, we might not use JSON as the only storage format in the long run. A binary format in which we can store the columns separately would give very nice compression for the case of repeated values (in the case of parquet, there is a specific encoding for this, although it is rarely used and poorly supported). Note that the key names also have a lot of repetition.
All that being said, I think we always want to support JSON too, so it's still worthwhile to come up with a way to declare global variables for the purpose of brevity and compactness. I slightly prefer "{U}" to "$U", since "$" is more likely to be a genuine character in a path, and {..} represents formatting in both python and JS (as opposed to this thing called "bash"). It would also potentially allow for more complicated formatting, so long as it is generalisable beyond python.
@martindurant I agree with everything you say:
I'm curious if Arrow would be of interest / a suitable binary format. Mostly asking because I work on zarr-related things in the browser, where support for something like parquet is lacking.
I'm curious if Arrow would be of interest / a suitable binary format.
What do you mean? Arrow is not a storage format.
In the context of the usage case of this repo, zarr would be a nice option maybe (it does have a JS implementation).
I'm pretty sure @manzt means parquet. 😄
I don't actually!
What do you mean? Arrow is not a storage format.
Apologies, I should be more clear. It is my understanding that Arrow defines a "language-independent columnar memory format for flat and hierarchical data". In an ecosystem like python, this usually means you store your data in some compressed columnar format (like parquet) and then use pyarrow to serialize columns in-memory and use the arrow "format" as a transport layer. However, the Arrow columnar format is portable, and a pyarrow.Table can equally be mapped directly to disk or saved to an object store (and then memory mapped in all languages that have arrow support).
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})  # placeholder data
table = pa.Table.from_pandas(df)

with pa.RecordBatchFileWriter('dataset.arrow', table.schema) as writer:
    writer.write_table(table)  # writes the in-memory table to disk

# EDIT: as of Apache Arrow 0.17.0, the same data is written to disk using pyarrow.feather,
# since uncompressed Feather V2 is the Arrow representation on disk.
from pyarrow import feather
feather.write_feather(df, 'dataset.feather', compression='uncompressed')
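For completeness, reading the file back on the Python side might look like this (a sketch; the file name is taken from the snippet above):
import pyarrow as pa

# memory-map the IPC file written above and load it as a Table
with pa.memory_map('dataset.arrow', 'r') as source:
    table = pa.ipc.open_file(source).read_all()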
There is good support for reading arrow tables in the browser, whether dynamically from a server or as a static asset from an object store or web-server. If you look at the documentation for the JavaScript implementation of Arrow, you will see this exact use case: https://arrow.apache.org/docs/js/.
In JS we usually deal with our data in JSON, but Arrow is becoming increasingly popular because:
Additional reading:
TL;DR - Arrow outside of the Python ecosystem is used as a format and not just a transport layer. Its multi-language support makes it one of the most compelling binary targets for storing this type of metadata that is meant to be accessed from many clients. In addition, if you think about it, JSON is just a common transport layer and we should probably be looking for a binary equivalent. (What I was proposing was essentially an Arrow Table (probably compressed with something like gzip) for a binary reference format.)
I would refer to this as the arrow IPC representation, i.e., meant for on-the-wire data transfer, but I see what you mean. I don't know much about its performance characteristics - but "faster than JSON" isn't an amazing recommendation (JSON is for readability, not speed). The main draw would be to be able to get a tabular representation in memory, rather than objects. This sort of suits the current spec, but not that well...
Your actual comparisons in this space would be msgpack, thrift, avro, protobuf... which are all compact, binary, and support complex (but well-defined) structures like this. Indeed, they come with schema definitions directly, so our discussion on the spec could happen in code. Most of them are not normally considered for on-disk storage (except avro?), even though all have good multi-language support.
For the one example at "s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d_offsets.json", the JSON is 437,597 bytes and the equivalent msgpack 369,478, so not much difference (compresses to 68k, 60k with gzip, respectively) because of all of the repeated strings. msgpack loads significantly faster than json or ujson.
I did not try others, which would require me to craft a schema.
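Roughly, the comparison above was along these lines (a sketch, assuming a local copy of the offsets JSON and the msgpack package):
import gzip
import json

import msgpack

with open("adcirc_01d_offsets.json") as f:  # local copy of the file referenced above
    refs = json.load(f)

as_json = json.dumps(refs).encode()
as_msgpack = msgpack.packb(refs)

print(len(as_json), len(as_msgpack))                                  # raw sizes
print(len(gzip.compress(as_json)), len(gzip.compress(as_msgpack)))    # gzipped sizes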
Thanks for the clarification. I guess I'm fairly naive here, and speaking mostly from how I've observed the adoption of Arrow on the web for columnar data (which often feels like it's used as a storage format).
The main draw would be to be able to get a tabular representation in memory, rather than objects. This sort of suits the current spec, but not that well...
Agreed. I brought up Arrow because of the mention of parquet, and how the current spec could probably be expressed in columnar form.
Your actual comparisons in this space would be msgpack, thrift, avro, protobuf...
Fair. I primarily wanted to voice consideration for a binary target that has multi-language support. (Not to say it wouldn't be a consideration!) Context: I work on the js implementation of zarr.
To be sure, I have a bias against arrow, in that the installation in python is enormous, and in many zarr/xarray or other array-based workloads would only be doing this one job.
Context: I work on the js implementation of zarr.
So exactly, zarr could be the target, I suspect it's much smaller in code than arrow and/or parquet - but less well known, of course. You can also store arrow in zarr https://github.com/zarr-developers/zarr-python/issues/515
In that the installation in python is enormous, and in many zarr/xarray or other array-based workloads would only be doing this one job.
I think this is a really sensible concern and something I was not aware of.
Just for the sake of completeness of this thread, and because @kylebarron pointed it out to me, apparently Feather V2 storage format is the Arrow IPC representation serialized to disk: https://arrow.apache.org/docs/python/feather.html
"Version 2 (V2), the default version, which is exactly represented as the Arrow IPC file format on disk. V2 files support storing all Arrow data types as well as compression with LZ4 or ZSTD. V2 was first made available in Apache Arrow 0.17.0."
I think the concern with adding pyarrow as a dependency for array-based workloads makes total sense, but it's something to be aware of.
I think the potential for this ecosystem of fsspec / kerchunk / zarr is pretty great for handling large datasets in their original format. I have been using tiff_to_zarr on large datasets of thousands of tiff files, and the memory issue here is pretty significant. I have been circumventing it for now with a custom implementation of ReferenceFileSystem where the array chunk ranges are stored in memory in a pandas dataframe. This significantly reduces the memory footprint. It is sufficient for the POC I am doing, but a more thought-out solution would be welcome.
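Roughly, the idea looks like this (a simplified sketch; the column names, keys and paths are illustrative, not the actual code):
import pandas as pd

# keep the chunk references in a columnar table instead of a dict of lists
refs = pd.DataFrame(
    {
        "url": ["s3://bucket/image_0001.tif", "s3://bucket/image_0002.tif"],
        "offset": [8, 8],
        "size": [65536, 65536],
    },
    index=["x/0.0", "x/0.1"],  # zarr chunk keys
)

def reference_for(key):
    row = refs.loc[key]
    return row["url"], int(row["offset"]), int(row["size"])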
Some specific elements to consider:
ReferenceFileSystem accepts an optional argument target=, which will replace any URLs which are None. Also, you can pass templates, so that the URL might be "{{u}}" for many keys, and the templates dict contains {"u": "some://path"}. The latter mechanism is part of the spec: the templates should be stored in the input references set, but you can also pass template_overrides= if you need to have different ones at runtime.
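A sketch of what a references set using the templates mechanism might look like (the file and byte ranges are taken from the adcirc example earlier in the thread):
import fsspec

# version-1 reference set using a template for the repeated URL
refs = {
    "version": 1,
    "templates": {"u": "s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc"},
    "refs": {
        "zeta/0.0": ["{{u}}", 441750960, 1951430],
        "zeta/0.1": ["{{u}}", 443702390, 2912202],
    },
}

fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3")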
Finally, we have found that compressing the JSON with zstd does a really nice job of compacting the repeated strings on disk, so that the methods above become unnecessary.
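For instance (a sketch, assuming the zstandard package):
import json
import zstandard

payload = json.dumps(refs).encode()  # refs as in the examples above
compressed = zstandard.ZstdCompressor(level=9).compress(payload)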
I would be interested to know what your "tree" implementation is like.
In an upcoming version of kerchunk/referenceFS, we will be implementing parquet storage for references, which should help with both the on-disk and in-memory footprint, but might still have a lot of string duplication after load (this remains to be seen).
I woke up this morning wondering whether it would be possible to allow a variable to be defined in the reference file spec. I'm worried about long s3 (or other) URLs bloating the metadata.
If we could do something like:
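(an illustrative sketch based on the example file above; the exact syntax is the open question)
"$U": "s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d.nc",
"zeta/0.0": ["$U", 441750960, 1951430],
"zeta/0.1": ["$U", 443702390, 2912202],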
we could make the bloat a lot smaller