d70-t / ipldstore

zarr backend store using IPLD datastructures

car fsspec reference file system output #2

Open thewtex opened 2 years ago

thewtex commented 2 years ago

@d70-t what do you think about outputting an fsspec ReferenceFileSystem representation:

https://fsspec.github.io/kerchunk/spec.html

for the resulting CAR files? The motivation is to store the CAR files locally or on AWS S3, for example, and access them in Python without needing an IPFS server or gateway.
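
For illustration, a kerchunk-style reference description is just a JSON-style mapping from zarr keys to inline data or (url, offset, length) triples; a hand-written sketch (file name, offsets, and sizes made up) could look like this in Python:

import fsspec

# made-up sketch: each zarr chunk key maps to a byte range inside the CAR file
# (a "templates" section can be used to avoid repeating the file name/URL)
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',           # small metadata can be inlined
        "t/0.0": ["image.zarr.car", 1024, 65536],  # [url, offset, length]
        "t/0.1": ["image.zarr.car", 66560, 65536],
    },
}

# fsspec's ReferenceFileSystem turns this mapping into an ordinary key/value view,
# so zarr/xarray can read it without any IPFS server or gateway
fs = fsspec.filesystem("reference", fo=refs)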

d70-t commented 2 years ago

That's a very neat idea!

Do you want to give it a try? -- Otherwise I might also give it a go, as I was playing around with building reference files for grib data a few days ago.

thewtex commented 2 years ago

I may get there, but I would not be surprised if you beat me to it :-)

d70-t commented 2 years ago

😬 ... there might be an issue though: the 2 MiB block size limit, which is necessary to protect users from being DoS-ed by bad peers on IPFS. TL;DR: you can only verify the correctness of a block after computing the hash of the entire thing. If someone sends you 100 GB, you'd have to accept all of it only to throw it away after recognizing that it's just garbage.

As a consequence, one might have to split zarr chunks into smaller blocks, using e.g. the Flexible Byte Layout (FBL), if the blocks are to be transferred via IPFS. But as far as I understand the ReferenceFileSystem spec, it's not possible to assemble an fsspec file from multiple parts of a (CAR) file, which would be required to translate FBL into refs.

As long as zarr chunks stay below the 1 MiB soft limit (or at most the 2 MiB hard limit), FBL wouldn't be required and a ReferenceFileSystem should be simple to build.

Probably this isn't too bad though, as #1 could provide a nice way of packing the small 1MiB chunks into larger objects again, which would be pretty close to zarr sharding zarr-developers/zarr-python#877.

thewtex commented 2 years ago

Yes, the block size limit is a constraint. For my use cases, chunks that are 256x256 or 64x64x64 in size would be OK. But yes, maybe we need sharding support in zarr before this can be used generally.
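
For a rough sense of scale (uncompressed sizes; the 4-byte dtype is an assumption):

import numpy as np

# uncompressed chunk sizes for the shapes mentioned above, assuming float32
for shape in [(256, 256), (64, 64, 64)]:
    nbytes = int(np.prod(shape)) * np.dtype("float32").itemsize
    print(shape, f"{nbytes / 2**20:.2f} MiB")
# (256, 256)   -> 0.25 MiB  (well below the 1 MiB soft limit)
# (64, 64, 64) -> 1.00 MiB  (right at the soft limit; compression usually brings it lower)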

joshmoore commented 2 years ago

A side note that I mentioned to @thewtex: I think the second prototype from @jstriebel (https://github.com/zarr-developers/zarr-python/pull/947) may be testable with this store, since it doesn't touch the store itself?!

d70-t commented 2 years ago

@joshmoore, I'd say yes and no 😬 ... Technically, it should work on this store, as zarr would pass on "shards" to this store instead of "chunks", but for the store it would look just the same.

However, problems would arise indirectly due to the increased size of shards compared to chunks, for at least two (related) reasons:

verifying correctness

This store builds up a Merkle-DAG from the pieces (shards / chunks) provided to the store. Usually (e.g. without any inlining or splitting of objects), each of these pieces would be hashed as one thing and put into the DAG. If the correctness of the data is to be verified, the entire piece has to be downloaded and the hash computed again. If a reader only wants to obtain a smaller "chunk" from a larger "shard", the reader can't verify the correctness of the chunk. The same problem would arise if any form of Merkle-Tree were used, as discussed in zarr-developers/zarr-python#392.
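
As a toy illustration of the problem (plain sha256 instead of CIDs):

import hashlib

shard = bytes(range(256)) * 4096                  # one piece, hashed as a whole
shard_digest = hashlib.sha256(shard).hexdigest()  # this is what ends up in the DAG

chunk = shard[4096:8192]                          # a reader fetches only this range
# the hash of the sub-range is unrelated to the shard's hash, so the chunk
# cannot be verified without downloading and hashing the entire shard
assert hashlib.sha256(chunk).hexdigest() != shard_digest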

obtaining data from untrusted sources

If data is received from untrusted sources (as is common in p2p networks), verification of the pieces is a lot more important. For this reason, p2p networks usually place a hard upper limit on the size of individually hashed blocks. In the case of bitswap (the IPFS network protocol), this is 2 MiB. So it would be possible to encode larger blocks using ipldstore, but you would not be able to pass them around afterwards.


what could we do?

The main issue here is that hashes must be computed on a tree of smaller pieces in order to be able to verify parts of the array. This looks a lot like the "compression problem": we also want to compress only small chunks of the array in order to be able to read smaller parts, and we then pack many of those compressed pieces into a larger shard so that we can store them together. If the shard itself were arranged as a "little Merkle Tree", we would be able to verify the correctness of individual chunks without having to download the entire shard.

option 1: hashing on chunk-level

This would require having the hashes / CIDs / checksums of each "chunk" listed within some toplevel object, which could be the shard index. The toplevel object would then have to be hashed itself, so that there is one single hash value for the entire shard which can be used to verify the correctness of all the chunk hashes. However, we probably don't want to include the offset positions of the chunks within the shard as part of the toplevel hash, because then the hash would change if the order of chunks within the shard were modified. Thus, a "verifiable shard" might conceptually look like:

#index section
root offset, root size, chunk 0 offset, chunk 0 size, chunk 1 offset, chunk 1 size...
#root section <- that's what makes up the toplevel hash
chunk 0 hash, chunk 1 hash, chunk 2 hash ... (in some deterministic order)
#chunk section
chunk 0 data, chunk 1 data, ...
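
A rough sketch of how the toplevel hash could be computed for such a layout (plain sha256 instead of real CIDs; purely illustrative):

import hashlib

chunks = {"0.0": b"chunk zero", "0.1": b"chunk one", "1.0": b"chunk two"}

# root section: only the chunk hashes, in a deterministic (key-sorted) order
chunk_hashes = [hashlib.sha256(chunks[key]).digest() for key in sorted(chunks)]
root_section = b"".join(chunk_hashes)
toplevel_hash = hashlib.sha256(root_section).hexdigest()

# the index section (offsets and sizes within the shard file) is not part of the
# hashed root, so chunk data can be rearranged in the file without changing
# toplevel_hash, while any modified chunk changes its hash and thus the root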

A possible (maybe not ideal) format for such a "verifiable shard" would be something like CARv2. It's a bit redundant for this particular use case, but maybe not too much...

option 2: hashing on arbitrary slices of the shard

Another option would be to slice the shards up again on arbitrary boundaries, build the same kind of Merkle-Tree as above on top of the slices, and store that in a similar format as the "verifiable shard" above. That's about what the Flexible Byte Layout does. An advantage would be that it works for arbitrary chunk sizes (even if they are larger than any limit of any transport mechanism). The disadvantage is that there would be quite a lot of duplicated work, and I'd guess that this "double chunking" turns out to be quite bad for performance.
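
Roughly, option 2 could look like this (a single-level tree over fixed-size slices; the real Flexible Byte Layout is more general):

import hashlib

SLICE = 1 << 20  # e.g. 1 MiB slices, chosen independently of zarr's chunking

def slice_hashes(shard: bytes):
    """Hash a shard of arbitrary size as fixed-size slices plus a root over the slice hashes."""
    slices = [shard[i:i + SLICE] for i in range(0, len(shard), SLICE)]
    leaves = [hashlib.sha256(s).hexdigest() for s in slices]
    root = hashlib.sha256("".join(leaves).encode()).hexdigest()
    return root, leaves

# each slice can be verified against its leaf hash and the leaf list against the
# root -- but this re-chunks and re-hashes data that zarr already chunked once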


I don't yet have a clear picture which "route to sharding" would be best to get there. But I can't really see a way around hashing on chunk-level, if we want any kind of checksumming. In the ipldstore-world, sharding might appear naturally if parts of the trees are exported as CAR (see #1). While this doesn't help directly for other stores, this whole story might be a little hint that implementing sharding on store-level (or as a store-transformation layer) could be a better choice than implementing it on the array level.

thewtex commented 2 years ago

I don't yet have a clear picture which "route to sharding" would be best to get there. But I can't really see a way around hashing on chunk-level, if we want any kind of checksumming. In the ipldstore-world, sharding might appear naturally if parts of the trees are exported as CAR (see #1). While this doesn't help directly for other stores, this whole story might be a little hint that implementing sharding on store-level (or as a store-transformation layer) could be a better choice than implementing it on the array level.

I completely agree with this, and I arrived at the same conclusion! I am going to take a stab at implementing a ShardedStore. :thought_balloon:

thewtex commented 2 years ago


Tried this brilliantness! Looks to be very close, but I get an error?

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [58], in <cell line: 1>()
----> 1 reference_fs = car2reference_fs(f'{image_name}.zarr.car')
      2 print(reference_fs)

File ~/bin/mambaforge/envs/spatial-image/lib/python3.9/site-packages/ipldstore/car_reference_fs.py:59, in car2reference_fs(filename)
     57 def car2reference_fs(filename: str) -> Dict[str, Any]:
     58     with open(filename, "rb") as stream:
---> 59         refs = car2reference_fs_refs(stream, "{{a}}")
     60     return {"version": 1, "templates": {"a": filename}, "refs": refs}

File ~/bin/mambaforge/envs/spatial-image/lib/python3.9/site-packages/ipldstore/car_reference_fs.py:32, in car2reference_fs_refs(stream_or_bytes, stream_name)
     29 def car2reference_fs_refs(stream_or_bytes: StreamLike, stream_name: str) -> Dict[str, Any]:
     30     root, cbor_objects, object_locations = collect_tree_objects(stream_or_bytes)
---> 32     tree = dag_cbor.decode(cbor_objects[root])
     33     assert isinstance(tree, dict)
     34     sep = "/"

KeyError: CID('base58btc', 1, 'dag-cbor', '1220c834b4567f834677ddbcf77d1858120536d76ea9474e8264419268f29acbebe5')

Reproducible locally with: https://github.com/thewtex/multiscale-spatial-image/blob/tifffile-zarr-car/examples/ConvertTiffFile.ipynb

d70-t commented 2 years ago

😬 that's unfortunate...

I couldn't exactly reproduce your notebook (to_multiscale didn't return scales other than 0); however, I was able to build the references. I figured that the KeyError is due to the version of the multiformats package: there was an issue with how CIDs are compared to each other, which was solved and merged in hashberg-io/multiformats#2, but this has not yet been released to PyPI...

thewtex commented 2 years ago

There had been an issue about how CIDs have been compared to each other which was solved in hashberg-io/multiformats#2

oi !

Also, it seems we need the IPFS implementations to catch up so they can cat dag-cbor :-(

d70-t commented 2 years ago

Also, it seems we need the IPFS implementations to catch up so they can cat dag-cbor :-(

yes, that's the big issue with the direct IPLD-approach... However, it's planned (ipfs/go-ipfs#8640) for the next release of go-ipfs to support raw blocks and CAR export from gateways. This would likely simplify the use of generic IPLD structures over gateways.

The other option probably would be to build additional unixfs trees on top of the IPLD raw leaves (#3). Most of the data would not have to be duplicated, but the data would still be available in both worlds... I'm really torn between "zarr on unixfs on IPLD" and "zarr on IPLD". The former seems to be more compatible, whereas the latter seems to be more elegant (and may be more useful also as a StorageTransformer for other storage backends).

Anyways, to go forward on this issue, you could either try installing multiformats from git, or we could write our own keys in the cbor_objects-dict, which would be a workaround for the mentioned bug.
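
The latter workaround could, roughly, mean keying the mapping by a stable representation of the CID instead of by the CID object itself (illustrative sketch; the helper name is made up):

from typing import Any, Dict

def stable_key(cid: Any) -> str:
    # assumption: the CID's canonical string form is a stable dictionary key,
    # which sidesteps the broken CID.__eq__ / __hash__ in the released package
    return str(cid)

cbor_objects: Dict[str, bytes] = {}
# writer side: cbor_objects[stable_key(cid)] = block_bytes
# reader side: tree = dag_cbor.decode(cbor_objects[stable_key(root)])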

thewtex commented 2 years ago

it's planned (ipfs/go-ipfs#8640) for the next release of go-ipfs to support raw blocks

Great! -- raw-blocks + fsspec on gateways could be a big performance bump!

try installing multiformats from git,

Worked! :tada: :taco:

thewtex commented 2 years ago

I am going to take a stab at implementing a ShardedStore.

https://github.com/thewtex/shardedstore :tada:

joshmoore commented 2 years ago

I feel like I should applaud but I'm a bit lost as to what for :)

Are you guys up for a pres., demo, or blog at some point?

cc: @MSanKeys963

d70-t commented 2 years ago

Hey @joshmoore thanks for getting back!

There's probably not too much to applaud for yet; it's still work in progress. But in principle, this ipldstore is able to pack a bunch of key-value pairs into a content-addressable archive (CAR). Using something like @thewtex's shardedstore, one could easily have one ipldstore per shard and thus quickly build one CAR per shard. Using CAR as the serialization format for shards would include checksums for all chunks and metadata, so correctness could be verified. And using the reference-fs / kerchunk implementation from #4, it would be possible to quickly index into individual chunks within each CAR.

Probably later on, one might want to use (indexed) CARv2 instead, with a CARv2-fs on top. Conceptually that's the same as using the reference-fs, except that the references would already be built in and wouldn't have to be carried around separately.

joshmoore commented 2 years ago

Hmmm..... so what's not working? :smile:

d70-t commented 2 years ago

Ok, so it's currently possible to create one large CAR of zarr-on-IPLD objects using ipldstore on a single machine. It's also possible to build a reference filesystem based on this CAR (when using the current HEAD of multiformats). So probably we should close this issue and, if you think it helps, make some post / presentation out of it. It would basically be:

  1. put some zarr into ipldstore and make a CAR out of it
  2. make a reference filesystem out of it
  3. open it using xr.open_zarr("reference::...")

On its own, it's more or less a complicated way of storing a zarr into a zip store (or any other single-file backend), but it helps in exploring :-)
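
For reference, steps 2 and 3 in code (the CAR file name is made up; step 1 is elided since the exact store-creation call depends on the ipldstore version, while car2reference_fs is the helper visible in the traceback above):

import fsspec
import xarray as xr
from ipldstore.car_reference_fs import car2reference_fs

# 1. (elided) write a zarr dataset through ipldstore and export it as image.zarr.car

# 2. build a kerchunk-style reference description from the CAR
refs = car2reference_fs("image.zarr.car")

# 3. open the references via fsspec's "reference" protocol
#    (equivalent to the xr.open_zarr("reference::...") chained-URL form above)
fs = fsspec.filesystem("reference", fo=refs)
ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)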


What's not yet implemented, but has been discussed a bit here (probably out of scope of this issue) would be:

observingClouds commented 2 years ago

Hi,

I've been working on a similar issue behind the scenes. I was searching for a way to

I basically combined all the great work you did and added a few extra lines to build car_referencer. So now you can do something like:

car_referencer -c "carfiles.*.car" -p preffs.parquet -r ROOT-HASH

import xarray as xr
ds = xr.open_zarr("preffs:preffs.parquet")

Maybe it is helpful for you as well. Feedback welcome.