d70-t / ipldstore

zarr backend store using IPLD datastructures
MIT License

car sharding #1

Open · thewtex opened this issue 2 years ago

thewtex commented 2 years ago

@d70-t awesome work!!

What do you think about CAR sharding? That is, for a large dataset, the arrays in the dataset would be split into separate CAR files, and large arrays might be split further into multiple CAR files, as carbites does.

The motivation is to support uploads to services like web3.storage and estuary.tech for datasets larger than 32 GB.
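A rough sketch of what such splitting could look like on the store side, assuming we only know each zarr chunk key and its encoded size; the 32 GiB cap, the name `plan_car_shards`, and the greedy grouping are all illustrative assumptions, and actually writing the CAR files is left to whatever transport is used:

```python
# Hypothetical sketch: greedy grouping of zarr chunk keys into CAR shards
# bounded by a target size (e.g. a ~32 GiB per-upload limit).
from typing import Dict, List

MAX_SHARD_BYTES = 32 * 2**30  # assumed per-upload limit

def plan_car_shards(chunk_sizes: Dict[str, int],
                    max_bytes: int = MAX_SHARD_BYTES) -> List[List[str]]:
    """Group zarr keys into shards whose summed size stays below max_bytes."""
    shards: List[List[str]] = [[]]
    used = 0
    # keep keys in lexical order so neighbouring chunks land in the same CAR
    for key, size in sorted(chunk_sizes.items()):
        if used + size > max_bytes and shards[-1]:
            shards.append([])
            used = 0
        shards[-1].append(key)
        used += size
    return shards
```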

d70-t commented 2 years ago

Yes, CAR sharding should be there! I'm just wondering whether it should be part of the ipldstore or kept separate.

We could use CAR sharding for packing small chunks into larger objects for an object store (e.g. S3), and we could also pack even more CARs into a larger CAR to send them off to some "high(er) latency storage" like tape, Filecoin, etc.

I'd expect this to work particularly well if the CARs contain the groups one would normally use for zarr sharding. Depending on how the zarr keys are built (Morton code?), this may or may not correspond to the carbites Treewalk method.

CARv2 (i.e. CAR + index) would probably even make a nice structure for zarr shards.
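To make that more concrete, here is a small sketch of the grouping this would imply, under the assumption that zarr keys look like "temp/0.3.7" and that a shard covers a fixed number of chunks per dimension; each resulting group would become one CAR, and a CARv2 index over it would play the role of the shard index:

```python
# Illustrative grouping of chunk keys by the zarr shard they would belong to;
# key layout and shard shape are assumptions.
from collections import defaultdict
from typing import Dict, List, Tuple

def shard_id(key: str, chunks_per_shard: Tuple[int, ...]):
    """Map a chunk key like "temp/0.3.7" to (variable, shard index)."""
    var, _, chunk = key.rpartition("/")
    idx = tuple(int(i) for i in chunk.split("."))
    return var, tuple(i // c for i, c in zip(idx, chunks_per_shard))

def group_keys_by_shard(keys, chunks_per_shard):
    groups: Dict[Tuple, List[str]] = defaultdict(list)
    for key in keys:
        groups[shard_id(key, chunks_per_shard)].append(key)
    return groups  # one CAR (with a CARv2 index) would be written per group
```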

thewtex commented 2 years ago

Yes, I do not know either whether it should be part of the ipldstore or something separate.

I am thinking about the use case of building a large zarr dataset on a cluster.

CARv2 does look quite nice and helpful!

d70-t commented 2 years ago

Distributed writes should definitely be possible. My guess would be that we want one ipldstore per worker / thread etc. As a result, every worker would only see what it has written itself (or what was preloaded into its ipldstore beforehand). I believe this should be ok for most use cases, but I don't know for sure yet.

The resulting IPLD objects would then be dumped somewhere by whatever transport method suits best. One option would be to generate CARs locally and send them off.
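A worker-side sketch of that idea, treating the per-worker ipldstore as the usual MutableMapping zarr store; the constructor, the `to_car()` dump, and the chunk payloads below are assumptions, only `freeze()` is taken from the discussion:

```python
# Hypothetical worker-side flow: each worker fills its own ipldstore with the
# chunks it produced, dumps them to a local CAR, and only returns the root CID.
from ipldstore import IPLDStore  # assumed import path / class name

def write_partition(chunks: dict, car_path: str):
    """chunks maps zarr keys (e.g. "temp/0.3") to encoded chunk bytes."""
    store = IPLDStore()                  # private store for this worker
    for key, value in chunks.items():
        store[key] = value               # zarr store == MutableMapping
    with open(car_path, "wb") as f:
        store.to_car(f)                  # assumed: serialize all blocks as a CAR
    return store.freeze()                # root CID, collected by the merge step
```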

The roots (i.e. what freeze() returns) of each ipldstore would have to be collected and sent to some function which merges the individual trees into one. This function would only operate on the higher-level blocks and the CIDs of the leaves, so it would need far fewer resources and could probably run on a single node. However, it would also be possible to scatter it across multiple nodes (e.g. by doing one job per variable). The idea of a CRDT may come in handy at this point. It may also be helpful (e.g. to bypass the time required for sending big CARs to storage) to create additional partial CARs on each worker which contain only those pieces the merging function might need, and to collect those together with the roots.
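A coordinator-side sketch of that merge, under the assumption that each worker also ships a small table mapping zarr keys to the CIDs of its leaf blocks; merging then never touches chunk payloads, and the conflict rule below (fail on disagreement) is just the simplest placeholder for a proper CRDT:

```python
# Merge per-worker {zarr key -> leaf CID} tables into one mapping.
from typing import Dict, List

def merge_key_tables(tables: List[Dict[str, str]]) -> Dict[str, str]:
    merged: Dict[str, str] = {}
    for table in tables:
        for key, cid in table.items():
            if key in merged and merged[key] != cid:
                # two workers wrote the same chunk differently; a real merge
                # might resolve this with a CRDT rule instead of failing
                raise ValueError(f"conflicting writes for {key}")
            merged[key] = cid
    return merged

# The merged table is what a final ipldstore (or one job per variable) would
# encode into the top-level IPLD tree whose root CID describes the whole dataset.
```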

d70-t commented 2 years ago

Another question is whether some reorganization of the "big" CARs would be needed in case the set of IPLD objects created by one worker doesn't match the sharding one would like to have on read. But maybe that should be regarded as an implementation detail of the system which ends up hosting the CARs / IPLD objects.