filecoin-project / dagstore

a sharded store to hold large IPLD graphs efficiently, packaged as location-transparent attachable CAR files, with mechanical sympathy

feat: register identity hash roots as existing in shards #138

Open rvagg opened 2 years ago

rvagg commented 2 years ago

This is only a proposal for discussion around https://github.com/filecoin-project/boost/pull/715. If we go this route, more testing will be needed.

Summary of the problem

Storage providers hold pieces whose CAR root is an identity-multihash CID that is not also stored as an indexable section within the CAR. Clients treat this CID as the PayloadCID, legitimately so. This happens as part of UnixFS creation; even lotus import does it: https://github.com/filecoin-project/lotus/blob/28722de72dce22c7ef41fd5442ec3fac0f524a9f/lib/unixfs/filestore.go#L37-L40
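
To make the shape of such a root concrete, here is a minimal stdlib-only sketch of an inlining CID builder. It is not the lotus code: `buildCid` and `inlineLimit` are hypothetical names, the 126-byte limit is an assumption for illustration, and sha2-256 stands in for whatever hash the real importer uses. The point is only that a small enough payload ends up carried inside the CID itself (multihash code `0x00`), so there is no separate block to index.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
)

const (
	codecRaw    = 0x55 // multicodec code for "raw"
	mhIdentity  = 0x00 // multihash code for "identity"
	mhSha256    = 0x12 // multihash code for "sha2-256" (stand-in hash for this sketch)
	inlineLimit = 126  // hypothetical inlining threshold, assumed for illustration
)

// appendUvarint appends v to dst as an unsigned varint.
func appendUvarint(dst []byte, v uint64) []byte {
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(buf[:], v)
	return append(dst, buf[:n]...)
}

// cidV1 encodes a binary CIDv1: <version><codec><mh code><mh length><digest>.
func cidV1(codec, mhCode uint64, digest []byte) []byte {
	out := appendUvarint(nil, 1)
	out = appendUvarint(out, codec)
	out = appendUvarint(out, mhCode)
	out = appendUvarint(out, uint64(len(digest)))
	return append(out, digest...)
}

// buildCid inlines small payloads as identity CIDs and hashes larger ones.
func buildCid(data []byte) []byte {
	if len(data) <= inlineLimit {
		// The "digest" is the payload itself: nothing is left to store as a block.
		return cidV1(codecRaw, mhIdentity, data)
	}
	sum := sha256.Sum256(data)
	return cidV1(codecRaw, mhSha256, sum[:])
}
```

With this sketch, `buildCid([]byte("hi"))` yields a CID whose trailing bytes are the payload itself, while a 200-byte payload yields an ordinary hashed CID.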

Then, when retrieving via this PayloadCID, we try to map it to a piece using the normal "which pieces contain this CID" functions afforded by the Dagstore. But because that CID isn't included in a CARv2 index, it's not found, the mapping fails and the retrieval is rejected.

Solutions re Dagstore

One possible solution (there are others being considered, see https://github.com/filecoin-project/boost/pull/715) is to make the Dagstore aware of these roots and get the lookup to successfully map an identity CID root to that payload. We could either:

  1. Add a new property to the inverted index that allows us to explicitly query for roots, which might be a useful feature in general - "which CARs have this CID as a root?"
  2. Include the identity CID in the index for the CAR, as if it were stored as a block, with no distinction.
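
The difference between the two options can be sketched with a toy in-memory inverted index. Everything here is hypothetical (`invertedIndex`, `addRoot`, `addRootAsBlock` and friends are illustration names, not the Dagstore API), and multihashes are modelled as plain strings.

```go
package main

// shardKey identifies a shard (a CAR / piece) in this sketch.
type shardKey string

// invertedIndex is a hypothetical stand-in for a multihash-to-shards index.
type invertedIndex struct {
	blocks map[string][]shardKey // multihash -> shards that index it as a block
	roots  map[string][]shardKey // multihash -> shards that have it as a root (option 1 only)
}

func newInvertedIndex() *invertedIndex {
	return &invertedIndex{
		blocks: make(map[string][]shardKey),
		roots:  make(map[string][]shardKey),
	}
}

// addShard registers the multihashes found in a CAR's index.
// An identity root that is not stored as a section never shows up here.
func (ix *invertedIndex) addShard(k shardKey, indexed []string) {
	for _, mh := range indexed {
		ix.blocks[mh] = append(ix.blocks[mh], k)
	}
}

// Option 1: keep roots in a separate property with its own query.
func (ix *invertedIndex) addRoot(k shardKey, root string) {
	ix.roots[root] = append(ix.roots[root], k)
}

// shardsWithRoot answers "which CARs have this CID as a root?".
func (ix *invertedIndex) shardsWithRoot(root string) []shardKey {
	return ix.roots[root]
}

// Option 2: register the identity root as if it were an ordinary block.
func (ix *invertedIndex) addRootAsBlock(k shardKey, root string) {
	ix.blocks[root] = append(ix.blocks[root], k)
}

// shardsContaining answers "which shards have this CID?".
func (ix *invertedIndex) shardsContaining(mh string) []shardKey {
	return ix.blocks[mh]
}
```

In this sketch, option 1 needs callers to know about the new root query, whereas option 2 leaves the existing "which shards have this CID" path untouched and simply makes it succeed for identity roots.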

This PR does option 2. This works because CARv2's blockstore interface returns identity CID bodies without looking them up, regardless of whether they are in the blockstore (arguably the right behaviour for any blockstore, though maybe not if you want strict "only if you have it" semantics): https://github.com/ipld/go-car/blob/1478bbd911efbe3735f3f2e909353c90137a8837/v2/blockstore/readonly.go#L271-L280. Then, when asked "which shards have this CID", the Dagstore returns the right answer for root identity CIDs, and fetching them should also work. So it's not even necessarily a hack: the CAR does have that identity CID, and the blockstore will return it when asked for it. We just lack a bit of explicit information about it being the PayloadCID.
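
For reference, the identity short-circuit described above can be sketched as follows. This is a stdlib-only illustration, not the go-car implementation: `readOnlyStore`, `identityDigest` and `errNotFound` are hypothetical names, and CIDs are handled as raw binary CIDv1 bytes rather than a real CID type.

```go
package main

import (
	"encoding/binary"
	"errors"
)

var errNotFound = errors.New("block not found")

// identityDigest parses binary CIDv1 bytes and returns the inline payload
// when the multihash code is identity (0x00).
func identityDigest(cid []byte) ([]byte, bool) {
	version, n := binary.Uvarint(cid)
	if n <= 0 || version != 1 {
		return nil, false
	}
	rest := cid[n:]
	_, n = binary.Uvarint(rest) // codec, not needed for this check
	if n <= 0 {
		return nil, false
	}
	rest = rest[n:]
	code, n := binary.Uvarint(rest)
	if n <= 0 || code != 0x00 {
		return nil, false
	}
	rest = rest[n:]
	size, n := binary.Uvarint(rest)
	if n <= 0 {
		return nil, false
	}
	rest = rest[n:]
	if uint64(len(rest)) != size {
		return nil, false
	}
	return rest, true
}

// readOnlyStore is a hypothetical blockstore keyed by raw CID bytes.
type readOnlyStore struct {
	blocks map[string][]byte
}

// Get returns identity CID bodies straight from the CID, without consulting
// the stored blocks; every other CID must actually be present.
func (s *readOnlyStore) Get(cid []byte) ([]byte, error) {
	if digest, ok := identityDigest(cid); ok {
		return digest, nil
	}
	if data, ok := s.blocks[string(cid)]; ok {
		return data, nil
	}
	return nil, errNotFound
}
```

Under these assumptions, a `Get` for an identity root succeeds even on an empty store, which is why registering the root in the index is enough for the retrieval path to work end to end.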