How to define whether a CID is retrievable?

bajtos commented 1 year ago

We had a discussion with @rvagg where he pointed out that there are different ways to consider a CID retrievable.

The simplest (but also least useful?) option is to fetch only the root block of the CID.
For end users fetching data from IPFS (e.g. via Saturn), it's most helpful to know that the entire DAG rooted in CID is retrievable. That may be expensive to verify, though, because the DAG can be spread across multiple storage providers.
For building the Reputation score of Storage Providers, we want to check that they are honouring the storage deals. In other words, we want to check that all blocks included in the deal are retrievable. These blocks do not necessarily have to form a single DAG.
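To make the three definitions above concrete, here is a minimal sketch over a toy in-memory model. Everything in it is an assumption made for illustration: the `Dag` and `Fetch` types, the function names, and the idea that the DAG layout is known up front (a real checker would have to decode each fetched block to discover its links). It is not SPARK or lassie code.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical model: a DAG maps each CID (a plain string here) to its child CIDs.
Dag = Dict[str, List[str]]
# Hypothetical fetch function: True if the provider served the block for this CID.
Fetch = Callable[[str], bool]


def root_block_retrievable(root: str, fetch: Fetch) -> bool:
    """Option 1: only the root block has to be served."""
    return fetch(root)


def full_dag_retrievable(root: str, dag: Dag, fetch: Fetch) -> bool:
    """Option 2: every block reachable from the root has to be served."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        if not fetch(cid):
            return False
        stack.extend(dag.get(cid, []))
    return True


def deal_blocks_retrievable(deal_cids: Iterable[str], fetch: Fetch) -> bool:
    """Option 3: every block in the deal has to be served, DAG or not."""
    return all(fetch(cid) for cid in deal_cids)


# Toy data: the provider stores only part of the DAG, matching its deal.
dag: Dag = {"root": ["a", "b"], "a": [], "b": ["c"], "c": []}
stored = {"root", "a"}
deal = {"root", "a"}


def fetch_from_provider(cid: str) -> bool:
    return cid in stored


print(root_block_retrievable("root", fetch_from_provider))
print(full_dag_retrievable("root", dag, fetch_from_provider))
print(deal_blocks_retrievable(deal, fetch_from_provider))
```

In this toy case the same provider passes the root-block and deal checks but fails the full-DAG check, which is exactly the disagreement between the three definitions.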

We should explore this topic, discuss it with other groups interested in SPARK (@willscott & Bedrock, Reputation WG), and decide which approach SPARK should implement to determine whether a CID can be retrieved.

bajtos commented 1 year ago

Cross-posting @rvagg's comment from Slack (with his permission):

Re fetchability of CIDs - there’s an interesting question here about what it means for the CID to be fetchable from a Filecoin SP. Does it mean that just the block for that CID is fetchable? Or does it mean the entire DAG under that block is fetchable?

If you’re testing based on what the indexer claims then just a per-block fetch would be reasonable, ask for one block, get it, :white_check_mark:

If you’re testing based on chain or some other deal data, that the SP has provably made a deal with a particular root CID and you’re testing whether you can retrieve it then it’s a bit more nuanced than this:

If you can fetch just that single block then maybe that’s enough to prove that they’re at least taking queries and answering them, great.
If you want the entire DAG, maybe they don’t have the entire DAG, maybe they made a deal with just a small portion of a DAG and asking to retrieve the entire thing might be too much. The way lassie currently works, if you ask for a CID and a scope of “all” (the default), then it’s going to try and exhaustively get the entire DAG from somewhere, and if it talks to an SP that claims to have that CID and we do a graphsync, or even bitswap or HTTP just with that SP and it turns out they don’t have the entire thing, because maybe the deal they made was just for a portion of it, then lassie is going to claim a failure because the user asked for “all” but we didn’t get it from there. In default mode, without asking for specific providers, you’ll probably end up working your way through the list of candidates from the indexer, and probably falling back to bitswap to talk to lots of providers to piece it together.

Wikipedia is always my favourite example when talking about this stuff - it ends up being over 300G worth of IPLD blocks. If you ask lassie to fetch bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze then it’s going to work its hardest to find all those 300G worth of blocks before giving you an error that it timed out trying to find them.

Filecoin deals are a maximum of ~32G (or 64G for some SPs), so you can’t fit all of Wikipedia into a single deal. But you could spread it across many deals. But what are the chances that an SP has all of those deals? Quite likely anyone storing that much data is using many SPs to spread their data around, perhaps they’ve given every SP they’re using the entirety of their data and they’re just duplicating, or perhaps they bundled it into a heap of CARs and shunted it off to web3.storage, Estuary, or Spade or something else to deal with the dealmaking and SP selection and they’re scattered to the wind.

An SP that claims to have stored a deal with bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze at the root may have ~32G of the start of the wikipedia DAG but they don’t have the rest. You might be able to retrieve ~32G from that SP with lassie starting at that root, but if you limit it to a single SP and they don’t have deals for all of the wikipedia pieces, then it’s going to fail. What does that mean?
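To put rough numbers on the Wikipedia example, the sketch below works through the deal arithmetic. It assumes a DAG of exactly 300G (the comment says "over 300G", so the results are lower bounds) and ignores any per-deal packing overhead. The function names are made up for illustration.

```python
import math

WIKIPEDIA_DAG_G = 300      # assumed size; the comment above says "over 300G"
DEAL_LIMITS_G = (32, 64)   # approximate deal size limits mentioned above


def min_deals(dag_size_g: float, deal_limit_g: float) -> int:
    """Smallest number of deals needed to hold the whole DAG."""
    return math.ceil(dag_size_g / deal_limit_g)


def max_share_per_deal(dag_size_g: float, deal_limit_g: float) -> float:
    """Largest fraction of the DAG that a single deal can cover."""
    return min(1.0, deal_limit_g / dag_size_g)


for limit in DEAL_LIMITS_G:
    deals = min_deals(WIKIPEDIA_DAG_G, limit)
    share = max_share_per_deal(WIKIPEDIA_DAG_G, limit)
    print(f"{limit}G deals: at least {deals} deals, one deal covers at most {share:.1%}")
```

Under these assumptions, an SP holding a single 32G deal rooted at that CID can serve at most about a tenth of the DAG, so a fetch with scope "all" restricted to that SP cannot succeed.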

juliangruber commented 1 year ago

> For end users fetching data from IPFS (e.g. via Saturn), it's most helpful to know that the entire DAG rooted in CID is retrievable. That may be expensive to verify, though, because the DAG can be spread across multiple storage providers.

We could also split up a CID (say 2M blocks) into 1k samples, where over the period of a week 1k Stations each try retrieving a small number of blocks (assuming we can seek to arbitrary offsets), and we get a probabilistic answer that is good enough.
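A rough sketch of why sampling could be "good enough", under simplifying assumptions that are not part of the proposal: each sample checks one block chosen uniformly at random, checks are independent, and blocks are addressable by index. The function names and the `is_retrievable` callback are hypothetical.

```python
import random
from typing import Callable


def detection_probability(missing_fraction: float, samples: int) -> float:
    """Chance that at least one of `samples` independent checks hits a missing block."""
    return 1.0 - (1.0 - missing_fraction) ** samples


def estimate_retrievable_fraction(
    total_blocks: int,
    samples: int,
    is_retrievable: Callable[[int], bool],
    rng: random.Random,
) -> float:
    """Estimate the share of retrievable blocks from a uniform sample of block indexes."""
    picked = rng.sample(range(total_blocks), samples)
    hits = sum(1 for index in picked if is_retrievable(index))
    return hits / samples


# If 1% of blocks are missing, 1k single-block samples almost surely notice.
print(detection_probability(0.01, 1000))

# Toy provider that only serves the first half of a 2M-block DAG.
rng = random.Random(42)
estimate = estimate_retrievable_fraction(2_000_000, 1000, lambda i: i < 1_000_000, rng)
print(estimate)
```

The flip side is that sampling cannot prove full retrievability: if only a handful of the 2M blocks are missing, 1k samples will most likely miss them all, so the result is an estimate of the retrievable fraction rather than a yes/no answer.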

rvagg commented 1 year ago

The big missing piece from all of this - a TODO item from early Filecoin that never got solved - was being able to record an intent descriptor of some kind with stored data. We went live with: root CID + everything under it; which works for a lot of data, but not all, especially not big data because you blow the 32G maximum for deals. That world was a bit of a graphsync world, where parties agree on a root + a selector descriptor of how to get all the data, it's just that we were using * as the selector almost all of the time and never evolved beyond that. To deal with the limitations of that, instead of becoming more sophisticated in our ability to describe the boundaries of a DAG, we've moved Filecoin back toward something like a bitswap world, where everything is about single blocks, not DAGs, and you put the client in charge of piecing together the DAGs and storage providers are really just glorified block providers. That's what the indexer does, that's what attempting to enable bitswap on providers does.

Where we're really at is in an in-between place, where we have both worlds. The bitswap world works, but it's not awesome for assembling large DAGs. Graphsync and HTTP are now in the mix, but they still almost exclusively operate around the notion that the single provider you're talking to has everything under that CID in order to be able to assemble it back together. Graphsync I think is a bit better with affordances for missing pieces of the DAG (we don't have anything like that for the HTTP Trustless Gateway protocol we're using - you have it all or you fail).

There's a big retrieval problem to be solved here, and we'll end up solving it, but it's not going to be straightforward: how do you reassemble a large DAG spread across many providers? Even with bitswap, which in theory is good at doing this, we've not built retrieval tools that can figure out that they need to go asking for an entirely new set of providers once they exhaust the ones they're talking to for the DAG they're working on.
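One possible shape for such a tool, as a toy sketch only: walk the DAG block by block, try the providers already in use, and when they are exhausted for a given block, look up new providers for that specific CID. The in-memory blockstores, the `find_providers` callback (a stand-in for an indexer lookup), and all names are assumptions for illustration; this is not how lassie or bitswap are implemented.

```python
from typing import Callable, Dict, List, Optional

Block = List[str]              # a block modelled as the list of its child CIDs
Blockstore = Dict[str, Block]  # hypothetical: what one provider can serve
FindProviders = Callable[[str], List[str]]  # hypothetical indexer lookup by CID


def fetch_block(
    cid: str,
    active: List[str],
    find_providers: FindProviders,
    providers: Dict[str, Blockstore],
) -> Optional[Block]:
    """Try providers already in use first, then discover new ones for this CID."""
    for name in active:
        block = providers.get(name, {}).get(cid)
        if block is not None:
            return block
    for name in find_providers(cid):
        if name in active:
            continue
        active.append(name)  # keep using this provider for later blocks
        block = providers.get(name, {}).get(cid)
        if block is not None:
            return block
    return None


def fetch_dag(
    root: str,
    find_providers: FindProviders,
    providers: Dict[str, Blockstore],
) -> Dict[str, Block]:
    """Reassemble the whole DAG, raising LookupError if any block is unavailable."""
    fetched: Dict[str, Block] = {}
    active: List[str] = []
    stack = [root]
    while stack:
        cid = stack.pop()
        if cid in fetched:
            continue
        block = fetch_block(cid, active, find_providers, providers)
        if block is None:
            raise LookupError(f"no provider found for block {cid}")
        fetched[cid] = block
        stack.extend(block)
    return fetched


# Toy network: the DAG is split across two providers.
providers = {
    "sp1": {"root": ["a", "b"], "a": []},
    "sp2": {"b": ["c"], "c": []},
}


def find_providers(cid: str) -> List[str]:
    return [name for name, store in providers.items() if cid in store]


print(sorted(fetch_dag("root", find_providers, providers)))
```

The sketch glosses over the hard parts of the problem described above: per-block lookups are expensive at scale, providers can claim blocks they do not serve, and fetching one block at a time gives up the efficiency of DAG-oriented transfers.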

But, the good news is that this isn't a majority case (yet). Most data being stored is relatively small DAGs (<32G); so most "do you have this CID and the entire DAG under it" queries should be :thumbsup:. That's just not universally true and over time will probably erode.

filecoin-station / spark

How to define whether a CID is retrievable? #9