filecoin-project / specs

The Filecoin protocol specification
https://spec.filecoin.io

Retrieval by any CID: broadcast & usage #905

Open · mishmosh opened this issue 4 years ago

mishmosh commented 4 years ago

Now that https://github.com/filecoin-project/go-fil-markets/pull/166 is merged, go-fil-markets optionally supports retrieval by any CID, not just root (let's call these deep CIDs). To opt-in, the provider (miner) will need to maintain an index of deep CIDs, which will cost them extra storage space and overhead.

We need to clarify when & how this ability can be used. In other words, how do providers tell clients if they’re offering to maintain a CID index?
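For concreteness, here is a minimal sketch of what a provider-side deep-CID index might look like, assuming a simple mapping from block CID to the block's location within a stored piece. All names here (BlockLocation, DeepCIDIndex) are hypothetical and not part of go-fil-markets:

```go
package main

import "fmt"

// BlockLocation records where a block lives inside a stored piece.
// PieceCID identifies the deal's piece (CAR file); Offset/Size locate
// the block's bytes within the unsealed piece.
type BlockLocation struct {
	PieceCID string
	Offset   uint64
	Size     uint64
}

// DeepCIDIndex is a hypothetical provider-side index from every block
// CID in a piece to its location, enabling retrieval by non-root CIDs.
type DeepCIDIndex map[string]BlockLocation

// Lookup resolves a deep CID to a piece and byte range, or reports
// that the provider does not index this CID.
func (idx DeepCIDIndex) Lookup(cid string) (BlockLocation, bool) {
	loc, ok := idx[cid]
	return loc, ok
}

func main() {
	idx := DeepCIDIndex{
		"bafy...deep": {PieceCID: "baga...piece", Offset: 4096, Size: 262144},
	}
	if loc, ok := idx.Lookup("bafy...deep"); ok {
		fmt.Printf("found in piece %s at offset %d\n", loc.PieceCID, loc.Offset)
	}
}
```

Maintaining such a map for every block in every piece is the storage overhead the provider would be opting into.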

Requirements during storage phase

Requirements during retrieval phase

Timing

Design needs to be complete & agreed upon soon in case any protocol changes are required. However, implementation of this feature could be punted until after mainnet if necessary.

Use cases for any-CID retrieval

A general need of "retrieve less than a sector" has been discussed, but let's get more specific:

Misc considerations

Possible directions

1. On-chain, as part of StorageDealProposal. Only as a last resort; let's avoid putting more data on-chain if possible.

2. Put the info into the Ask (off-chain). But how do you enforce this if the miner stops maintaining the CID index?

3. Punt this to the retrieval market, and make it required. The first time a sector is retrieved, the retrieval miner fetches the whole sector and creates the CID index. This is slow. [Would a request need to include both the deep CID and the root CID, in case the former fails or isn't recognized?] However, retrieval miners then maintain that CID index for a certain amount of time by default, so subsequent retrievals for any deep CID are faster. (A sketch of this lazy-index flow follows the list.)

4. What else?
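For what option 3 might look like mechanically, here is a minimal sketch of lazy per-sector indexing. The types and helpers (lazyIndexer, scanSector) are invented for illustration and do not correspond to anything in go-fil-markets:

```go
package main

import (
	"fmt"
	"time"
)

// indexEntry locates one block inside an unsealed sector.
type indexEntry struct {
	offset, size uint64
}

// lazyIndexer builds a per-sector CID index only when a deep-CID
// retrieval first arrives, then keeps it around for a retention window.
type lazyIndexer struct {
	indexes   map[uint64]map[string]indexEntry // sectorID -> CID -> location
	retention time.Duration
}

// scanSector is a stand-in for unsealing a sector and walking its CAR
// payload to enumerate every block CID; in reality this is the slow step.
func scanSector(sectorID uint64) map[string]indexEntry {
	// ... unseal, iterate CAR blocks, record (cid, offset, size) ...
	return map[string]indexEntry{"bafy...deep": {offset: 4096, size: 262144}}
}

// resolve answers a deep-CID request, paying the full-sector scan cost
// once and serving subsequent lookups from the cached index.
func (l *lazyIndexer) resolve(sectorID uint64, cid string) (indexEntry, bool) {
	idx, ok := l.indexes[sectorID]
	if !ok {
		idx = scanSector(sectorID) // slow first retrieval
		l.indexes[sectorID] = idx
		// A production version would evict idx after l.retention elapses.
	}
	entry, ok := idx[cid]
	return entry, ok
}

func main() {
	l := &lazyIndexer{indexes: map[uint64]map[string]indexEntry{}, retention: 24 * time.Hour}
	if e, ok := l.resolve(42, "bafy...deep"); ok {
		fmt.Printf("block at offset %d (%d bytes)\n", e.offset, e.size)
	}
}
```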

cc @hannahhoward @jbenet @whyrusleeping @sternhenri @zixuanzh @anorth

sternhenri commented 4 years ago

This is the first time I'm thinking about this, so there may be existing a priori considerations I am missing.

It seems to me like 1 and 2 are the extreme solutions here and present very different product considerations. In any event, it seems like you would need some sort of reputational layer to help enforce this if it is not required.

I think it makes sense to punt this to the retrieval market (potentially based on the hypothesis that the sort of data for which this will be most useful is of the ilk that should be cached: often accessed, not necessarily useful in whole).

Whether or not it should be required is a separate question, but it definitely makes things easier.

There would probably need to be some off-chain mapping available from deep CIDs to root CIDs...

I don't think any alternative is viable that involves setting up some sort of pledge for storage/retrieval miners claiming to provide this service, which could be slashed if they do not.

mishmosh commented 4 years ago

Related:

mikeal commented 4 years ago

I’d like to understand where some of these requirements are coming from because, while I would love to have the flexibility to query for any CID, even I’m having a hard time justifying the tradeoffs that would have to be made.

> The ability to do partial retrieval by deep CID is a deciding factor in choosing storage solutions. When proposing storage deals, clients want to know whether whatever they're getting themselves into will support retrieving data by deep CID.

What is the use case for this?

You still have to know the commP and/or root CID of the CAR file in order to do the retrieval. Which means that, no matter what, you've got a mapping somewhere of "this data is in this CAR file." What are the use cases where you would know that, but you wouldn't be capable of either creating a CAR file structured in such a way that you can query it without specific CIDs, or just storing the traversal necessary to reach any CID wherever you're storing the data about the CAR file?

In the data we're preparing we have hundreds of millions of files and file slices in a database, each of which contains info about the CAR file the data is in and a selector to the root of the file within that CAR file. We write the CAR files with a root node that is just an array of CIDs for the roots of all the individual files, so the traversal we store is just rootCID/offset.
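To illustrate that bookkeeping, a hypothetical record for one stored file might look like the following; the field names and layout are invented for the example, not taken from the actual database:

```go
package main

import "fmt"

// fileRecord is a hypothetical database row for one stored file,
// mirroring the bookkeeping described above: which CAR file holds the
// data, plus the traversal from the CAR root to the file's root.
type fileRecord struct {
	CARRootCID string // root block of the CAR file (an array of file roots)
	PieceCID   string // commP of the piece containing the CAR
	RootOffset int    // index of this file's root in the CAR root's CID array
}

// selectorPath renders the stored traversal as a rootCID/offset path,
// which is all that's needed to retrieve the file without a deep-CID index.
func (r fileRecord) selectorPath() string {
	return fmt.Sprintf("%s/%d", r.CARRootCID, r.RootOffset)
}

func main() {
	rec := fileRecord{CARRootCID: "bafy...carroot", PieceCID: "baga...piece", RootOffset: 17}
	fmt.Println(rec.selectorPath()) // e.g. bafy...carroot/17
}
```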

> 3. Punt this to the retrieval market, make it required

We should not downplay how expensive this may be. Optimistically, you max out the block size and you're only storing large files, so you've got 1000 CIDs per 1GB of raw data, plus a few blocks of unixfsv1 data and a root block for the CAR file; let's round that to 1 CID per MB.

So, storing a PB of data will cost you 1 billion CID index entries. That's substantial.

That's the most optimistic view possible. In practice, we're seeing data sets with many, many small files, so the block size is much smaller and the unixfs overhead is more visible; 1PB of data is closer to 50 billion CIDs.
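The back-of-envelope arithmetic, spelled out (the optimistic figure is from the comment above; the pessimistic average block size of ~20 KB is back-derived from the 50 billion estimate):

```go
package main

import "fmt"

func main() {
	const (
		pb          = 1_000_000_000_000_000 // 1 PB in bytes (decimal)
		optimistic  = 1_000_000             // ~1 CID per MB: large files, max block size
		pessimistic = 20_000                // ~1 CID per 20 KB: many small files + unixfs overhead
	)
	fmt.Printf("optimistic:  %d CIDs per PB\n", pb/optimistic)  // 1,000,000,000 (1 billion)
	fmt.Printf("pessimistic: %d CIDs per PB\n", pb/pessimistic) // 50,000,000,000 (50 billion)
}
```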

Unless we think looking up arbitrary CIDs is a primary use case, I don't see how this overhead is justified.

> 1. On-chain, as part of StorageDealProposal

Do you mean there is a boolean property in the deal for CID indexing, or do you mean that the CIDs are on chain? Based on the data we have "on deck," that's probably 100 billion CIDs on chain shortly after launch, so I'm going to assume this isn't what you meant.

If they get rid of the index then they would suffer a much more expensive scan when they end up accepting a query later for an arbitrary CID.

Again, I'm not sure this needs to be a requirement at all, but if it were, this makes the most sense to me. If the requirement is in the Ask, then the node would have to respond to this type of retrieval, and if it decides not to keep the index around, it is penalized plenty by having to do the scan.

mikeal commented 4 years ago

One thing I just realized is that “indexing every CID” and “query by arbitrary CID in the CAR file” are being conflated a bit. Indexing every CID is a good idea if you want to improve all retrievals. Querying by arbitrary CID in your data doesn’t actually require an index and there’s an alternative we haven’t considered yet.

If we improved selectors to understand our HAMT spec, we could punt this to the storage side.

The way it would work is: if you want a CID index of your data, you'd need to store a CID map in the CAR file. You'd prepare your graph, then create a HAMT mapping every CID to every block in your data, then create a root block for your CAR file that points at the HAMT. Then you can create selectors against the HAMT for any CID in your data; the only CIDs you wouldn't be able to query would be the ones for the HAMT itself, and I don't see what use case would require those.
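A rough sketch of that proposed CAR layout, with invented stand-ins for the real HAMT and CAR machinery:

```go
package main

import "fmt"

// carLayout models the proposed CAR file layout: the data graph, a HAMT
// mapping every CID in the graph to its block, and a root block that
// points at the HAMT. Types here are illustrative stand-ins, not a real
// HAMT or CAR implementation.
type carLayout struct {
	blocks   map[string][]byte // the prepared data graph, keyed by CID
	hamtRoot string            // root CID of the CID->block HAMT
	rootCID  string            // CAR root block, pointing at hamtRoot
}

// buildCIDMap stands in for constructing the HAMT: every block CID in
// the graph becomes a key, so a selector walking the HAMT can reach any
// block without the provider keeping a separate out-of-band index.
func buildCIDMap(blocks map[string][]byte) string {
	for cid := range blocks {
		_ = cid // insert cid -> block link into the HAMT
	}
	return "bafy...hamtroot"
}

func main() {
	graph := map[string][]byte{"bafy...a": nil, "bafy...b": nil}
	car := carLayout{blocks: graph, hamtRoot: buildCIDMap(graph)}
	car.rootCID = "bafy...carroot"
	// A retrieval selector would then address: <rootCID>/hamt/<wanted CID>
	fmt.Printf("query path: %s/hamt/%s\n", car.rootCID, "bafy...a")
}
```

The cost of the index is then paid by the client who wants it, inside their own deal data, rather than imposed on every provider.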

mikeal commented 4 years ago

> A decentralized storage app uses Filecoin for backup, making storage deals in units of 32GB. When one of their users needs to recover 5 minutes' worth of data (let's say 15MB), they want to serve it back relatively quickly, and don't need (or want to use bandwidth for) any other data in that sector.

This is why retrievals have selectors: so you can get less than the full data in the sector. How much data depends on how precise the selector is, which depends on how precise the data structure you built and stored in the CAR file is.

There’s no reason this use case must be done with arbitrary CID retrieval.
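For instance, if the backup app wrote each time-window chunk under a keyed node in its graph, a path-style selector could pull just the chunk it needs; the helper and path here are hypothetical, not a real selector API:

```go
package main

import "fmt"

// retrieveBySelector is a stand-in for a retrieval request that carries
// an IPLD path selector instead of a bare deep CID: the provider walks
// the path from the deal's root and streams only the matched subtree.
func retrieveBySelector(rootCID, path string) {
	fmt.Printf("retrieve %s, selector path %q\n", rootCID, path)
}

func main() {
	// The backup app knows its own layout, so it can address the
	// 5-minute chunk it needs (~15 MB) without a provider-side index.
	retrieveBySelector("bafy...backuproot", "chunks/2020-03-14T10:05")
}
```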

Backups are stored as diffs. It's not hard to imagine building the diff as different structures that would allow for different types of queries.

ribasushi commented 4 years ago

> Backups are stored as diffs

This only applies to "old school backups". Content-addressable systems generally elect to derive a "root" anew, and let any deduplication shake out on its own. E.g., git trees are not diffed against each other; they are forever standalone.

The more of the underlying storage moves into the content-addressable realm, the fewer applications of "I will send a delta diff" you are going to encounter.

mikeal commented 4 years ago

> This only applies to "old school backups". Content-addressable systems generally elect to derive a "root" anew, and let any deduplication shake out on its own. E.g., git trees are not diffed against each other; they are forever standalone.

I’d like to see this in practice.

This describes how we currently update a graph and GC: we just swap the root and eventually clear any blocks that aren't referenced. But that doesn't produce a usable backup, so I don't think the strategy will be identical when you do produce one.

You have to keep in mind that your backup is rarely comparing a complete old state to a complete new state; you're comparing an old backup state (partial) to the new complete state. In order to make these perform well, it's useful to store each backup as a unique "backup" structure and not a bag of blocks.

But even if we do things the way you suggest, the process of identifying all of the new blocks to include in the backup will end up using some sort of map or vector, which you could use to produce a HAMT that you stick in the CAR file as well. So if we were to improve selectors to understand a HAMT, I think you'd have what you need to query the CAR for all your blocks.

anorth commented 4 years ago

There is no enforcement of retrievability at all. There is a market, and it is assumed there is a price at which retrieval becomes worthwhile (given proof that the provider does indeed have the data accessible). Similarly, I expect special types of retrieval to be a market problem.

As I understand it, the largest cost of retrieval is unsealing the sector. I think we support partial unsealing, but this changes as the PoRep changes and I'm not sure today (@porcuquine?). The whole CAR file's worth of a deal probably needs to be unsealed.

@mikeal's ideas above are really promising for limiting the amount of data actually transmitted, to the extent that bandwidth is a significant part of the retrieval cost.

porcuquine commented 4 years ago

SDR does have partial unsealing, but most of the cost is paid by unsealing at all. Candidate constructions for the proofs upgrade may be better in this regard but aren't yet confirmed.