Storage / Retrieval Deals With Partial Content

hannahhoward commented 3 years ago

Checklist

[X] This is not a new feature or an enhancement to the Filecoin protocol. If it is, please open an FIP issue.
[X] This is not brainstorming ideas. If you have an idea you'd like to discuss, please open a new discussion on the lotus forum and select the category as Ideas.
[X] I have a specific, actionable, and well motivated feature request to propose.

Lotus component

[ ] lotus daemon - chain sync
[ ] lotus miner - mining and block production
[ ] lotus miner/worker - sealing
[ ] lotus miner - proving(WindowPoSt)
[X] lotus miner/market - storage deal
[X] lotus miner/market - retrieval deal
[X] lotus miner/market - data transfer
[ ] lotus client
[ ] lotus JSON-RPC API
[ ] lotus message management (mpool)
[ ] Other

What is the motivation behind this feature request? Is your feature request related to a problem? Please describe.

Let's say I want to store a large existing IPLD dataset larger than a sector on Filecoin. Currently, we face several obstacles:

Right now, from a storage standpoint, the only way to store anything but a whole DAG is an offline deal.
From a retrieval standpoint, we can retrieve a partial DAG by expressing a selector other than "give me the entire DAG". But this has several problems for our large dataset:
1. We can't currently do this at the CLI level because we lack a command-line syntax for selectors.
2. Even if we could, the syntax for selectors is currently limited -- we lack a "give me the whole DAG except the part below this CID, because I know it's in another piece" selector.
3. Even if we had more powerful selectors, selectors require the retrieval client to know a priori which selector yields the part of the DAG contained in a single sector.

Let's consider what we'd like to be possible:

The person storing should be able to break up their very large DAG in arbitrary ways into a set of partial DAGs
The person retrieving should be able to just start at the root, make a retrieval, see what they get back, and then plan to make retrievals from there.
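
To make the retrieval goal above concrete, here is a minimal Go sketch of the "start at the root, see what comes back, plan from there" loop. Everything in it is hypothetical and for illustration only: `Piece`, `fetchFrom`, and `RetrieveAll` are not Lotus or go-fil-markets APIs, CIDs are plain strings, and discovery of which piece holds a CID is reduced to a linear scan.

```go
package main

import "fmt"

// Piece is a hypothetical model of one deal's payload:
// block ID -> IDs of the blocks it links to.
type Piece map[string][]string

// fetchFrom walks the DAG inside a single piece starting at root, marks every
// block it finds in have, and returns the linked IDs the piece did not contain.
func fetchFrom(p Piece, root string, have map[string]bool) (missing []string) {
	stack := []string{root}
	for len(stack) > 0 {
		id := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if have[id] {
			continue
		}
		links, ok := p[id]
		if !ok {
			missing = append(missing, id) // not in this piece: retrieve it elsewhere
			continue
		}
		have[id] = true
		stack = append(stack, links...)
	}
	return missing
}

// RetrieveAll starts at root and keeps issuing retrievals for whatever the
// previous partial response left out. It returns the blocks obtained, the
// number of retrievals made, and any IDs that no piece could serve.
func RetrieveAll(pieces []Piece, root string) (have map[string]bool, retrievals int, unresolved []string) {
	have = map[string]bool{}
	tried := map[string]bool{}
	pending := []string{root}
	for len(pending) > 0 {
		next := pending[0]
		pending = pending[1:]
		if have[next] || tried[next] {
			continue
		}
		tried[next] = true
		found := false
		for _, p := range pieces {
			if _, ok := p[next]; ok {
				retrievals++
				pending = append(pending, fetchFrom(p, next, have)...)
				found = true
				break
			}
		}
		if !found {
			unresolved = append(unresolved, next)
		}
	}
	return have, retrievals, unresolved
}

func main() {
	pieces := []Piece{
		{"root": {"a", "b"}, "a": {}}, // first deal: root and one subtree
		{"b": {"c"}, "c": {}},         // second deal: the rest
	}
	have, retrievals, unresolved := RetrieveAll(pieces, "root")
	fmt.Println(len(have), retrievals, unresolved) // 4 2 []
}
```

The retrieval client never needs to know the piece boundaries up front; each partial response tells it which CIDs to ask for next.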

We also already have alternate storage clients like Estuary whose proposed deals are failing because they are trying to send partial DAG data to miners.

Describe the solution you'd like

Fortunately, our underlying transport protocol for data transfer, Graphsync, can serve requests where the peer sending the data only has part of the DAG expressed by the requested CID+Selector. The Graphsync responder knows how to communicate to the requestor what it served and what it didn't, and the requestor knows how to process this information and still verify the response.
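
As a rough illustration of the behaviour described above, the following Go sketch walks a DAG using only locally available blocks and records which linked blocks had to be skipped. It is a toy model, not the Graphsync implementation: `Node` and `Traverse` are hypothetical names, CIDs are plain strings, and the selector is implicitly "the entire DAG".

```go
package main

import "fmt"

// Node is a toy stand-in for an IPLD block: just its outgoing links.
type Node struct {
	Links []string
}

// Traverse walks the DAG depth-first from root using only the blocks in store.
// It returns the IDs it could serve and the IDs that were linked to but are
// missing locally, which a responder would report to the requestor as skipped.
func Traverse(store map[string]Node, root string) (served, skipped []string) {
	seen := map[string]bool{}
	var walk func(id string)
	walk = func(id string) {
		if seen[id] {
			return
		}
		seen[id] = true
		node, ok := store[id]
		if !ok {
			skipped = append(skipped, id) // block not held by this peer
			return
		}
		served = append(served, id)
		for _, l := range node.Links {
			walk(l)
		}
	}
	walk(root)
	return served, skipped
}

func main() {
	// The responder holds root and the "a" subtree, but not "b".
	store := map[string]Node{
		"root": {Links: []string{"a", "b"}},
		"a":    {Links: []string{"a1"}},
		"a1":   {},
	}
	served, skipped := Traverse(store, "root")
	fmt.Println("served:", served)   // served: [root a a1]
	fmt.Println("skipped:", skipped) // skipped: [b]
}
```

The requestor can still verify everything in `served` because each block is checked against the link that pointed to it; `skipped` is simply the frontier where verification stops.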

Currently, the go-data-transfer library fails all transfers where the entire request root + IPLD selector is not served.

I propose that we allow a data transfer to complete successfully even when it has only served a partial response.

My proposed bubbling up to Lotus is as follows:

go-data-transfer should emit an event on both sides to notify the calling library of a CID that was not served and was skipped over
go-data-transfer should have a new final status of PartiallyCompleted for when a transfer is done sending/receiving but the entire DAG was not served (plus possibly some additional events that put it in this state)
go-fil-markets storage client will fire a ClientEventDataTransferComplete when go-data-transfer ends in PartiallyCompleted (the same event emitted when data transfer ends in Completed) and otherwise be unchanged
go-fil-markets storage provider will fire a ProviderEventDataTransferCompleted when go-data-transfer ends in PartiallyCompleted (the same event emitted when data transfer ends in Completed) and otherwise be unchanged. The CommP calculation will be run on the received CAR file for the partial DAG and as long as it matches the Storage Proposal, the deal will continue as planned
go-fil-markets retrieval client will fire ClientEventPartiallyComplete when data transfer ends with the PartiallyCompleted status. This will trigger analogous "Partial" states for DealStatusCheckComplete and DealStatusFinalizingBlockstore, which will transition to DealStatusPartiallyComplete as the retrieval client's final status
go-fil-markets retrieval provider will fire ProviderEventPartiallyComplete when a data transfer ends with the PartiallyCompleted status. This will move the deal to DealStatusPartiallyCompleting and then DealStatusPartiallyCompleted when CleanupDeal is finished.
At the Lotus API level, ClientRetrieve is unchanged -- it just returns statuses from the retrieval client.
At the CLI level, ClientRetrieve will output all retrieval statuses and a final message indicating that only a partial transfer was completed.
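
The proposed bubbling-up can be summarised as a lookup from (market role, final transfer status) to the event that would fire. The Go sketch below is only a restatement of the list above: the event names are plain strings copied from this proposal, while `Role`, `TransferStatus`, and `EventFor` are hypothetical helpers and not go-data-transfer or go-fil-markets types. Combinations this proposal does not spell out return `ok == false`.

```go
package main

import "fmt"

// TransferStatus models the final go-data-transfer status (hypothetical type).
type TransferStatus int

const (
	Completed TransferStatus = iota
	PartiallyCompleted // the new final status proposed in this issue
	Failed
)

// Role identifies which go-fil-markets component observes the transfer.
type Role int

const (
	StorageClient Role = iota
	StorageProvider
	RetrievalClient
	RetrievalProvider
)

// EventFor returns the name of the event this proposal says should fire.
// ok is false for combinations the proposal does not describe.
func EventFor(role Role, status TransferStatus) (event string, ok bool) {
	switch role {
	case StorageClient:
		// Storage treats a partial transfer exactly like a complete one.
		if status == Completed || status == PartiallyCompleted {
			return "ClientEventDataTransferComplete", true
		}
	case StorageProvider:
		// CommP over the received CAR decides whether the deal proceeds.
		if status == Completed || status == PartiallyCompleted {
			return "ProviderEventDataTransferCompleted", true
		}
	case RetrievalClient:
		if status == PartiallyCompleted {
			return "ClientEventPartiallyComplete", true
		}
	case RetrievalProvider:
		if status == PartiallyCompleted {
			return "ProviderEventPartiallyComplete", true
		}
	}
	return "", false
}

func main() {
	ev, _ := EventFor(StorageProvider, PartiallyCompleted)
	fmt.Println(ev)
	ev, _ = EventFor(RetrievalClient, PartiallyCompleted)
	fmt.Println(ev)
}
```

The asymmetry is deliberate: storage deals are validated by CommP, so they need no new states, whereas retrieval deals need distinct "Partial" states so the client can tell that more retrievals are required.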

Describe alternatives you've considered

See above -- while selectors are potentially a path forward, they have several limitations, and the path to achieving a desirable result through them is long.

Additional context

I am specifically suggesting leaving the Lotus import process unchanged for now -- we are not trying to solve importing partial DAGs into Lotus at the moment.
Rather, the client that already has a need for this functionality is Estuary, so what's ultimately most important is for Lotus to support this on the miner side and the retrieval client side.

jennijuju commented 3 years ago

Cc @whyrusleeping @dirkmc @aarshkshah1992 @raulk for review

hannahhoward commented 3 years ago

I want to point out why you want this AS WELL AS very good selectors.

We already have a StopAt selector in the latest versions of go-ipld-prime: https://github.com/ipld/go-ipld-prime/pull/214

However, particularly in the retrieval case, the client may not know how to assemble this selector ahead of time. If I make a deal for a complex DAG with several missing pieces, for a client to retrieve this with a selector they need to know ahead of time what pieces are missing. This is pretty tricky to communicate -- or it adds overhead to discovery mechanisms.

It seems ideal to still be able to serve a "not quite complete retrieval" as a fallback.

jsign commented 3 years ago

We're interested in this feature. It would make packing bigger-than-a-sector DAGs into sector-sized deals much simpler, since we wouldn't have to deal with "complete-subdags" constraints. We could just pack the maximum number of blocks possible and let the retriever know that they should retrieve X deals to get the complete thing.

If doing partial retrievals makes sense for the client, then let that be an "application" constraint to consider while packing things into deals, but not a mandatory one.
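
A minimal Go sketch of the "pack the maximum number of blocks possible" idea, assuming blocks arrive in traversal order and each deal has a fixed byte capacity. `Block` and `Pack` are hypothetical names, and the sketch ignores CAR framing overhead and piece padding, which a real packer would have to account for.

```go
package main

import "fmt"

// Block is a hypothetical record of one IPLD block and its size in bytes.
type Block struct {
	CID  string
	Size int
}

// Pack greedily fills deals in order: it keeps adding blocks to the current
// deal until the next block would exceed capacity, then starts a new deal.
// No attention is paid to subgraph boundaries, so each deal may hold a
// partial DAG.
func Pack(blocks []Block, capacity int) ([][]Block, error) {
	var deals [][]Block
	var cur []Block
	used := 0
	for _, b := range blocks {
		if b.Size > capacity {
			return nil, fmt.Errorf("block %s (%d bytes) exceeds deal capacity %d", b.CID, b.Size, capacity)
		}
		if used+b.Size > capacity {
			deals = append(deals, cur)
			cur, used = nil, 0
		}
		cur = append(cur, b)
		used += b.Size
	}
	if len(cur) > 0 {
		deals = append(deals, cur)
	}
	return deals, nil
}

func main() {
	blocks := []Block{{"b1", 4}, {"b2", 4}, {"b3", 4}, {"b4", 4}, {"b5", 4}}
	deals, err := Pack(blocks, 10)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(deals)) // 3
}
```

With partial-DAG deals accepted by miners, the number of deals (the "X" above) falls straight out of the packing and can be handed to the retriever.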
