filecoin-project / specs

The Filecoin protocol specification
https://spec.filecoin.io

Reducing the marginal cost of retrieval #431

Open miyazono opened 4 years ago

miyazono commented 4 years ago

Problem Statement

From a cryptoeconomic perspective, I argue that retrieval needs a low marginal cost in order to incentivize rational storage miners to serve files. The more time and processing power required to extract data from a replica, the more expensive retrieval will be for retrieval clients.

It seems likely that early miners will be primarily driven by the block reward, and are likely to hire themselves to store useless data until the expected cost of delays and bandwidth is exceeded by the storage fee paid by the client. In addition, even for miners who store desirable data, the cost of allocating the time and processing power required to extract data from a replica must be exceeded by the retrieval fee paid by the client. Because it currently seems to be expensive for consensus miners to retrieve data, I believe that expecting low retrieval prices is optimistic. However, I believe we should (as a team) be able to find a way to securely couple consensus mining and retrieval mining in a way that subsidizes retrieving data with the block reward.

Potential Solution Direction

If we can decrease the cost of data extraction from a replica or shift that cost onto a different portion of the protocol (preferably consensus mining, which is directly subsidized by the block reward), then we can confidently expect to directly reduce the cost of retrieval for rational miners.

Proposal: Could we require that consensus miners cache their data by making them generate their PoSt (and/or perhaps the PoRep) over both the replica and a copy of the original data?

I realize this has a number of undesirable effects:

  1. Storage capacity of the Filecoin network is almost halved (twice as much room on devices as referenced data on the network)
  2. Miners are incentivized to store multiple replicas of the same file, reducing the protection against data loss that replication redundancy provides.

However, I think that the cryptoeconomic benefits of fast and easy retrieval would make this change worthwhile, especially given that mining costs seem likely to be dominated by computational hardware rather than hard drives. In addition, this direction has the benefit that the research and engineering costs here seem very low.
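To make the proposal concrete, here is a toy Python sketch of how proof challenges might be drawn over both the replica and the cached copy. The names and the derivation are hypothetical illustrations, not the real PoSt construction.

```python
import hashlib

def derive_challenges(randomness: bytes, num_challenges: int, sector_size: int):
    """Derive pseudorandom (region, offset) challenges.

    Each challenge targets either the sealed replica or the cached
    plaintext copy, so a miner must keep both on hand to answer quickly.
    Hypothetical sketch only -- not the actual PoSt challenge scheme.
    """
    challenges = []
    for i in range(num_challenges):
        digest = hashlib.sha256(randomness + i.to_bytes(4, "big")).digest()
        offset = int.from_bytes(digest[:8], "big") % sector_size
        region = "replica" if digest[8] % 2 == 0 else "cached_copy"
        challenges.append((region, offset))
    return challenges
```

Under this kind of scheme, a miner who discards the cached copy would have to re-extract (or, for junk data, regenerate) it before each proof window, which is the attack surface discussed below.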

Relation to SealStack and Seal Expansion

Ian and I had attempted to explore if there was agreement on the importance of cheap retrieval in a minimally disruptive way. While I fear it eventually became more disruptive and time consuming than intended (for which I apologize), I believe that we are now fairly confident that there is broad interest in cheap retrieval.

This exploration began with my belief that we could not have fast and efficient extraction because of SealStack (more here, here, and here). This led Ian and me to propose and discuss "Seal Expansion" as evidence that I believed security could be preserved if fast and efficient extraction existed.

I've copied a lot of the recent discussion of Seal Expansion into a separate issue, which I've kept separate because I don't want that exploration to distract from the goal of reducing the marginal cost of retrieval.

Next steps

I'd love to start a discussion around possible mechanisms for reducing the marginal cost of retrieval.

For instance, in a discussion with @whyrusleeping, he pointed out we could use contracts to require that miners retrieve data, for example, by making challenge games where participants are prompted to request and serve data by a contract using randomness from the blockchain.
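As a rough illustration of such a challenge game (hypothetical names; the actual contract mechanics would differ), chain randomness could deterministically select a requester, a provider, and a piece, with the provider obligated to serve the piece within a deadline:

```python
import hashlib

def select_retrieval_challenge(block_hash: bytes, miners: list, pieces: list):
    """Pick a (requester, provider, piece) triple from chain randomness.

    Toy sketch of a contract-driven retrieval challenge: participants
    are chosen pseudorandomly from on-chain randomness, and the provider
    must serve the piece within a deadline or forfeit collateral.
    """
    d = hashlib.sha256(block_hash).digest()
    requester = miners[int.from_bytes(d[:4], "big") % len(miners)]
    provider = miners[int.from_bytes(d[4:8], "big") % len(miners)]
    piece = pieces[int.from_bytes(d[8:12], "big") % len(pieces)]
    return requester, provider, piece
```

Because the selection is derived from the block hash, every participant can verify that the challenge was assigned fairly rather than targeted.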

*edit: fixed sealstack issue number (433)

nicola commented 4 years ago

Quick note: seal expansion would not reduce the cost of retrieval. It would (1) increase the storage cost of sealing for malicious and honest miners, hence mitigating SEALSTACK, and (2) avoid retrieval based on sequential operations (compared to a resolution to SEALSTACK that requires slow sequential retrieval).

In other words, the "cost" (the energy cost) remains the same, but the time cost is reduced (parallel retrieval instead of slow sequential retrieval).

To give you more context: in order to extract a file, a miner must "undo" the zigzag encoding. While this can be parallelized, the energy cost is equivalent to encoding the file.
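As a toy model of this asymmetry (an illustration, not actual ZigZag): in the sketch below, encoding is inherently sequential because each node keys off the previous encoded node, while decoding each node depends only on already-known ciphertexts and so could run fully in parallel, even though it performs the same number of hash evaluations (the "energy cost").

```python
import hashlib

def _h(b: bytes) -> int:
    """Toy 32-bit hash used as the per-node encoding key."""
    return int.from_bytes(hashlib.sha256(b).digest()[:4], "big")

def encode(data: list, seed: bytes) -> list:
    """Sequential: each node's key depends on the previous encoded node."""
    enc, prev = [], seed
    for x in data:
        c = (x + _h(prev)) % 2**32
        enc.append(c)
        prev = c.to_bytes(4, "big")
    return enc

def decode(enc: list, seed: bytes) -> list:
    """Each node needs only already-known ciphertexts, so every element
    of this list could be computed in parallel -- yet the total number
    of hash evaluations equals that of encoding."""
    prevs = [seed] + [c.to_bytes(4, "big") for c in enc[:-1]]
    return [(c - _h(p)) % 2**32 for c, p in zip(enc, prevs)]
```

This is the sense in which extraction can be fast (wall-clock, via parallelism) without being cheap (total work).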

miyazono commented 4 years ago

Understood, @nicola, thanks for the additional context on extraction.

I'd love to hear if you can think of any issues or difficulty on the proposal of generating a proof over {replica + cached copy}.

ianjdarrow commented 4 years ago

Chiming in on the extreme desirability of fast retrieval here – there is extremely enthusiastic product consensus that trading hard drive space for retrieval latency is a good idea. Tagging @pooja and @mishmosh to add any further refinements to that statement.

@nicola, it would be amazingly helpful if you could give a rough guess on whether Evan's proposal (generating a proof over {replica + cached copy}) is plausible. If we know whether it's plausible, we can get to a better answer on how to prioritize it.

(To avoid doubt: we understand that this wouldn't mean we could prove the existence of more than one cached copy – but, since we just care about >0 low-latency copies and n replicas existing, that's fine.)

pooja commented 4 years ago

@ianjdarrow is right that fast retrieval is a HUGE and extremely important requirement for any real, meaningful Filecoin use case.

If there is an opportunity to figure this out NOW, I think we absolutely need to figure out whether it works. If it can be done, i.e. if we can meet security requirements with this proposal, we should absolutely do it from a product perspective. 

And if we don't figure it out now, we absolutely have to figure it out one day very soon. So let's figure it out now if it's an option!

I don't know how to be more emphatic over GitHub issues: If there is a secure solution that makes fast retrieval viable, we absolutely should adopt it.

porcuquine commented 4 years ago

Here are my thoughts. First, I agree with the sentiment: economical retrieval is important. Second, the specific proposal raises some alarms because it seems to undo the benefits of replication. In particular, I suggest it disincentivizes the storage market — which is exactly opposite the intent. However, I also believe there is a way to recover the good properties of the proposal while addressing the problem. This seems to also bring some other desirable properties. Although it's likely further refinement is required, it intuitively seems there may be something to the idea.

The problem with requiring miners to prove they are storing unreplicated data is that it is very cheap to do this for generated data. Replication prevents the generation attack by making it too expensive. This issue's proposal would allow generation-attackers to save half their storage space by regenerating their original data just in time to create PoSt proofs. What's worse, this would be very cheap and not even incur the cost of re-replicating.

Similarly, other attacks prevented by replication would now be available (in order to avoid storing the extra unreplicated copy). For example, miners could deduplicate supposedly redundant replicas. Allowing this feels dirty. In discussion, @miyazono pointed out that pools of colluding miners who share a 'single persistent copy' of the original data must also necessarily instantiate the infrastructure required for cheap retrieval — and might therefore be rationally motivated to provide that service. While this may be true, it does not explain why these rational miners would have entered into the elaborate scheme in the first place when they could much more cheaply just generate their own data. In other words, this creates a greater cost/complexity differential than previously existed between storage-market participants and self-dealing junk-data block-reward-seeking miners.

My addition builds on the proposal to suggest that instead of being required to prove supposed possession of the original data, miners are instead required to prove possession of a second partial replica. For the sake of generality, consider that partial replicas include those which are 0% and 100% replicated. This covers both the status quo (proving possession of two identical replicas is essentially equivalent to simply proving possession of the replica) and the proposal (a 0% replica is the original data). Because their purpose is to incent retrieval, call these second committed replicas retrieval replicas.

To be more precise, when committing a sector, a miner should also commit to a second replica layer. If a full replica entails 10 ZigZag layers, then the selected layer, L, may be any value in the range [0,10]. Both L and the layer commitment, commR[L], must also be committed to the chain along with the proof of replication and existing commitments. Regardless of the selected layer, power in the power table is only assigned for the first, full replica. The potential benefit of this scheme is that it allows the miner to directly signal willingness to perform rational retrieval. This signaling represents real economic value and therefore should lead to higher storage market values. This accomplishes the goal of allowing clients to select miners based on likelihood of future retrieval.
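A minimal sketch of what such an on-chain sector commitment might carry (field names are illustrative, not spec):

```python
from dataclasses import dataclass

@dataclass
class SectorCommitment:
    """Sketch of the retrieval-replica commitment described above.

    `retrieval_layer` is L in [0, 10]: 0 means caching the raw data,
    10 means a full second replica (essentially the status quo).
    Power is assigned only for the full replica, regardless of L.
    Field names are hypothetical, not part of any spec.
    """
    comm_r: bytes            # commitment to the full replica
    retrieval_layer: int     # L, the chosen partial-replication layer
    comm_r_layer: bytes      # commR[L], commitment to the partial replica

    def __post_init__(self):
        if not 0 <= self.retrieval_layer <= 10:
            raise ValueError("retrieval layer must be in [0, 10]")
```

Since L would be visible on chain, clients could filter or price miners by it when making deals.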

Consider some cases:

L = 10: the miner elects to maximize storage space at the cost of expensive retrieval. Clients who value retrieval will pay less to such a miner in storage market deals, all else equal.

L = 0: the miner signals capacity to rapidly present the original data (as required by PoSt). However, it is somewhat questionable whether the miner will in fact store the data — since sharing of original data gives rise to deduplication attacks. Why is this problematic? If it is really the case that retrieval from fully-replicated sectors is too expensive, then loss of the original data represents real loss. Although clients may know that miners will attempt to keep the original data accessible (so they can perform proofs) the possibility of an alternate rational strategy makes this a risky proposition.

More interestingly, consider the case of L = N where N is small (and chosen to have the desired properties). Because N is greater than 0, it is not possible to deduplicate the data without also performing work — since only the original (N=0) data can be shared between replicas. N must therefore be large enough that the cost of transforming the original data through N layers of ZigZag replication makes simply storing the data a better proposition than somehow acquiring and transforming it whenever challenged. A sector replicated only to layer N will also be much faster to extract than a fully-replicated sector would. (As an aside, when L is small enough, partial in-order retrieval of a single layer may be possible.)

Although not optimal for clients, larger values of L < 10 still represent some value. For example, assume that a miner who selects L = 8 chooses not to actually keep a second replica continuously. Nevertheless, since the miner will be forced to periodically extract a full replica into an 80% replica, the statistical likelihood of fast retrieval is increased; and by agreeing to periodically prove such a second replica, the miner commits to possessing and regularly exercising the infrastructure required to bring a replica to the 80% state, which a pure block-reward miner might well not.

Even if exact calculations are difficult, it seems likely that certain close-enough interpretations of the available options will arise as Schelling points. Because there is no block-reward motivation to devote either storage or computation to anything less than a complete second replica, any commitment to do so rationally represents a willingness to court storage market deals and to provide higher value through faster retrieval. Clients really can understand something meaningful about a miner's likely behavior based on the L they select for a sector, and this will rationally affect storage bids. Since clients must know, before bidding, what L their piece will be stored in, this value should be included in asks and bids.

Details of how each point on the spectrum plays out could be calculated and analyzed, but the key point is that these analyses and calculations can and will be carried out by the market regardless. This essentially makes retrieval mining a first-class activity, with commitment to a sub-maximal L representing a kind of earnest money or retrieval pseudo-collateral. Rational miners are still free to try to exploit pricing inefficiencies, but the very fact that these attempts will involve selecting L < 10 provides direct protocol-level encouragement for the storage/retrieval market.

This may fulfill the cryptoeconomic goal of providing a protocol-based means of differentiating the storage market and block reward motivations for miners. Without a way for miners to economically commit to the storage market, clients are left to hope. Because retrieval replicas serve no practical purpose other than as a direct representation of storage-market affiliation, they provide a means of formalizing the dual market we target, through a protocol-level expression of a miner's per-sector degree of 'storage-market-affiliation'. Miners can validate the hypothesis that this may be rational by observing whether they can afford to ask more for such sectors. Clients can validate it by observing whether such miners do indeed retrieve from these sectors more often than from others.

Of course none of this prevents contractual guarantees of retrieval, but it does provide a natural and recurring mechanism for ensuring confidence that miners are actually in a position to meet such obligations.

nicola commented 4 years ago

Note on other proposals: we also had other proposals in the past that can guarantee retrieval (ideally economics will play out for fast retrieval): https://github.com/filecoin-project/research/blob/master/research-notes/2018-05-jury-trial-data-availability.md.

We just scratched the surface on what can be done.

Note on fast (but not cheap) retrieval: I did not read many of the comments, but in the current setting we already have parallel unsealing, hence fast retrieval, so seal expansion would not add much to "fast retrieval".

Seal expansion would just make it more expensive in terms of storage costs for an attacker to perform seal stack.

So, we already have fast retrieval, but it won’t be “cheap”.

Side note: current retrieval is economically as costly as sealing.

nicola commented 4 years ago

Note on this proposal and why we did not move forward in the past: on your proposal, in the past we looked into having the miner do a proof of retrievability alongside the proof of replication. However, miners that store all 0s will gain a 1x advantage in storage, since they don't have to store the 0s, while someone who has data from the market will have to. This makes it less likely for a miner to store data from the storage market.

nicola commented 4 years ago

Extension on Porcu's idea: slow proof and fast proof (idea originally from Ben). Let me extend on something @porcuquine is suggesting: a miner could be required to do two proofs of useful space: one that is slow and symmetrical, which guarantees power (for the research folks, this means symmetric-graph SDR instead of ZigZag), and one that is a fast-to-decode proof of replication with bad soundness and a large space gap (ZigZag with relaxed params).

In other words, we need to increase the cost of storing all 0s.

This new setting will just result in a "slightly" faster replication than what we have today, but presumably a cheaper one.

nicola commented 4 years ago

Note on fast retrieval today: some miners may decide to offer fast retrieval on their own and might be able to get more reputation/clients for doing so.

nicola commented 4 years ago

Added this to the list of priorities for research. While it's important to work this out sooner rather than later, there are some very high-priority questions that the team is working on, and I would not redirect/reshift priorities now.

This seems to me like the first desirable improvement to look into once we get the basic proofs complete.

(For reference: https://github.com/filecoin-project/research/issues/152)

miyazono commented 4 years ago

Responses to @nicola (mostly agreement and extensions), and one question on research timelines at the very end.

So, we already have fast retrieval, but it won’t be “cheap”.

YES. We definitely agree here.

Note on fast retrieval today: some miners may decide to offer fast retrieval on their own and might be able to get more reputation/clients for doing so.

I agree this is true, but it deepens the split between consensus mining and storage+retrieval mining: the reputation system becomes a blocker for a consensus miner considering sticking around for storage fees. If that happens, then the block reward is no longer subsidizing all storage, just storage of useless data (which is likely to be more common early in the network), because reputation systems add complexity to the cryptoeconomics: miners stop being identical/interchangeable.

Note on other proposals: we also had other proposals in the past that can guarantee retrieval (ideally economics will play out for fast retrieval): https://github.com/filecoin-project/research/blob/master/research-notes/2018-05-jury-trial-data-availability.md.

Yes, this is exactly the goal I wanted to explore. (Sorry, I didn't realize/see that earlier.) I think this issue is trying to solve a harder problem: preventing malicious miners from having any advantage. From a holistic perspective, I think that low-cost retrieval is worth sacrificing some of the space or security of the network.

Note on this proposal and why we did not move forward in the past: on your proposal, in the past we looked into having the miner do a proof of retrievability alongside the proof of replication. However, miners that store all 0s will gain a 1x advantage in storage, since they don't have to store the 0s, while someone who has data from the market will have to. This makes it less likely for a miner to store data from the storage market.

It's odd, but the economic calculations (see Ian's calculator) make it seem like the financial commitment to mining hardware is primarily computational rather than hard drives themselves. If that's the case, then it seems worth requiring a small added cost for honest miners in order to subsidize cheap retrieval within the protocol.

This seems to me like the first desirable improvement to look into once we get the basic proofs complete.

That's awesome! Glad to hear it. Out of curiosity, if product and bizdev said that cheaper retrieval was a blocker for mainnet, would it cause any delays? And, if so, how long?

sternhenri commented 4 years ago

Jumping in a tad late, but the following is concerning to me:

It's odd, but the economic calculations (see Ian's calculator) make it seem like the financial commitment to mining hardware is primarily computational rather than hard drives themselves.

Worth talking over (cc @whyrusleeping )

zixuanzh commented 4 years ago

Sorry for the late response here. Fast, cheap, and frequent retrieval is important, but the unique value prop of Filecoin, at least right now, is verifiable storage: any agent can prove to another agent that they have stored some files for some period of time. I will be more conservative when faster retrieval comes at the expense of verifiable storage and network security. The key to incentivizing fast retrieval is actually a reliable proof of retrieval/delivery or a verifiable retrieval market.

Assuming that we cannot prove reliable retrieval (I would like to dig deeper here), the bottom-line requirement for the retrieval market should be that data can eventually be retrieved when a reasonable price is paid. I also like what @porcuquine is proposing, but it might not fundamentally address the problem: users who pay a miner with smaller Ls still have no guarantee on whether that data can be retrieved. Nonetheless, it gives the protocol more information about the spectrum of miners, which is good.

Overall, we might be getting into a bigger class of problem related to price discovery and verifiable/enforceable markets here. On-chain prices/commitments/guarantees are only meaningful to the extent that they are enforceable or the punishment of not honoring them is real.

miyazono commented 4 years ago

Users who paid to a miner with smaller Ls still have no guarantee on whether that data can be retrieved.

@zixuanzh, the extension I'm claiming is that if retrieval takes little time and few computational resources, then we can make a rationality argument that miners will retrieve data for clients cheaply. The greater the cost of retrieval (or the opportunity cost of dedicating those resources to other purposes), the more expensive we should naturally expect retrieval to be, at a minimum.

I see your point about not wanting to sacrifice verifiable storage and network security. However, I think it should be pretty easy to quantify changes to the network security for the changes we're proposing.

fulian89 commented 4 years ago

I have an idea to settle the retrieval-efficiency problem. I agree that it should be the storage miner's duty to retrieve the source data and allow the customer (with or without the help of retrieval miners) to get it back. I also agree that the miner's efficiency in retrieving the source data really matters for the customer.

So, the core is to measure the time it takes a storage miner to unseal a sector (to recover the source data). How can we verify that? We can add a proof of source data (PoSd for short) in between PoSts. We can set a number K: on average, every K PoSt proofs, a PoSd is needed. The sector and challenge of the PoSd should be (pseudo)random (the hash of the PoSt is one choice for determining whether the sector needs a PoSd and which node to challenge). And the PoSd should be delivered within a promised time bound.
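A toy Python sketch of the sampling rule described above (all names hypothetical): use the PoSt proof hash as pseudorandomness so that roughly one in every K PoSt proofs triggers a PoSd, and derive the challenged node from the same hash.

```python
import hashlib

def needs_posd(post_proof_hash: bytes, k: int) -> bool:
    """Decide whether this PoSt window also requires a PoSd.

    Uses the PoSt proof hash as pseudorandomness, so on average one in
    every k PoSt proofs triggers a proof over the unsealed source data.
    Illustrative sketch, not a specified protocol rule.
    """
    sample = int.from_bytes(hashlib.sha256(post_proof_hash).digest()[:8], "big")
    return sample % k == 0

def posd_challenge_node(post_proof_hash: bytes, num_nodes: int) -> int:
    """Derive which source-data node to challenge from the same hash."""
    d = hashlib.sha256(b"node" + post_proof_hash).digest()
    return int.from_bytes(d[:8], "big") % num_nodes
```

Because both decisions are derived from the already-published PoSt hash, the miner cannot predict them before committing the proof, and anyone can verify them afterward.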

Besides giving a proof of retrieval efficiency, the PoSd also gives a proof of complete retrievability of the source data. That is really important: even if the storage miner is dishonest about a single node in some layer, the source data can hardly be retrieved from the replica. PoRep and PoSt can do nothing about this, while PoSd can.

Using trivial methods, PoSd doesn't add storage space (so it does not affect the consensus of Filecoin). If customers retrieve the data frequently and demand high efficiency, the storage miner may choose to store the source data directly (since retrieving it from the replica efficiently enough is impossible or too expensive). The final solution is to add a retrieval-efficiency parameter to the order (which will influence the price) and add PoSd.

Also, after adding PoSd, rational miners have less motivation to cheat while sealing a sector, so the number of challenges per layer can be smaller ([on the number of challenges per layer](https://github.com/filecoin-project/specs/issues/474)). A bigger sector size may also be a better choice for Filecoin if we can release a (partly retrieved) sector to form a new one ([on the size of a sector](https://github.com/filecoin-project/specs/issues/475)).