filecoin-project / specs

The Filecoin protocol specification
https://spec.filecoin.io

Seal Expansion #433

Open miyazono opened 5 years ago

miyazono commented 5 years ago

I'm including some notes below from the email thread on Seal Expansion. This is related to this issue around reducing the marginal cost of retrieval.

Original Problem:

It was my impression that the retrieval market was likely to be slow because replica generation was going to be symmetric (to combat SealStack), and therefore data extraction from the replica would also be slow. However, I have since been told that

The Proposal

Approximately as emailed by @ianjdarrow:

tl;dr let's discuss the practicality of Evan's seal expansion proposal for Filecoin to potentially solve some hard product tradeoffs!

Hello, research team!

I've chatted with several of you about an alternative mechanism Evan devised ("seal expansion") to allow for fast unsealing while still substantially mitigating sealstack. This would allow us to reduce retrieval times from ~very slow to "as fast as we can make it," but at the cost of storage requiring more hard drive space and perhaps altering some cryptoeconomic incentives.

The gist of the proposal is that we add an "expansion factor" to PoRep that results in sealed copies being somewhat larger than the underlying files. This takes the potential benefit of seal stacking from ~infinite to x/(x-1), where x = expansion factor. This means miners will sometimes (maybe always! maybe not...) rationally choose to store real files for storage fees instead of stacking, and we can model whether we expect that to happen under real-world circumstances, which is cool.

I made a calculator where you can play around with different expansion factors and other assumptions to guess where the break-even point between stacking and storing real files would be. It lives here: https://exp.iandarrow.com. The underlying scratchpad is here (might not be 100% up to date, but is close).

The idea of faster retrieval has extremely enthusiastic support from the Product team – slow retrieval is one of their biggest concerns. I believe DL is still thinking through the proposal, but is optimistic that it could also be better from a cryptoecon standpoint.

I have no doubt that there are superior solutions to this problem, but this one has the nice traits of (1) being known today and (2) maybe not being super difficult to implement.

Given that there are big product and potential cryptoecon advantages, it seems worth discussing:

  • whether seal expansion can work from a proofs/cryptolab/protocol design perspective
  • how much work would be involved to actually use seal expansion instead of the existing unsealing mechanism

Hoping we can pull folks together to discuss – maybe next week? Let me know if any objections; if not, I'll get something on the calendar.

(Footnote: I acknowledge that I'm not a researcher and lots of smart people have put lots of thought into this problem. I come in the spirit of humility and learning!)

Cheers, Ian

I'm going to start summarizing here, given how long this already is. If anyone feels like I've misrepresented anything you've said, please edit or add to this issue. If anyone else would like to know more information about what was said, please feel free to comment.

Brief Comment from @sternhenri

  • positive feedback generally, cryptoeconomically
  • thinks the x/(x-1) reading is a bit pessimistic (fixed sector sizes constrain adversarial power and limit how much you can abuse stacks).
  • can't speak to the feasibility/difficulty of implementation
  • Welcomes a quick brainstorm around it

Additional information and Concerns from @nicola

Additional comments

@jbenet commented

Hey - great idea! 🌟

  • reading this was a great research rollercoaster 🎢
  • expansion in storage we should do!
  • (Though 1x of this expansion factor harms “rational post” argument (still fine, just note it’s an added cost there))
  • exponential expansion would be ideal... but I think you can’t do that...
  • I believe the blow-up factor is proportional to the gains from doing seal stack. E.g. even with a 2x blowup (i.e. the replica is 2x larger) and one iteration, the attacker gets away with proving 1.5y virtual storage for y physical storage (this is what Nicola is getting at with sealing the halves, I think)
  • this gets safer with much higher blowup (eg 5x or 10x), and that may be ok because...
  • 💥 idea: buyer choice is the name of the game (market). Maybe make it so you can pay proportionally more to store it in a “fast to unseal” replica. you’re paying for the extra cost of storage, and get to unseal fast. Slow unseal is fine for lots of archival data. Fast unseal comes at a higher cost. (We were already planning to do this via retrieval miners keeping unsealed copies around)
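The arithmetic behind the 1.5y example and the x/(x-1) bound can be sketched in a few lines of Python. This is a hypothetical helper for exploring the numbers, not protocol code; the parametrization (data of size 1, counting every intermediate replica as provable virtual storage) is an assumption made for illustration:

```python
# Virtual-vs-physical storage gain from seal stacking under an
# expansion factor x. After k extra sealing passes the attacker stores
# only the final replica (size x**(k+1)) but can answer proofs for
# every intermediate replica (sizes x, x**2, ..., x**(k+1)).

def stacking_ratio(x: float, stacking_iterations: int) -> float:
    """Virtual/physical storage ratio after repeatedly resealing the
    previous replica, assuming unit-size underlying data."""
    seals = stacking_iterations + 1
    physical = x ** seals
    virtual = sum(x ** i for i in range(1, seals + 1))
    return virtual / physical

# 2x blowup, one iteration: 1.5y virtual storage for y physical storage.
print(stacking_ratio(2, 1))   # 1.5
# As iterations grow, the ratio approaches the x/(x-1) bound (2.0 here).
print(stacking_ratio(2, 50))
# Higher blowup is safer: with x = 5 the bound is only 1.25.
print(stacking_ratio(5, 20))
```

The geometric-series limit is why higher blowup factors (5x, 10x) shrink the attacker's headroom so quickly.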

@porcuquine's comments

Copied for simplicity, since there's too much to summarize.

Before reading this I had conversations with both Ian and Evan about retrieval in general. Ian asked me to write some of them down in this thread. I'm going to paste in some notes below. They don't really address expansion directly, but they do provide some more thoughts about retrieval (and perhaps some more detailed information for those not familiar). In any case, my intent isn't to derail, just to provide more thoughts about the considerations of fast retrieval. I think it's very important, and it's one of the factors motivating multiple sector sizes. The idea is that we need to provide enough flexibility to allow miners and clients to negotiate good outcomes across the natural range of tradeoffs.

The main new (to me) idea presented below is the idea that partially-extracted caches can co-exist with Rational PoSt to allow miners more economic flexibility to choose fast retrieval if they are rewarded for providing it. The rest is mostly review of techniques for fast retrieval.

With apologies for a crude and hasty composition — wanted to just get those ideas written down while I have time.

Regards, @porcuquine

cont.

I'm going to dive a little more into the partial caching idea, because I may have left too much to the imagination.

The idea is this: in general, miners are allowed to delete a certain amount of data and reseal it just in time. Because of the polling time, this is limited. The first suggestion is that miners who prioritize retrieval take advantage of this opportunity to keep all of their fast-retrieval sectors in a partially-extracted state. Then, when any of those sectors is challenged, they simply complete the partial replication in time to meet the challenge.

But we want to do better than that because we assume someone is motivated to provide fastest possible retrieval of some sectors. In that case, for these fastest-retrieval sectors, the miners could keep the sectors in an even-more-extracted state. It might be the case that the rewards are such that they can tolerate an occasional single fault (since they will always be able to recover the data for the next proving period). However, even if that is true they would need to restrict this to a small portion of their sectors to avoid this happening often.

Therefore, they could also keep a portion of their sectors in some faster-retrieval state, then perform just enough extraction during each proving period to ensure that they will be able to answer any challenge. Then, once the challenges have been received, they can return the sector to its more-extracted state to serve any requests that might come in.

What is the benefit of doing this vs just waiting until a request comes in and throwing more parallelism at extraction? Maybe none, in some cases. However, there are problems with maximizing parallelism. First, when the required parallelism becomes too large, it complicates the problem itself. Second, in order to serve many high-parallelism requests at once, a very high total number of cores needs to be available. Therefore, a miner who wants to provide high-availability of many fast-retrieval sectors either needs a very large amount of on-demand compute, or needs to find a way to spread the work out over more time (so less intense bursting is required).

Partially-extracted caching is a way of addressing this problem. Note that this is completely flexible. Any sector can be kept in and transitioned to any percentage extraction at any time. The more slowly these transitions happen, the lower the overall hardware requirements are. Just as true retrieval miners have opportunities to be very clever in their algorithms for anticipating demand, so do storage miners using this strategy.
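A toy model makes the burst-vs-spread tradeoff concrete. All figures below (per-sector CPU cost, proving-period length, cache fraction) are illustrative assumptions, not protocol parameters:

```python
# Toy model of the tradeoff described above: bursting extraction on
# demand vs. spreading it across a proving period via partial caching.

SECTOR_CPU_HOURS = 100.0  # assumed CPU-hours to fully extract one sector
PROVING_PERIOD_H = 24.0   # assumed proving-period length, in hours

def cores_needed(remaining_fraction: float, deadline_hours: float) -> float:
    """Cores required to finish extracting the remaining fraction of a
    sector before the deadline, assuming perfect parallel speedup."""
    return SECTOR_CPU_HOURS * remaining_fraction / deadline_hours

# Bursting: a cold sector must be fully extracted within one hour.
burst_cores = cores_needed(1.0, 1.0)

# Spreading: keep the sector 90% extracted and top up the remaining 10%
# over the whole proving period, serving requests from the cached state.
steady_cores = cores_needed(0.1, PROVING_PERIOD_H)

print(f"burst: {burst_cores:.0f} cores; steady: {steady_cores:.2f} cores")
```

Under these made-up numbers the bursting miner needs roughly 100 cores on demand while the caching miner needs well under one core of sustained work, which is the "less intense bursting" point above.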

This possibility suggests that the most successful players will be those with robust information about what they are storing so they can make good guesses about what pieces should be in what state. This extends to knowledge at packing time. These considerations may work against our preference or assumptions for how the client-miner relationship functions. I'm not sure. I do think this presents a strong argument that the most successful operators will be those who control entities at every layer of the stack. But with good metadata, reputation, and granular contracts, it should be possible for the ecosystem to incentivize and permit good decisions at each layer.

Example: a popular celeb tweets just so, and the year 1987 is trending. Since the savvy miner is always rebalancing his sector retrieval profile anyway, he makes use of this information and ensures that his pop data nearest to the celeb in the content graph is most available. Likewise, true retrieval miners begin pre-emptively caching similar material. In the spot-pricing battle for individual pieces (or complete 'catalogs'), the miners who have optimized best win and reap the rewards. (This is reminiscent of high-frequency trading, and I think if there is true demand for fast retrieval and a deep/broad selection of content over time, then we should expect such behavior to emerge. I realize this is not a rigorous argument.)

Regards, @porcuquine

pooja commented 5 years ago

Can someone explain why data extraction will be slow under almost all circumstances?

miyazono commented 5 years ago

Here's my understanding, @pooja, which may be outdated, but hopefully I can save some FIL researchers time so they only give minor corrections. Also, apologies if I'm not providing additional information; I've lost track of who knows what.

To go from a sector to a replica, there's a set of interdependent computations that have to be done (typically represented as a DAG, with raw data on one side and the replica on the other; edges are computations and nodes are intermediate results). The computations and intermediate results were initially intended to be very interdependent in the generation of a replica (to make generating a replica slow), and intended to be highly parallelizable for extraction. However, the computations (which are reversible) still have to be undone to extract. That's what porcu means when they wrote:

there are problems with maximizing parallelism. First, when the required parallelism becomes too large, it complicates the problem itself. Second, in order to serve many high-parallelism requests at once, a very high total number of cores needs to be available. Therefore, a miner who wants to provide high-availability of many fast-retrieval sectors either needs a very large amount of on-demand compute, or needs to find a way to spread the work out over more time (so less intense bursting is required).

Specifically, extracting data from a replica will likely be slow unless miners do things like set up their compute capabilities for highly parallelized extraction. It may be simpler to do things like cache partially-extracted data. Either of these would make extraction faster (but not necessarily cheaper, since it would take additional hardware specifically and solely tasked with making retrieval mining faster).
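The "slow sequential sealing, parallel extraction" shape can be illustrated with a toy chain encoding. This is emphatically not the real PoRep construction, just a minimal sketch of why generating a replica is inherently sequential while undoing it parallelizes:

```python
# Toy illustration (not the real PoRep): each replica block depends on
# the previous one, so *encoding* must run in order, but *decoding* a
# block needs only its two neighbours in the replica, so extraction can
# be split across cores.

def enc(prev: int, data: int) -> int:
    return (prev + data) % 256   # stand-in for a slow sealing step

def dec(prev: int, repl: int) -> int:
    return (repl - prev) % 256   # its inverse

def seal(data: list[int]) -> list[int]:
    replica, prev = [], 0
    for d in data:               # must run in order: r[i] needs r[i-1]
        prev = enc(prev, d)
        replica.append(prev)
    return replica

def unseal(replica: list[int]) -> list[int]:
    # each output depends only on replica[i-1] and replica[i], so this
    # loop could be farmed out across any number of cores
    return [dec(replica[i - 1] if i else 0, replica[i])
            for i in range(len(replica))]

data = [7, 42, 99, 3]
assert unseal(seal(data)) == data
```

In the real construction the dependency DAG is far richer, but the asymmetry it buys is the same one being discussed here.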

porcuquine commented 5 years ago

Can someone explain why data extraction will be slow under almost all circumstances?

I don't know how you reach that conclusion. The problem is probably that we're using words like 'slow' and 'fast' as though they have absolute meaning, whereas in reality they only mean something relative to something else, and in a specific context. Presumably, it's this mismatch which makes it possible for you to ask a question whose premise I disagree with so completely. That is, it's not a given that data extraction will be slow under almost all circumstances. I mention this not to be argumentative but to point out that productive discussion probably requires that we stop talking about 'fast' and 'slow' retrieval without a lot more qualification. Otherwise, the discussion starts to take on a pro-choice/pro-life quality which I think increases noise. (I'm not pointing fingers at anyone here: noting a latent pattern in the discussion.)

pooja commented 5 years ago

Thanks @miyazono.

@porcuquine I was asking because miyazono's original issue starts with:

Data extraction is likely to be slow/costly under most (but not necessarily all) circumstances. As a result, Seal Expansion as a proposal is less useful, since it was intended to mitigate SealStack, assuming a fast extraction was possible. However, for the sake of clarity and openness, I'm adding details of the Seal Expansion proposal below with some of the discussion that took place.

pooja commented 5 years ago

To make this discussion less noisy, can you quantify (rough order of magnitude) speeds of extraction?

nicola commented 5 years ago

Unsealing costs the same as sealing in terms of CPU cost, but it's parallelizable, so the time it takes to unseal should be roughly the sealing time divided by the number of cores.
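As a quick illustration of that rule of thumb (the 10-hour seal time below is an assumed figure for illustration, not a measurement):

```python
# Rule of thumb from the comment above: unseal time ~ seal time / cores,
# since unsealing has the same total CPU cost but parallelizes.
# The 10-hour seal time is an assumed figure, not a measured one.

def unseal_hours(seal_hours: float, cores: int) -> float:
    """Idealized unseal latency, assuming perfect parallel speedup."""
    return seal_hours / cores

print(unseal_hours(10.0, 1))   # single core: same as sealing time
print(unseal_hours(10.0, 16))  # 16 cores: 0.625 hours (~37.5 minutes)
```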

porcuquine commented 5 years ago

@porcuquine I was asking because miyazono's original issue starts with:

Data extraction is likely to be slow/costly under most (but not necessarily all) circumstances. As a result, Seal Expansion as a proposal is less useful, since it was intended to mitigate SealStack, assuming a fast extraction was possible. However, for the sake of clarity and openness, I'm adding details of the Seal Expansion proposal below with some of the discussion that took place.

Got it. I tried to address this in my comments above. Specifically, there are many cases where extraction need not be slow. At the end of the day, extraction speed is bounded only by a combination of CPU and block time, as far as I know. The more CPU, the faster you can go, with no limit except parallelization overhead. The smaller the sector, the less parallelism is required to achieve a given extraction speed, but there is a lower bound on sector size determined by the block time.
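A sketch of that bound, with an assumed per-GiB extraction cost (illustrative only, not a benchmark), showing how required parallelism scales with sector size and target latency:

```python
# Sketch of the bound described above: for a fixed target latency, the
# parallelism required scales linearly with sector size. The per-GiB
# extraction cost is an assumed illustrative figure, not a measurement.

CPU_SECONDS_PER_GIB = 3600.0  # assumed sequential extraction cost per GiB

def cores_for_latency(sector_gib: float, target_seconds: float) -> float:
    """Cores needed to extract a full sector within the target latency,
    assuming perfect parallel speedup."""
    return sector_gib * CPU_SECONDS_PER_GIB / target_seconds

# Smaller sectors need proportionally less parallelism for the same target:
print(cores_for_latency(32.0, 30.0))  # 3840 cores
print(cores_for_latency(1.0, 30.0))   # 120 cores
print(cores_for_latency(1.0, 3600.0)) # 1 core, if an hour is acceptable
```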