filecoin-project / specs

The Filecoin protocol specification
https://spec.filecoin.io

Seal Expansion #433

Open miyazono opened 5 years ago

miyazono commented 5 years ago

I'm including some notes below from the email thread on Seal Expansion. This is related to this issue around reducing the marginal cost of retrieval.

Original Problem:

It was my impression that the retrieval market was likely to be slow because replica generation was going to be symmetric (to combat SealStack), and therefore data extraction from the replica would also be slow. However, I have since been told that

The Proposal

Approximately as emailed by @ianjdarrow:

tl;dr let's discuss the practicality of Evan's seal expansion proposal for Filecoin to potentially solve some hard product tradeoffs!

Hello, research team!

I've chatted with several of you about an alternative mechanism Evan devised ("seal expansion") to allow for fast unsealing while still substantially mitigating sealstack. This would allow us to reduce retrieval times from ~very slow to "as fast as we can make it," but at the cost of storage requiring more hard drive space and perhaps altering some cryptoeconomic incentives.

The gist of the proposal is that we add an "expansion factor" to PoRep that results in sealed copies being somewhat larger than the underlying files. This takes the potential benefit of seal stacking from ~infinite to x/(x-1), where x = expansion factor. This means miners will sometimes (maybe always! maybe not...) rationally choose to store real files for storage fees instead of stacking, and we can model whether we expect that to happen under real-world circumstances, which is cool.

I made a calculator where you can play around with different expansion factors and other assumptions to guess where the break-even point between stacking and storing real files would be. It lives here: https://exp.iandarrow.com. The underlying scratchpad is here (might not be 100% up to date, but is close).

The idea of faster retrieval has extremely enthusiastic support from the Product team – slow retrieval is one of their biggest concerns. I believe DL is still thinking through the proposal, but is optimistic that it could also be better from a cryptoecon standpoint.

I have no doubt that there are superior solutions to this problem, but this one has the nice traits of (1) being known today and (2) maybe not being super difficult to implement.

Given that there are big product and potential cryptoecon advantages, it seems worth discussing:

  • whether seal expansion can work from a proofs/cryptolab/protocol design perspective
  • how much work would be involved to actually use seal expansion instead of the existing unsealing mechanism

Hoping we can pull folks together to discuss – maybe next week? Let me know if any objections; if not, I'll get something on the calendar.

(Footnote: I acknowledge that I'm not a researcher and lots of smart people have put lots of thought into this problem. I come in the spirit of humility and learning!)

Cheers, Ian

I'm going to start summarizing here, given how long this already is. If anyone feels like I've misrepresented anything you've said, please edit or add to this issue. If anyone else would like to know more information about what was said, please feel free to comment.

Brief Comment from @sternhenri

  • positive feedback generally, cryptoeconomically
  • thinks the x/(x-1) reading is a bit pessimistic (fixed sector sizes constrain adversarial power and limit how much you can abuse stacks).
  • can't speak to the feasibility/difficulty of implementation
  • Welcomes a quick brainstorm around it

Additional information and Concerns from @nicola

Additional comments

@jbenet commented

Hey - great idea! 🌟

  • reading this was a great research rollercoaster 🎢
  • expansion in storage we should do!
  • (Though 1x of this expansion factor harms “rational post” argument (still fine, just note it’s an added cost there))
  • exponential expansion would be ideal... but I think you can’t do that...
  • I believe the blow-up factor is proportional to the gains from doing seal stack. E.g. even with a 2x blowup (i.e. the replica is 2x larger) and one iteration, the attacker gets away with proving 1.5y virtual storage for y physical storage (this is what Nicola is getting at with sealing the halves, I think)
  • this gets safer with much higher blowup (eg 5x or 10x), and that may be ok because...
  • 💥 idea: buyer choice is the name of the game (market). Maybe make it so you can pay proportionally more to store it in a “fast to unseal” replica. you’re paying for the extra cost of storage, and get to unseal fast. Slow unseal is fine for lots of archival data. Fast unseal comes at a higher cost. (We were already planning to do this via retrieval miners keeping unsealed copies around)
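The arithmetic behind the 1.5y example and the x/(x-1) bound can be sketched in a few lines of Python. This is a hypothetical helper for exploring the numbers, not protocol code; the parametrization (data of size 1, counting every intermediate replica as provable virtual storage) is an assumption made for illustration:

```python
# Virtual-vs-physical storage gain from seal stacking under an
# expansion factor x. After k extra sealing passes the attacker stores
# only the final replica (size x**(k+1)) but can answer proofs for
# every intermediate replica (sizes x, x**2, ..., x**(k+1)).

def stacking_ratio(x: float, stacking_iterations: int) -> float:
    """Virtual/physical storage ratio after repeatedly resealing the
    previous replica, assuming unit-size underlying data."""
    seals = stacking_iterations + 1
    physical = x ** seals
    virtual = sum(x ** i for i in range(1, seals + 1))
    return virtual / physical

# 2x blowup, one iteration: 1.5y virtual storage for y physical storage.
print(stacking_ratio(2, 1))   # 1.5
# As iterations grow, the ratio approaches the x/(x-1) bound (2.0 here).
print(stacking_ratio(2, 50))
# Higher blowup is safer: with x = 5 the bound is only 1.25.
print(stacking_ratio(5, 20))
```

The geometric-series limit is why higher blowup factors (5x, 10x) shrink the attacker's headroom so quickly.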

@porcuquine's comments

Copied for simplicity, since there's too much to summarize.

Before reading this I had conversations with both Ian and Evan about retrieval in general. Ian asked me to write some of them down in this thread. I'm going to paste in some notes below. They don't really address expansion directly, but they do provide some more thoughts about retrieval (and perhaps some more detailed information for those not familiar). In any case, my intent isn't to derail, just to provide more thoughts about the considerations of fast retrieval. I think it's very important, and it's one of the factors motivating multiple sector sizes. The idea is that we need to provide enough flexibility to allow miners and clients to negotiate good outcomes across the natural range of tradeoffs.

The main new (to me) idea presented below is the idea that partially-extracted caches can co-exist with Rational PoSt to allow miners more economic flexibility to choose fast retrieval if they are rewarded for providing it. The rest is mostly review of techniques for fast retrieval.

With apologies for a crude and hasty composition — wanted to just get those ideas written down while I have time.

Regards, @porcuquine

cont.

I'm going to dive a little more into the partial caching idea, because I may have left too much to the imagination.

The idea is this: in general, miners are allowed to delete a certain amount of data and reseal it just in time. Because of the polling time, this is limited. The first suggestion is that miners who prioritize retrieval take advantage of this opportunity to keep all of their fast-retrieval sectors in a partially-extracted state. Then, when any of those sectors is challenged, they simply complete the partial replication in time to meet the challenge.

But we want to do better than that because we assume someone is motivated to provide fastest possible retrieval of some sectors. In that case, for these fastest-retrieval sectors, the miners could keep the sectors in an even-more-extracted state. It might be the case that the rewards are such that they can tolerate an occasional single fault (since they will always be able to recover the data for the next proving period). However, even if that is true they would need to restrict this to a small portion of their sectors to avoid this happening often.

Therefore, they could also keep a portion of their sectors in some faster-retrieval state, then perform just enough extraction during each proving period to ensure that they will be able to answer any challenge. Then, once the challenges have been received, they can return the sector to its more-extracted state to serve any requests that might come in.

What is the benefit of doing this vs just waiting until a request comes in and throwing more parallelism at extraction? Maybe none, in some cases. However, there are problems with maximizing parallelism. First, when the required parallelism becomes too large, it complicates the problem itself. Second, in order to serve many high-parallelism requests at once, a very high total number of cores needs to be available. Therefore, a miner who wants to provide high-availability of many fast-retrieval sectors either needs a very large amount of on-demand compute, or needs to find a way to spread the work out over more time (so less intense bursting is required).

Partially-extracted caching is a way of addressing this problem. Note that this is completely flexible. Any sector can be kept in and transitioned to any percentage extraction at any time. The more slowly these transitions happen, the lower the overall hardware requirements are. Just as true retrieval miners have opportunities to be very clever in their algorithms for anticipating demand, so do storage miners using this strategy.
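A toy model makes the burst-vs-spread tradeoff concrete. All figures below (per-sector CPU cost, proving-period length, cache fraction) are illustrative assumptions, not protocol parameters:

```python
# Toy model of the tradeoff described above: bursting extraction on
# demand vs. spreading it across a proving period via partial caching.

SECTOR_CPU_HOURS = 100.0  # assumed CPU-hours to fully extract one sector
PROVING_PERIOD_H = 24.0   # assumed proving-period length, in hours

def cores_needed(remaining_fraction: float, deadline_hours: float) -> float:
    """Cores required to finish extracting the remaining fraction of a
    sector before the deadline, assuming perfect parallel speedup."""
    return SECTOR_CPU_HOURS * remaining_fraction / deadline_hours

# Bursting: a cold sector must be fully extracted within one hour.
burst_cores = cores_needed(1.0, 1.0)

# Spreading: keep the sector 90% extracted and top up the remaining 10%
# over the whole proving period, serving requests from the cached state.
steady_cores = cores_needed(0.1, PROVING_PERIOD_H)

print(f"burst: {burst_cores:.0f} cores; steady: {steady_cores:.2f} cores")
```

Under these made-up numbers the bursting miner needs roughly 100 cores on demand while the caching miner needs well under one core of sustained work, which is the "less intense bursting" point above.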

This possibility suggests that the most successful players will be those with robust information about what they are storing so they can make good guesses about what pieces should be in what state. This extends to knowledge at packing time. These considerations may work against our preference or assumptions for how the client-miner relationship functions. I'm not sure. I do think this presents a strong argument that the most successful operators will be those who control entities at every layer of the stack. But with good metadata, reputation, and granular contracts, it should be possible for the ecosystem to incentivize and permit good decisions at each layer.

Example: a popular celeb tweets just so, and the year 1987 is trending. Since the savvy miner is always rebalancing his sector retrieval profile anyway, he makes use of this information and ensures that his pop data nearest to the celeb in the content graph is most available. Likewise, true retrieval miners begin pre-emptively caching similar material. In the spot-pricing battle for individual pieces (or complete 'catalogs'), the miners who have optimized best win and reap the rewards. (This is reminiscent of high-frequency trading, and I think if there is true demand for fast retrieval and a deep/broad selection of content over time, then we should expect such behavior to emerge. I realize this is not a rigorous argument.)

Regards, @porcuquine

pooja commented 5 years ago

Can someone explain why data extraction will be slow under almost all circumstances?

miyazono commented 5 years ago

Here's my understanding, @pooja, which may be outdated, but hopefully I can save some FIL researchers time so they only give minor corrections. Also, apologies if I'm not providing additional information; I've lost track of who knows what.

To go from a sector to a replica, there's a set of interdependent computations that have to be done (typically represented as a DAG, with raw data on one side and the replica on the other; edges are computations and nodes are intermediate results). The computations and intermediate results were initially intended to be very interdependent in the generation of a replica (to make generating a replica slow), and intended to be highly parallelizable for extraction. However, the computations (which are reversible) still have to be undone to extract. That's what porcu means when they wrote:

there are problems with maximizing parallelism. First, when the required parallelism becomes too large, it complicates the problem itself. Second, in order to serve many high-parallelism requests at once, a very high total number of cores needs to be available. Therefore, a miner who wants to provide high-availability of many fast-retrieval sectors either needs a very large amount of on-demand compute, or needs to find a way to spread the work out over more time (so less intense bursting is required).

Specifically, extracting data from a replica will likely be slow unless miners do things like set up their compute capabilities for highly parallelized extraction. It may be simpler to do things like cache partially-extracted data. Either of these would make extraction faster (but not necessarily cheaper, since it would take additional hardware specifically and solely tasked with making retrieval mining faster).
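The "slow sequential sealing, parallel extraction" shape can be illustrated with a toy chain encoding. This is emphatically not the real PoRep construction, just a minimal sketch of why generating a replica is inherently sequential while undoing it parallelizes:

```python
# Toy illustration (not the real PoRep): each replica block depends on
# the previous one, so *encoding* must run in order, but *decoding* a
# block needs only its two neighbours in the replica, so extraction can
# be split across cores.

def enc(prev: int, data: int) -> int:
    return (prev + data) % 256   # stand-in for a slow sealing step

def dec(prev: int, repl: int) -> int:
    return (repl - prev) % 256   # its inverse

def seal(data: list[int]) -> list[int]:
    replica, prev = [], 0
    for d in data:               # must run in order: r[i] needs r[i-1]
        prev = enc(prev, d)
        replica.append(prev)
    return replica

def unseal(replica: list[int]) -> list[int]:
    # each output depends only on replica[i-1] and replica[i], so this
    # loop could be farmed out across any number of cores
    return [dec(replica[i - 1] if i else 0, replica[i])
            for i in range(len(replica))]

data = [7, 42, 99, 3]
assert unseal(seal(data)) == data
```

In the real construction the dependency DAG is far richer, but the asymmetry it buys is the same one being discussed here.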

porcuquine commented 5 years ago

Can someone explain why data extraction will be slow under almost all circumstances?

I don't know how you reach that conclusion. The problem is probably that we're using words like 'slow' and 'fast' as though they have absolute meaning, whereas in reality they only mean something relative to something else, and in a specific context. Presumably, it's this mismatch which makes it possible for you to ask a question whose premise I disagree with so completely. That is, it's not a given that data extraction will be slow under almost all circumstances. I mention this not to be argumentative but to point out that productive discussion probably requires that we stop talking about 'fast' and 'slow' retrieval without a lot more qualification. Otherwise, the discussion starts to take on a pro-choice/pro-life quality which I think increases noise. (I'm not pointing fingers at anyone here: noting a latent pattern in the discussion.)

pooja commented 5 years ago

Thanks @miyazono.

@porcuquine I was asking because miyazono's original issue starts with:

Data extraction is likely to be slow/costly under most (but not necessarily all) circumstances. As a result, Seal Expansion as a proposal is less useful, since it was intended to mitigate SealStack, assuming a fast extraction was possible. However, for the sake of clarity and openness, I'm adding details of the Seal Expansion proposal below with some of the discussion that took place.

pooja commented 5 years ago

To make this discussion less noisy, can you quantify (rough order of magnitude) speeds of extraction?

nicola commented 5 years ago

Unsealing costs the same as sealing in terms of CPU cost, but it's parallelizable, so the time it takes to unseal should be roughly the sealing time divided by the number of cores.
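As a quick illustration of that rule of thumb (the 10-hour seal time below is an assumed figure for illustration, not a measurement):

```python
# Rule of thumb from the comment above: unseal time ~ seal time / cores,
# since unsealing has the same total CPU cost but parallelizes.
# The 10-hour seal time is an assumed figure, not a measured one.

def unseal_hours(seal_hours: float, cores: int) -> float:
    """Idealized unseal latency, assuming perfect parallel speedup."""
    return seal_hours / cores

print(unseal_hours(10.0, 1))   # single core: same as sealing time
print(unseal_hours(10.0, 16))  # 16 cores: 0.625 hours (~37.5 minutes)
```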

porcuquine commented 5 years ago

@porcuquine I was asking because miyazono's original issue starts with:

Data extraction is likely to be slow/costly under most (but not necessarily all) circumstances. As a result, Seal Expansion as a proposal is less useful, since it was intended to mitigate SealStack, assuming a fast extraction was possible. However, for the sake of clarity and openness, I'm adding details of the Seal Expansion proposal below with some of the discussion that took place.

Got it. I tried to address this in my comments above. Specifically, there are many cases where extraction need not be slow. At the end of the day, extraction speed is bounded only by a combination of CPU and block time, as far as I know. The more CPU, the faster you can go, with no limit except parallelization overhead. The smaller the sector, the less parallelism is required to achieve a given extraction speed, but there is a lower bound on sector size determined by the block time.
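A sketch of that bound, with an assumed per-GiB extraction cost (illustrative only, not a benchmark), showing how required parallelism scales with sector size and target latency:

```python
# Sketch of the bound described above: for a fixed target latency, the
# parallelism required scales linearly with sector size. The per-GiB
# extraction cost is an assumed illustrative figure, not a measurement.

CPU_SECONDS_PER_GIB = 3600.0  # assumed sequential extraction cost per GiB

def cores_for_latency(sector_gib: float, target_seconds: float) -> float:
    """Cores needed to extract a full sector within the target latency,
    assuming perfect parallel speedup."""
    return sector_gib * CPU_SECONDS_PER_GIB / target_seconds

# Smaller sectors need proportionally less parallelism for the same target:
print(cores_for_latency(32.0, 30.0))  # 3840 cores
print(cores_for_latency(1.0, 30.0))   # 120 cores
print(cores_for_latency(1.0, 3600.0)) # 1 core, if an hour is acceptable
```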