Open: miyazono opened this issue 5 years ago
Can someone explain why data extraction will be slow under almost all circumstances?
Here's my understanding, @pooja, which may be outdated, but hopefully I can save some FIL researchers time so they only need to give minor corrections. Also, apologies if I'm not providing any new information; I've lost track of who knows what.
To go from a sector to a replica, there's a set of interdependent computations that have to be done (typically represented as a DAG, with the raw data on one side and the replica on the other; edges are computations and nodes are intermediate results). The computations and intermediate results were initially intended to be highly interdependent during replica generation (to make generating a replica slow), and highly parallelizable for extraction. However, the computations (which are reversible) still have to be undone to extract. That's what @porcuquine means when they wrote:
there are problems with maximizing parallelism. First, when the required parallelism becomes too large, it complicates the problem itself. Second, in order to serve many high-parallelism requests at once, a very high total number of cores needs to be available. Therefore, a miner who wants to provide high-availability of many fast-retrieval sectors either needs a very large amount of on-demand compute, or needs to find a way to spread the work out over more time (so less intense bursting is required).
Specifically, extracting data from a replica will likely be slow unless miners do things like set up their compute capabilities for highly parallelized extraction. It may be simpler to do things like cache partially-extracted data. Either of these would make extraction faster (but not necessarily cheaper, since it would take additional hardware specifically and solely tasked with making retrieval mining faster).
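To make the DAG picture above a bit more concrete, here is a toy sketch in Go of the asymmetry being described. It is not the actual Filecoin construction: the parent selection, hash function, node size, and single-layer structure are all simplified placeholders. The point is only that encoding each node against its already-encoded parents forces sealing to proceed in order, while decoding reads nothing but the finished replica and therefore splits cleanly across cores.

```go
// Toy single-layer, DRG-style encode/decode sketch. This is NOT the real
// Filecoin construction: the parent selection (just the previous node), the
// hash, and the 32-byte node size are placeholders chosen to keep the
// example small. It only illustrates the asymmetry discussed above.
package toyreplica

import (
	"crypto/sha256"
	"encoding/binary"
	"sync"
)

// parents returns the (hypothetical) DRG parents of node i; here simply the
// previous node, so the dependency chain is as plain as possible.
func parents(i int) []int {
	if i == 0 {
		return nil
	}
	return []int{i - 1}
}

// key derives node i's encoding key from the replica ID and the *encoded*
// values of its parents.
func key(replicaID []byte, i int, replica [][32]byte) [32]byte {
	h := sha256.New()
	h.Write(replicaID)
	var idx [8]byte
	binary.LittleEndian.PutUint64(idx[:], uint64(i))
	h.Write(idx[:])
	for _, p := range parents(i) {
		h.Write(replica[p][:])
	}
	var k [32]byte
	copy(k[:], h.Sum(nil))
	return k
}

func xor(a, b [32]byte) [32]byte {
	var out [32]byte
	for j := range a {
		out[j] = a[j] ^ b[j]
	}
	return out
}

// Encode must proceed in order: node i's key depends on its parents' already
// encoded values, so sealing is inherently sequential in this toy model.
func Encode(replicaID []byte, data [][32]byte) [][32]byte {
	replica := make([][32]byte, len(data))
	for i := range data {
		replica[i] = xor(data[i], key(replicaID, i, replica))
	}
	return replica
}

// Decode reads only the finished replica, so every node can be undone
// independently and the work spreads across as many cores as are available.
func Decode(replicaID []byte, replica [][32]byte) [][32]byte {
	data := make([][32]byte, len(replica))
	var wg sync.WaitGroup
	for i := range replica {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			data[i] = xor(replica[i], key(replicaID, i, replica))
		}(i)
	}
	wg.Wait()
	return data
}
```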
Can someone explain why data extraction will be slow under almost all circumstances?
I don't know how you reached that conclusion. The problem is probably that we're using words like 'slow' and 'fast' as though they have absolute meaning, whereas in reality they only mean something relative to something else, and in a specific context. Presumably, it's this mismatch that makes it possible for you to ask a question whose premise I disagree with so completely. That is, it's not a given that data extraction will be slow under almost all circumstances. I mention this not to be argumentative but to point out that productive discussion probably requires that we stop talking about 'fast' and 'slow' retrieval without a lot more qualification. Otherwise, the discussion starts to take on a pro-choice/pro-life quality, which I think increases noise. (I'm not pointing fingers at anyone here: just noting a latent pattern in the discussion.)
Thanks @miyazono.
@porcuquine I was asking because miyazono's original issue starts with:
Data extraction is likely to be slow/costly under most (but not necessarily all) circumstances. As a result, Seal Expansion as a proposal is less useful, since it was intended to mitigate SealStack, assuming a fast extraction was possible. However, for the sake of clarity and openness, I'm adding details of the Seal Expansion proposal below with some of the discussion that took place.
To make this discussion less noisy, can you quantify (rough order of magnitude) speeds of extraction?
Unsealing costs the same as sealing in terms of CPU cost, but it's parallelizable, so the time it takes to unseal should be roughly 1/(number of cores) of the sealing time.
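As a sanity check of that relationship, here is a minimal sketch in Go. It assumes the total CPU work is fixed and parallelizes perfectly, and it ignores I/O, memory bandwidth, and parallelization overhead, all of which will eat into the speedup in practice.

```go
// Sketch of the relationship stated above: same total CPU work, split
// evenly across cores (idealized; ignores overhead and I/O).
package unsealtoy

import "time"

// UnsealTime estimates wall-clock unseal time from single-core seal time.
func UnsealTime(sealTime time.Duration, cores int) time.Duration {
	if cores < 1 {
		cores = 1
	}
	return sealTime / time.Duration(cores)
}

// Example: if sealing takes 3h of single-core work, UnsealTime(3h, 16)
// gives roughly 11m15s under these (idealized) assumptions.
```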
@porcuquine I was asking because miyazono's original issue starts with:
Data extraction is likely to be slow/costly under most (but not necessarily all) circumstances. As a result, Seal Expansion as a proposal is less useful, since it was intended to mitigate SealStack, assuming a fast extraction was possible. However, for the sake of clarity and openness, I'm adding details of the Seal Expansion proposal below with some of the discussion that took place.
Got it. I tried to address this in my comments above. Specifically, there are many cases where extraction need not be slow. At the end of the day, extraction speed is bounded only by a combination of CPU and block time, as far as I know. The more CPU, the faster you can go, with no limit except parallelization overhead. The smaller the sector, the less parallelism required to achieve a given extraction speed, though there is a lower bound on sector size determined by the block time.
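A back-of-envelope sketch of those two bounds, with every number a made-up placeholder rather than a protocol parameter: the required parallelism scales with sector size and shrinks as the acceptable extraction time grows, which is the tradeoff described above.

```go
// Back-of-envelope for the bounds mentioned above. All parameters are
// hypothetical placeholders, not protocol values.
package extractbounds

// CoresNeeded estimates the cores required to extract a sector of sectorGiB
// within targetHours, assuming extraction costs cpuHoursPerGiB of CPU work
// per GiB and parallelizes cleanly.
func CoresNeeded(sectorGiB, cpuHoursPerGiB, targetHours float64) float64 {
	return sectorGiB * cpuHoursPerGiB / targetHours
}

// Example: CoresNeeded(32, 0.5, 0.25) = 64, i.e. a hypothetical 32 GiB
// sector at 0.5 CPU-hours/GiB, served within 15 minutes, needs ~64 cores;
// a smaller sector at the same target needs proportionally fewer.
```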
I'm including some notes below from the email thread on Seal Expansion. This is related to the issue of reducing the marginal cost of retrieval.
Original Problem:
It was my impression that the retrieval market was likely to be slow because replica generation was going to be symmetric to combat SealStack, and therefore data extraction from the replica would also be slow. However, I have since been told that unsealing is parallelizable, so extraction need not be slow in every case.
The Proposal
Approximately as emailed by @ianjdarrow:
I'm going to start summarizing here, given how long this already is. If anyone feels like I've misrepresented anything you've said, please edit or add to this issue. If anyone else would like to know more information about what was said, please feel free to comment.
Brief Comment from @sternhenri
Additional information and Concerns from @nicola
Additional comments
@jbenet commented
@porcuquine's comments
Copied for simplicity, since there's too much to summarize.
Before reading this, I had conversations with both Ian and Evan about retrieval in general. Ian asked me to write some of them down in this thread. I'm going to paste in some notes below. They don't really address expansion directly, but they do provide some more thoughts about retrieval (and perhaps some more detailed information for those not familiar). In any case, my intent isn't to derail, just to provide more thoughts about considerations for fast retrieval. I think it's very important, and it's one of the factors motivating multiple sector sizes. The idea is that we need to provide enough flexibility to allow miners and clients to negotiate good outcomes across the natural range of tradeoffs.
The main new (to me) idea presented below is the idea that partially-extracted caches can co-exist with Rational PoSt to allow miners more economic flexibility to choose fast retrieval if they are rewarded for providing it. The rest is mostly review of techniques for fast retrieval.
With apologies for a crude and hasty composition; I wanted to get these ideas written down while I have time.
Regards, @porcuquine
cont.
I'm going to dive a little more into the partial caching idea, because I may have left too much to the imagination.
The idea is this: in general, miners are allowed to delete a certain amount of data and reseal it just in time. Because of the polling time, this is limited. The first suggestion is that miners who prioritize retrieval take advantage of this opportunity to keep all of their fast-retrieval sectors in a partially-extracted state. Then, when any of those sectors is challenged, they simply complete the partial replication in time to meet the challenge.
But we want to do better than that, because we assume someone is motivated to provide the fastest possible retrieval of some sectors. In that case, for these fastest-retrieval sectors, the miners could keep the sectors in an even more extracted state. It might be the case that the rewards are such that they can tolerate an occasional single fault (since they will always be able to recover the data for the next proving period). However, even if that is true, they would need to restrict this to a small portion of their sectors to avoid it happening often.
Therefore, they could also keep a portion of their sectors in some faster-retrieval state, then perform just enough extraction during each proving period to ensure that they will be able to answer any challenge. Then, once the challenges have been received, they can return the sector to its more-extracted state to serve any requests which might come in.
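A hypothetical sketch of the bookkeeping this strategy implies (none of these types or methods exist in any Filecoin implementation; they are only meant to make the description above concrete): each sector sits at some extraction level, is re-sealed just in time to answer a challenge within the proving window, and then drifts back to its retrieval-friendly state.

```go
// Hypothetical partial-extraction cache sketch. The types, fields, and
// rates below are illustrative assumptions, not real protocol or
// implementation details.
package sealcache

import "time"

type SectorState struct {
	ID            uint64
	ExtractedFrac float64 // 0.0 = fully sealed replica, 1.0 = raw data on disk
	PreferredFrac float64 // how far this sector is normally kept extracted
	ResealRate    float64 // fraction of a sector this miner can re-seal per hour
}

// TimeToAnswerChallenge estimates how long it takes to return the sector to
// a fully sealed state so a challenge against it can be answered.
func (s SectorState) TimeToAnswerChallenge() time.Duration {
	hours := s.ExtractedFrac / s.ResealRate
	return time.Duration(hours * float64(time.Hour))
}

// OnChallenge re-seals just in time; it reports false if the sector was kept
// too far extracted to re-seal within the proving window (a fault).
func (s *SectorState) OnChallenge(provingWindow time.Duration) bool {
	if s.TimeToAnswerChallenge() > provingWindow {
		return false
	}
	s.ExtractedFrac = 0 // fully re-sealed to answer the challenge
	return true
}

// AfterChallenge returns the sector to its retrieval-friendly state.
func (s *SectorState) AfterChallenge() {
	s.ExtractedFrac = s.PreferredFrac
}
```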
What is the benefit of doing this vs just waiting until a request comes in and throwing more parallelism at extraction? Maybe none, in some cases. However, there are problems with maximizing parallelism. First, when the required parallelism becomes too large, it complicates the problem itself. Second, in order to serve many high-parallelism requests at once, a very high total number of cores needs to be available. Therefore, a miner who wants to provide high-availability of many fast-retrieval sectors either needs a very large amount of on-demand compute, or needs to find a way to spread the work out over more time (so less intense bursting is required).
Partially-extracted caching is a way of addressing this problem. Note that this is completely flexible. Any sector can be kept in and transitioned to any percentage extraction at any time. The more slowly these transitions happen, the lower the overall hardware requirements are. Just as true retrieval miners have opportunities to be very clever in their algorithms for anticipating demand, so do storage miners using this strategy.
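Rough arithmetic for that tradeoff, with every parameter hypothetical: on-demand extraction sizes the cluster for the worst-case burst, while partial-extraction caching amortizes the same kind of CPU work across the whole proving period.

```go
// Toy comparison of the two strategies described above. All parameters are
// made-up placeholders.
package burstmath

// OnDemandCores: cores needed to fully extract `concurrent` sectors at once,
// when one sector costs sectorCPUHours of CPU work and must be served
// within targetHours.
func OnDemandCores(concurrent int, sectorCPUHours, targetHours float64) float64 {
	return float64(concurrent) * sectorCPUHours / targetHours
}

// AmortizedCores: cores needed to keep `cached` sectors cycling between a
// partially-extracted state and a provable state once per proving period of
// periodHours, where each cycle costs cycleCPUHours of CPU work.
func AmortizedCores(cached int, cycleCPUHours, periodHours float64) float64 {
	return float64(cached) * cycleCPUHours / periodHours
}
```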
This possibility suggests that the most successful players will be those with robust information about what they are storing, so they can make good guesses about which pieces should be in which state. This extends to knowledge at packing time. These considerations may work against our preferences or assumptions about how the client-miner relationship functions. I'm not sure. I do think this presents a strong argument that the most successful operators will be those who control entities at every layer of the stack. But with good metadata, reputation, and granular contracts, it should be possible for the ecosystem to incentivize and permit good decisions at each layer.
Example: a popular celeb tweets just so, and the year 1987 is trending. Since the savvy miner is always rebalancing his sector retrieval profile anyway, he makes use of this information and ensures that his pop data nearest to the celeb in the content graph is most available. Likewise, true retrieval miners begin preemptively caching similar material. In the spot-pricing battle for individual pieces (or complete 'catalogs'), the miners who have optimized best win and reap the rewards. (This is reminiscent of high-frequency trading, and I think if there is true demand for fast retrieval and a deep/broad selection of content over time, then we should expect such behavior to emerge. I realize this is not a rigorous argument.)
Regards, @porcuquine