Joystream / joystream

Joystream Monorepo
http://www.joystream.org
GNU General Public License v3.0

Proof-of-storage #2555

Open bedeho opened 3 years ago

bedeho commented 3 years ago

Currently, we are relying on two distinct mechanisms to induce storage providers to actually replicate the bags the chain obliges them to store:

  1. Risk of being detected as not doing so, and slashed at the discretion of the storage working group lead.
  2. Lost revenue from not being picked to be synced from.

These incentives are fine, but they are obviously imperfect. There is monitoring cost and collusion risk in 1), and for very low-popularity content the net marginal bandwidth revenue could hypothetically be lower than the expected storage cost in 2).

We could throw one additional, relatively simple, incentive into the mix, which is to use the same proving mechanism that Sia has: the chain probabilistically challenges a storage provider to submit on-chain a fragment of the data it is supposed to store, and the validity of that fragment is authenticated in the runtime using a Merkle proof against the root hash initially committed to. Failure to submit a valid reply results in slashing.

Here it is worth noting that it could make sense to have two distinct commitments per object: one with large chunks, useful for doing iterative validation of data objects during p2p synching, and one with small chunks, used for on-chain challenges, to minimize the total footprint on-chain. Conceptually this is super simple, and we are very well placed to do the same in v2, as we are literally storing a data hash in the state already (for other, temporary reasons actually). This mechanism is of course also not perfect, e.g. there is a risk of
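
For illustration only, here is a minimal sketch of what the runtime-side check could look like, assuming blake2-256 hashing, a binary Merkle tree over fixed-size chunks, and an explicit left/right flag per sibling in the proof (none of which is specified above):

```rust
// Minimal sketch of on-chain chunk verification against a committed Merkle root.
// Chunk size, hashing scheme and sibling ordering are assumptions for
// illustration, not the actual Joystream commitment format.
use sp_core::hashing::blake2_256;

/// One sibling hash on the Merkle path, plus whether it sits to the left
/// of the running hash.
pub struct ProofNode {
    pub hash: [u8; 32],
    pub is_left: bool,
}

/// Recompute the root from the challenged chunk and its Merkle path,
/// and compare it with the root committed when the object was uploaded.
pub fn verify_chunk(committed_root: &[u8; 32], chunk: &[u8], path: &[ProofNode]) -> bool {
    let mut acc = blake2_256(chunk);
    for node in path {
        let mut buf = [0u8; 64];
        if node.is_left {
            buf[..32].copy_from_slice(&node.hash);
            buf[32..].copy_from_slice(&acc);
        } else {
            buf[..32].copy_from_slice(&acc);
            buf[32..].copy_from_slice(&node.hash);
        }
        acc = blake2_256(&buf);
    }
    &acc == committed_root
}
```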

The main complexity would be in dealing with the fact that the liaison may, in order to maliciously get peers slashed, confirm uploads that have invalid hashes, or simply withhold synching to peer providers so they don't actually get the data. This could be addressed in one of the following ways:

a) Only challenging liaisons, if they exist. This is a bad alternative, as over time liaisons will tend to no longer be part of the system, or no longer be part of that same bag.

b) Re-introduce the concept of storage providers signalling when they have replicated a data object, so that the runtime knows whom it can ask with what challenge. This also has the added benefit of making this knowledge publicly verifiable, which restricts what providers can claim is causing their malfunction after they have confirmed. The downside is obviously that it adds lots of extra transactions and additional state. You can avoid the very worst problem, namely all the cleanup you would have to do when setting a new operator for a bucket, by simply having the runtime assume that anyone set as an active provider for a bucket is opting into having all confirmed objects of the prior provider.

c) Introduce the leaner concept of providers having to signal that they are fully synchronized with a given bag every time it has had one or more new objects added. You then only need to associate a block number with the relationship between a storage bucket and a bag; if this block number exceeds the block number at which the bag was last updated, then the runtime knows the bucket operator should have everything in that bag (see the sketch below). This signalling has the added benefit of being a nice liveness indicator, and the chain could even slash providers that start to lag too far behind on their synching overall across all bags, something that presumably would not be easy for other providers to cause unless there was a large coordinated censorship effort.

d) If one assumes that, most of the time, i.e. during honest faults, a given bucket will only rarely find that it is unable to get some object, or get up to speed with a full bag, despite it being > 24 hours (let's say) since there was a change, then one could flip the representation and instead permit a bucket operator to have a sort of exception list of, say, 100-1000 things that it is still not able to get. The on-chain challenge protocol will ignore anything on this list. Since the list is so much smaller than the overall set of obligations a bucket has, it is not a way to circumvent the probabilistic challenge; it really only works to avoid rare genuine issues. One could imagine a large number of providers trying to quickly censor a given provider, perhaps even by starting to do lots of uploads, so as to try to fill the exception list of a victim provider. This would not be easy or cheap at scale, as the attackers would consistently need to occupy all other buckets on the same bags as the victim, but it is not impossible, and there is no direct payoff to the attacker. To deal with this, it could be possible for a provider to enable a circuit breaker signal of some kind, opting out of being challenged, at which point the lead has to step in. Importantly, this emergency signal is not a way of opportunistically getting out of a challenge, as you have to do it a certain amount of time in advance, and you never know when the challenge will hit.
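
As a rough sketch of the state comparison behind option c), with purely hypothetical names (the actual storage pallet does not have these items):

```rust
// Hypothetical state for option c):
// - for each (bucket, bag) pair, the block at which the operator last
//   signalled "fully synced with this bag";
// - for each bag, the block of its last object addition.
type BlockNumber = u32;

/// The runtime may challenge a bucket operator on any object in a bag only
/// if the operator's sync signal is at least as recent as the bag's last update.
fn challengeable(
    last_synced_at: Option<BlockNumber>, // None if the operator never signalled
    bag_last_updated_at: BlockNumber,
) -> bool {
    match last_synced_at {
        Some(synced) => synced >= bag_last_updated_at,
        None => false,
    }
}
```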

One additional, but quite narrow, technical question is how to make the probabilistic on-chain sampling efficient, and how to make it sample over the storage space rather than the object space. Sampling over the storage space means that each byte of data has an equal chance of being audited, while sampling over objects means that storage providers can speculate and try to only store small objects, since they are cheaper. It's not at all clear how much of a difference this distinction makes, and one can calibrate the challenge protocol so that probabilities and penalties are tuned to hit large objects sufficiently frequently.
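
To make "sampling over the storage space" concrete, here is an illustrative, unoptimised sketch: draw a random number below the total stored byte count and walk the object list until the draw falls inside an object, so every byte is equally likely to be challenged. The object list and randomness source are assumptions:

```rust
/// Pick the object (and byte offset within it) covering a random point in
/// the bucket's total stored bytes. Objects are (id, size) pairs.
fn pick_challenged_object(objects: &[(u64, u64)], random_u64: u64) -> Option<(u64, u64)> {
    let total: u64 = objects.iter().map(|(_, size)| size).sum();
    if total == 0 {
        return None;
    }
    let mut target = random_u64 % total;
    for &(id, size) in objects {
        if target < size {
            // `target` is also the byte offset of the challenged chunk.
            return Some((id, target));
        }
        target -= size;
    }
    None
}
```

An efficient on-chain variant would need an indexed structure (e.g. a prefix-sum tree) rather than this linear walk.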


traumschule commented 2 years ago

Because sync status signals by providers cannot be trusted without proof, the concept of on-chain challenges seems more achievable:

Here it is worth noting that it could make sense to have two distinct commitments per object, one with large chunks, useful for doing iterative validation of data objects during p2p synching, and one with small chunks, used for on-chain challenges

Rationale

The main motivations for providers to be fully functional (by actually serving all assigned assets in full) are a) being advertised (linked to) by Atlas or other apps, b) earning bandwidth revenue by serving requested content, and c) not getting slashed. As noted above, b) only provides an incentive to serve highly demanded assets, not to store all assets equally. The mechanism for c) could be automatic (as for validators) or manual, assuming the lead has effective mechanisms to evaluate and slash all providers across all content and is willing and able to do so without discrimination. There are obviously several risks in having one person / role in control of all providers, for example if they are biased towards trending content over low-demand high-quality content, neglect to slash selected providers, or hold other biases.

To actually ensure that all content is available all the time, the DAO runtime needs a mechanism that probes all providers for random chunks of data every block and rewards or slashes based on the result.

Why every block? The main benefits are to a) know at the chain level which active providers are online, and b) be able to infer their load based on response time. This would make it technically possible for apps to react to suddenly failing / slow providers within seconds and immediately divert requests to better-performing alternatives. The user would not even need to notice if a stream gets interrupted midway and the asset is requested from another source, as long as the buffer has enough playback time.

Scope

ProofOfStorage events inform participants about the status of active providers:

Outcome:

The pallet has to produce the event every block to be useful for apps.
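
As a sketch only, the per-block event could carry something like the following (plain Rust with hypothetical names; an actual pallet event would be declared via FRAME macros):

```rust
// Hypothetical shape of the per-block ProofOfStorage event; all names are
// placeholders, not taken from the existing pallets.
type StorageProviderId = u64;
type BlockNumber = u32;

enum ProofOfStorageEvent {
    /// One challenge round per block.
    ChallengeRound {
        block: BlockNumber,
        /// Providers that returned the correct chunk within the deadline.
        responded: Vec<StorageProviderId>,
        /// Providers that failed or timed out (if failures are recorded at all).
        failed: Vec<StorageProviderId>,
    },
}
```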

Implementation

The status of providers is evaluated by validators. How does a validator know without actually having the data? By comparing responses among bucket providers (distributors) and the matching storage providers (who are supposed to hold all data). Each request needs to be sent to at least two distributors and all matching storage providers.

  1. Beginning at block creation, the validator determines a random asset and position for each bucket and sends a request for 64 KB of data from that position (equivalent to a ping). It does not need to know how many assets there are or how long they are. The request consists of assetId (string) and position (integer), meaning seek 64k from <position> modulo <length> of <asset>; or, since it may be undesirable for validators to store asset IDs, it can instead ask for an index, meaning get asset <index> modulo <number of assets in that bucket> (see the sketch after this list).

  2. Providers that respond with the correct chunk within 1-5 seconds are mentioned in the event (it needs to be explored whether it is worth storing failed statuses as well).

  3. Providers that a) hit a failure threshold set by the lead, OR b) do not meet a deadline of X successes within N blocks get slashed. How this is implemented depends on what is cheapest to store and query.
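
A sketch of the request and its provider-side resolution from step 1, under the stated assumptions (64 KB chunk, index and position interpreted modulo on the provider side); all names are illustrative:

```rust
const CHUNK_SIZE: u64 = 64 * 1024;

/// Hypothetical challenge request as described in step 1.
struct ChallengeRequest {
    /// Index into the bucket's object list, interpreted modulo the object
    /// count, so the validator does not need to know how many objects exist.
    object_index: u64,
    /// Byte position, interpreted modulo the object's length.
    position: u64,
}

/// Provider-side resolution of a challenge into a concrete read:
/// which object, from which offset, and how many bytes (clamped at the end).
/// Assumes a non-empty bucket and a non-empty object.
fn resolve(req: &ChallengeRequest, object_count: u64, object_len: u64) -> (u64, u64, u64) {
    let index = req.object_index % object_count;
    let offset = req.position % object_len;
    let len = CHUNK_SIZE.min(object_len - offset);
    (index, offset, len)
}
```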

Costs

If this is considered for #2923, the effect on block creation time, bandwidth usage, CPU / RAM load and, most importantly, storage space has to be tested.

Estimated additional bandwidth per provider would be around 40 MB per era, or 1 GB daily.
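
As a rough sanity check, assuming 6-second blocks and one 64 KB challenge response per provider per block (both assumptions, not parameters fixed above): 14,400 blocks per day × 64 KB ≈ 0.9 GB per day, which is consistent with the ~1 GB daily figure; the per-era number then depends on the era length used.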

With an increasing number of buckets and providers, the network could reach a point where one validator is unable to process the responses of all providers within one block. What are those numbers?

bedeho commented 2 years ago

My understanding of this is that you want validators to somehow audit storage providers & distributors, then emit a record of this on-chain, which is used to slash or reward those actors?

My problem with this would be that validators don't have an incentive to do this well; they are not accountable to anyone, and they could either maliciously attack providers or simply not bother doing proper audits. Unlike the lead, who is accountable to the council, and who can be both slashed and also lose all future revenue from not getting access to the role again after tarnishing their reputation. Lastly, I don't see the benefit of this compared to the automated proof-of-storage, where providers have to actively prove it to the runtime.

traumschule commented 2 years ago

Validators can be slashed too. What mechanisms are in place to ensure that they produce blocks correctly? Similar methods could be applied here as well.

To assume that validators could run an unaudited runtime (for profit or attacks) means we need to assess the potential harm that could already be done. Is there an issue for this?

Would this be a good use-case for a meta-protocol? (#1990)

bedeho commented 2 years ago

Validators can be slashed too. What mechanisms are in place to ensure that they produce blocks correctly? Similar methods could be applied here as well.

Not really, because you can actually cryptographically verify whether a validator produced a valid block or produced a double-signed block; that is why the consensus protocol can slash autonomously. This is not the case for this other kind of verification, because there is no objectively verifiable proof of the audit you have mentioned.

To assume that validators could run an unaudited runtime (for profit or attacks) means we need to assess the potential harm that could already be done. Is there an issue for this?

Disabling or tampering with any non-consensus code, such as what you are describing, is not difficult. For example, offchain workers in Substrate can be turned on or off by the operator.

Would this be a good use-case for a meta-protocol?

I am not sure what the link would be here.