filecoin-project / research

Home for Filecoin Research

An attacker should not be able to precompute the KDF #104

Closed nicola closed 4 years ago

nicola commented 5 years ago

Current setting: in order to encode a node X, a miner must already have encoded all of its parent nodes Parents(X) = P_1...P_l. Note that the last node a miner encodes before X is X-1 = P_l!

Problem: an attacker can pre-compute the part of the hash covering P_1...P_{l-1} as they encode the graph (every parent except P_l, which is only generated just before encoding X), so that in order to encode X they only need to complete the hash with P_l. This would speed up the KDF if a miner is willing to store extra data (at most 32 bytes per node, but probably less).
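A minimal sketch of that precomputation, using the sha2 crate (e.g. sha2 = "0.10" in Cargo.toml) with SHA-256 standing in for whatever hash the KDF actually uses; replica_id, parents, and the sizes here are illustrative, not the real implementation:

```rust
use sha2::{Digest, Sha256};

fn main() {
    let replica_id = [1u8; 32];
    // Labels of the already-encoded parents P_1..P_l (32 bytes each).
    let parents: Vec<[u8; 32]> = vec![[2u8; 32], [3u8; 32], [4u8; 32]];

    // Honest KDF: absorb replica_id and every parent label at encoding time.
    let mut kdf = Sha256::new();
    kdf.update(&replica_id);
    for p in &parents {
        kdf.update(p);
    }
    let key_honest = kdf.finalize();

    // Attacker: while encoding earlier nodes, absorb P_1..P_{l-1} ahead of
    // time and stash the hasher state (a few dozen bytes per node).
    let mut prefix = Sha256::new();
    prefix.update(&replica_id);
    for p in &parents[..parents.len() - 1] {
        prefix.update(p);
    }
    // Later, once P_l = X-1 has been encoded, only one update remains.
    let mut resumed = prefix.clone();
    resumed.update(&parents[parents.len() - 1]);
    let key_precomputed = resumed.finalize();

    assert_eq!(key_honest, key_precomputed);
}
```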

Solutions:

porcuquine commented 5 years ago

Doesn't incrementally building the hash require knowing which parents will be concatenated along the way? That part would require effectively manifesting large parts of the graph, I think. (I'm not addressing the possibility of precomputing something from each hash, just the part about incrementally computing the aggregate hash(es).)

schomatis commented 5 years ago

My two cents (without being an expert on the matter, but having thought about this when I first read the Rust implementation of the KDF): as a general rule of thumb it's always healthy to prefix any hash function with a distinct identifier to make it unusable in any other context, so besides using the ID of the replica I think it should also be using the node ID.

That said, I don't think there is any real danger here. The scenario that always concerned me was the extreme case where we have a maximum expansion factor of n and every node l has as parents every preceding node. In this pathological case the attack could be very effective: for, say, each even node l, save the "partial" computation of the hash H(sigma || C_v_1 || C_v_2 || ... || C_v_(l-2)), which we'll call PH_(l-2) (the partial hash up to and including the label of parent l-2), and discard all the computed labels. (I'm counting this as storing roughly half the graph, treating the partial hash state as roughly the size of a label.)

So, to reconstruct the entire replication, the adversary would need two rounds: (1) complete the partial hash PH_(l-2) (this adds the necessary padding and length internally, but we can ignore that here) and perform the VDE.Enc to regenerate the (deleted) C_v_(l-1), and (2) extend PH_(l-2) with the just-generated label C_v_(l-1) (upgrading it to PH_(l-1)) and generate C_v_l. (This can easily be extended to retain any fraction, not just half, of the PH_l, achieving the smooth time/space trade-off of the faulty Basic-VDE-PoRep.)

The sigma, I understand, is a global parameter (a replication ID or similar) that doesn't specialize the hash for the particular node it is working on, making PH_l reusable (compare with an alternative KDF that would include the node id in the hash function, H(sigma || l || C_v_1 || C_v_2 || ...), which would prevent this attack).
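A minimal sketch of that alternative KDF, again with SHA-256 and the sha2 crate standing in for the real hash and with illustrative names; absorbing the node index right after sigma makes a saved prefix state useless for any other node:

```rust
use sha2::{Digest, Sha256};

/// KDF that domain-separates by node index: a prefix state computed for node
/// l cannot be reused for any other node. Names here are illustrative only.
fn kdf(sigma: &[u8; 32], node_index: u64, parent_labels: &[[u8; 32]]) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update(sigma);
    h.update(node_index.to_le_bytes()); // the `l` in H(sigma || l || C_v_1 || ...)
    for label in parent_labels {
        h.update(label);
    }
    h.finalize().into()
}

fn main() {
    let key = kdf(&[0u8; 32], 7, &[[1u8; 32], [2u8; 32]]);
    println!("{:02x?}", &key[..4]);
}
```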

This actually can't happen, at least in our implementation, because we "pad" nodes (e.g., in the forward direction with node 0) to have exactly d parents. So in the extreme case d=n no node would have a common prefix: e.g., node 0 and node 1 would have the parents array [{n_0}^n], node 2 would have [{n_0}^(n-1), n_1], then [{n_0}^(n-2), n_1, n_2], and so on. (Not actually those proportions of repeated n_0s, since the expansion parents assignment is a random permutation and we may repeat n_1 zero or more times, but no two nodes would share the same prefix except with negligible probability.)

In the actual implemented case of a normal expansion factor like 8, the chances of repeated prefixes are also very low (and this doesn't even factor in the parents generated in the base graph with the bucket sampling algorithm). In this case, for the attacker to effectively "save" some computation by storing partial hashes instead of computed nodes, you'd need to find L common prefixes that together contain M (> L) parents that only appear in those prefixes. In that (very unlikely) case you'd save the M - L difference by storing those prefixes instead of the actual node parents. (That would be a game where you start with all nodes and no partial hashes stored, and see whether you can exchange a group of partial hashes for a bigger group of nodes to save the difference. I'm simplifying it here, but I think that game would have many restrictions to be profitable.)

I think everything boils down to the question: is there a non-negligible probability of having many nodes with a common prefix of parents that would help reduce parallel rounds? To which I guess the answer is no.

porcuquine commented 5 years ago

I think you're onto something and that it could be corrected easily. Currently, we always sort the final parent set (ascending). As you note, we also pad with the first node (in the forward direction — the last in the backward direction).

On average, we expect expansion_degree / 2 padding. So unless we happen (very unlikely!) to have a node with no padding (i.e., all candidate expansion parents were accepted in this direction), every forward node's KDF pre-image will begin with replica_id || nodes[0]. Most will begin with replica_id || nodes[0] || nodes[0], etc., and the average case will have four copies of the first node. It would be cheap to store all these prefixes and skip re-hashing them in each of these cases.
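A rough sketch of what exploiting this could look like, assuming SHA-256 (via the sha2 crate) as the KDF hash and illustrative names: hash the shared replica_id || nodes[0] || ... prefix once, keep the intermediate states, and resume from the matching state for each node.

```rust
use sha2::{Digest, Sha256};

fn main() {
    let replica_id = [7u8; 32];
    let label_node0 = [9u8; 32]; // label of nodes[0], shared by many pre-images

    // Hash the shared prefix once, keeping the intermediate state after each
    // additional copy of nodes[0] (1, 2, 3, ... copies of padding).
    let mut running = Sha256::new();
    running.update(&replica_id);
    let mut prefix_states = Vec::new();
    for _ in 0..4 {
        running.update(&label_node0);
        prefix_states.push(running.clone());
    }

    // Encoding a node whose sorted parent list starts with three copies of
    // nodes[0]: resume from the stored state instead of re-hashing the prefix.
    let remaining_parents: Vec<[u8; 32]> = vec![[3u8; 32], [5u8; 32]];
    let mut h = prefix_states[2].clone(); // state after replica_id || nodes[0] * 3
    for p in &remaining_parents {
        h.update(p);
    }
    let _key = h.finalize();
}
```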

The problem doesn't arise when encoding the reverse direction, because the padding is sorted to the end, and there's no likelihood or way to anticipate any collisions between two nodes' 'nearest parent'.

NOTE: Technically, since the graph is constant, one could precompute everything, search for prefix matches across the whole set of preimages, then 'compile' a version of encoding which knows which parent nodes need to be stored into prefixes, and which nodes should consult prefixes — and provides hints during encoding so this can happen efficiently. This is the general shape of what all attacks on the graph structure would look like, I think: manifest the whole graph and look for exploitable patterns, then build an encoding or proving routine which exploits what was found in the manifest graph.

If we want to get rid of this:

One final thought, though: since the possibility of prefixing doesn't change the difficulty of proving (it just makes KDF generation a little cheaper on average), what if we went in the other direction and reversed the sort ordering in the reverse direction, so the repeated padding sorts first (descending) in those cases too? Now pre-computation of the prefix is always possible and easy. We implement this optimization in the standard encoding and adjust the expected cost of encoding. This makes replication a little cheaper, but in a way no one can exploit. (Cheaper replication that gives no one a speed advantage would be good for us for performance reasons, I think.)

schomatis commented 5 years ago

I agree with your comments about ordering, but just to clarify: my analysis above (and I may have misunderstood the premise of this issue) isn't concerned with KDF computation times (I'm always assuming the KDF could be instantaneous and that all the delay of the replication should rely on the VDE), but with precomputation only as a way to avoid encoding other (parent) nodes (and hence avoid the delay of their VDEs).

Saving the partial hash of a node just to speed up its KDF seems like a bad deal to me (even for the repetitive padding parents) when you can use that same space to store the node itself and bypass its (more expensive, and supposedly the true bottleneck) VDE. (Storing precomputed KDFs of many padding node 0s is also not a very good deal in itself, because the more node 0s you precompute, the fewer nodes will have such a long prefix and the smaller the target you'll be able to use it for.)

I only cared about keeping KDFs to avoid VDEs, and normally trading one for the other is not worth it unless there are specific prefixes (including nodes other than padding) that are repeated often enough to significantly increase the chance of not needing to encode a parent node in order to replicate any (random) node, (virtually) shortening the dependency path.

nicola commented 5 years ago

Short-term resolution: every hash should be prefixed with the latest encoded node (i.e., parent labels are hashed in descending order).
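A minimal sketch of this resolution, with SHA-256 (sha2 crate) and the names below as placeholders for the real KDF: with parent labels sorted descending, the first input after the replica ID is the label of X-1, which only exists immediately before X is encoded, so no useful prefix can be precomputed.

```rust
use sha2::{Digest, Sha256};

/// KDF pre-image with parents hashed in descending node order, so the first
/// label absorbed after the replica ID is that of node X-1. Placeholder names.
fn kdf_descending(replica_id: &[u8; 32], mut parents: Vec<(u64, [u8; 32])>) -> [u8; 32] {
    // Sort (parent_index, label) pairs descending: parents[0] is node X-1,
    // whose label only exists immediately before X is encoded.
    parents.sort_by(|a, b| b.0.cmp(&a.0));

    let mut h = Sha256::new();
    h.update(replica_id);
    for (_, label) in &parents {
        h.update(label);
    }
    h.finalize().into()
}

fn main() {
    let parents = vec![(0u64, [0u8; 32]), (41, [4u8; 32]), (99, [9u8; 32])];
    let _key = kdf_descending(&[1u8; 32], parents);
}
```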

nicola commented 4 years ago

This seems resolved