PoSv3 Time-Drift related problems

castarco commented 5 years ago

First, some well known facts:

In PoSv3, we use timestamps to compute the kernel hash for the block we are trying to propose (these timestamps are "masked" to decrease their granularity).
The majority of PoSv3 implementations (including ours) use the system time to compute the block timestamps, without trying to explore the whole set of potentially valid timestamps (according to the validation rules).
There's a range of valid timestamps for every block. Its lower boundary (not inclusive) is the median of the past N (usually 11) blocks. Its upper boundary is usually the current system time plus some extra margin. In the case of Particl, this margin equals to 2h, in other cryptocurrencies, the margin is smaller.
We have a time window to avoid network splits due to out-of-sync clocks, it's literally impossible to ensure 100% synchronized blocks for all the network, so we have to deal with that.

Now, the issues:

When the majority of nodes are "naive", in the sense that they use the system time, and don't try to explore all the valid timestamps when they try to propose, the "greedy" nodes can have a drastic advantage. Let me elaborate on this point:
- Some PoS-fanatics argue that this is an exaggeration or delusional ( https://talk.peercoin.net/t/timedrift-exploit-on-pos-coins-to-increase-minting-probability ). They claim that the search space is small, and that, once it has been explored, only one single hash per second can be computed.
- Their arguments are completely flawed, even if the search space is "small", 100 is still 100 times bigger than 1.
- Even if some seconds after receiving a block the search space has size 1 again, the first instant is still critical, because it gives a considerable amount of lottery tickets to the proposer.
Not only the greedy proposers use more "lottery tickets", they also have the ability to kick out of the competition the "naive" nodes. If they push the block timestamps to the future, then the timestamps in the blocks proposed by the naive nodes will fall behind the past median time. This is easy to solve, but it's not solved in the current implementation.
Lets assume for now that, given that the greedy nodes have some advantage, everybody starts using a greedy implementation. Then we face new problems: the difficulty adjustment mechanism is screwed up because the timestamps are shifted (usually to the future), and even if they are not shifted, they become noisier.
- When the timestamps are shifted to the future, the difficulty becomes too low.
- When the timestamps are noisy, the difficulty becomes bumpy, affecting the throughput.

Some considerations:

This issues are not just theoretical:

Other cryptocurrencies have faced similar problems (and some of them have been attacked using this behavior, https://bitcoinist.com/interview-presstab-pos-vulnerabilities/ ).
We ran several simulations, with different consensus parameters, and we were able to see all the predicted effects: noisy timestamps, shifted timestamps, badly adjusted difficulty, and huge advantage for the greedy nodes:
- For 2h-windows, 120s between blocks, and 16s masks, simulated for 48h, where the greedy node started with just 1/5000 of the stake (50 proposers, 75 relay nodes), it had 77 times more probabilities to propose than what its stake would allow in a "fair" competition.
- For 10 minutes windows, 16s between blocks, and 4s masks, simulated for 48h, where the greedy node started with just 1/5000 of the stake (50 proposers, 75 relay nodes), it had 38.58 times mores probabilities to propose than what its stake would allow in a "fair" competition. It's plausible that the smaller(38.58 vs 77) advantage can be attributed to smaller time window.

This is how the block timestamps are shifted relative to the "real" time with 600s time windows, 16s between blocks & 4s time mask. This is an extreme case because the greedy nodes started with a relatively huge stake, and the naive nodes were "kicked out" because their timestamps were also behind the past median time, so take it with a grain of salt (I just recovered the image from my first experiments). shifted_timestmps

The next picture is a more realistic case (extracted from a 48h simulation), using our settings (but with a 10 minutes time window, negative values mean that the timestamp is shifted to the future in this plot): shifted_timestmps2

Proposed mitigations:

Having small time windows (in the seconds or minutes scale). This is the most obvious one, proposed also by other colleagues and applied in other projects.
Change kernel hash calculation:
- All the nodes are greedy by default, so they perform a limited version of PoW.
- We recover the concept of nonce, with the following considerations:
- We don't use the timestamp for the kernel hash.
- We use the nonce for the kernel hash.
- We apply the restrictions that we were applying to the timestamp to the nonce (The nonce is constrained by the upper time boundary & by the past median block time, so the nonce would be similar to a timestamp, but noisier).
- Summarizing: the basic idea is to remove incentives to tweak the block time, so we could have some "drift" on the nonce, but there's no rational reason why the nodes would tweak the timestamps (the next kernel hash won't depend on the past block time, but on the median of many, and luckily, the median is quite insensitive to what happens at the boundaries of the distribution).

Notice that these mitigations are not a complete solution. Changing how the kernel hash is computed helps with the noisy timestamps, but still presents the problem of having a "probability peak" centered at the block creation + the average block propagation time (a new block is created, is propagated, and then, all the proposers compute a ton of hashes immediately, then they just compute 1 hash per unit of time after that).

There are also potential mitigations for this last problem, but they wouldn't be part of the consensus rules. Basically, the proposers could compute the kernel hashes for a subset of their competitors, and if they see that their competitors can't propose this round, then they can wait a little bit to relay their new block without the (much) risk of losing the race. The competitors subset could be obtained by using the last N rewards and/or combined stake outputs, where N should be relatively large.

thothd commented 5 years ago

In general fo security issues, I suggest to create a branch with a functional test in an active PoSv3 coin repository (Qtum is a good candidate, maybe Particl) as @amiller is usually doing. Mostly since it requires orchestrating few things, bypassing validations etc, otherwise it's very hard to asses it. It's probably even better to create such an issue on the other running repositories first then maybe refer to it, since then you can show the test attack results and specific times - e.g 2 hours which I couldn't find by the way in Particl ?

castarco commented 5 years ago

since then you can show the test attack results and specific times - e.g 2 hours which I couldn't find by the way in Particl ?

I probably misread some parameter, looked at the wrong place, or looked at old versions. Anyway, the problem is clearly related to how much drift we allow, decreasing the allowed drift is a very effective mitigation.

dtr-org / unit-e

PoSv3 Time-Drift related problems #817