ethereum / consensus-specs

Ethereum Proof-of-Stake Consensus Specifications
Creative Commons Zero v1.0 Universal

Sybil-like NTP-level attack #1592

Open ericsson49 opened 4 years ago

ericsson49 commented 4 years ago

The beacon chain protocol assumes that validator clocks are roughly synchronized; e.g. the fork choice spec states:

Honest clocks: Honest nodes are assumed to have clocks synchronized within SECONDS_PER_SLOT seconds of each other.

However, no mechanism to ensure this assumption is specified, so it is the validator's responsibility. It's highly probable that many validators will use NTP to synchronize their clocks to the world standard (i.e. UTC), since it is easy to set up and the alternatives can be expensive.

It's also highly probable that such NTP setups will be using the NTP pool. An excerpt from the NTP pool page:

The pool is being used by hundreds of millions of systems around the world. It's the default "time server" for most of the major Linux distributions and many networked appliances

Or they can use servers from other public NTP server lists, e.g. http://support.ntp.org/bin/view/Servers/WebHome.

The NTP pool is free to join (an excerpt from the NTP pool page):

Because of the large number of users we are in need of more servers. If you have a server with a static IP address always available on the internet, please consider adding it to the system.

This can result in an implicit dependency of the beacon chain on the NTP pool (and/or other public NTP lists) if validators are not careful enough in configuring their NTP servers.

Such a situation creates an opportunity for a Sybil attack on the beacon chain protocol under certain conditions: an adversary can populate the NTP pool (or any public NTP server list) with lots of Byzantine-faulty NTP servers, which will report the wrong time to validator nodes.

The NTP protocol can tolerate certain errors, e.g. detect "falsetickers" by comparing results from several NTP servers. However, if there are many faulty NTP servers in the pool, there is a high probability that a correct server will look like a "falseticker".
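
The falseticker problem can be illustrated with a minimal sketch. It does not reproduce NTP's actual selection algorithm; as a simplifying assumption it keeps every server whose offset lies within a tolerance of the median. The function name, server names, and offsets are hypothetical.

```python
from statistics import median

def split_by_median(offsets, tolerance):
    """Partition reported clock offsets (seconds) into agreeing servers and falsetickers.

    Simplified stand-in for NTP's selection step: a server is kept if its
    offset lies within `tolerance` of the median of all reported offsets.
    """
    mid = median(offsets.values())
    agreeing = {name for name, off in offsets.items() if abs(off - mid) <= tolerance}
    return agreeing, set(offsets) - agreeing

# Hypothetical sample: one honest server and three colluding servers reporting about +30 s.
reports = {"honest": 0.002, "evil-1": 30.0, "evil-2": 30.01, "evil-3": 29.99}
agreeing, falsetickers = split_by_median(reports, tolerance=0.1)
```

With a faulty majority, the colluding servers define the consensus and the only honest server ends up in `falsetickers`.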

NTP pool servers are also monitored by the pool software. However, if the adversary knows the IP addresses of beacon chain protocol participants, its faulty NTP servers can report the wrong time only to clients whose IP addresses are on that list, while answering everyone else (including the pool monitor) correctly. This is why the NTP servers controlled by the adversary are considered Byzantine-faulty (two-faced clocks).
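
A two-faced server of this kind might behave roughly as in the following sketch. The function, the addresses (taken from documentation-only IP ranges), and the 30-second skew are all hypothetical and chosen only for illustration.

```python
def reported_time(true_time, client_ip, targets, skew):
    """Time a two-faced server returns: skewed for targeted clients, honest otherwise."""
    return true_time + skew if client_ip in targets else true_time

# Hypothetical target list: addresses the adversary believes belong to validators.
targets = {"203.0.113.7"}

# The pool monitor is not on the list, so it sees the correct time.
monitor_view = reported_time(1_000.0, "198.51.100.1", targets, skew=30.0)
# The targeted validator receives a time that is 30 s ahead.
validator_view = reported_time(1_000.0, "203.0.113.7", targets, skew=30.0)
```

Because the monitor only ever observes the honest face, monitoring alone cannot expose such a server.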

If only a few validators' clocks are distorted by such an attack, the beacon chain protocol can tolerate it. However, the key problem with this scenario is that many validators can be vulnerable to the attack if they are not careful enough when setting up NTP. So multiple correlated faults can be induced, alone or together with other means of attacking the beacon chain protocol. E.g. the p2p-interface spec prescribes delaying early messages, so such an attack can be used to delay or break message flow in the beacon chain p2p graph. Note that since non-validator nodes can participate in the p2p graph, they can be used to attack the beacon chain protocol too.

The attack is described in more detail in a separate document.

In principle it's relatively easy to withstand the attack: beacon chain participants should be careful when configuring NTP. However, if it's risky to use NTP servers from public NTP server lists, where should they obtain NTP servers?

Using NTP servers controlled by big corporations, non-profits, or government agencies is a possibility; however, it can lead to a similar correlated implicit dependency and a lack of decentralization, which may not be desirable for various reasons.

Wealthy validators can set up their own time servers; however, this significantly raises the entry barrier to running a validator node.

We will elaborate on possible ways of achieving reliable clock synchronization in another document, including BFT clock synchronization solutions and/or anonymous access to public time services (e.g. GNSS, radio clocks, public NTP servers, etc.).

The main goal of this issue is to warn Ethereum 2.0 implementers and researchers that it can be dangerous to rely on the default NTP setup and on public NTP server pools and lists. It's also dangerous to assume that most validators can set up NTP or another time service in a secure manner. Thus, it's a risk to the overall beacon chain protocol.

As very minimal counter-measures, we propose:

These minimal counter-measures are hardly enough, so the best solution would be to design a BFT clock synchronization protocol, so that validator and non-validator node administrators are relieved of the burden of setting up a secure time service. However, such a BFT protocol can be prohibitively expensive given the expected scale of the beacon chain protocol (thousands of nodes), so cheaper solutions should be investigated too.

We stress that the beacon chain protocol can tolerate a limited number of validators with vulnerable NTP setups, so a separate BFT clock synchronization protocol may be excessive if there is a way to prevent correlated NTP-level failures.

dankrad commented 4 years ago

Has someone detailed what the actual attacks are that you can do if you control the time of any arbitrary number of nodes? Presumably they would concern liveness but not safety?

One of the possible mitigations is to not accept any dates/times that are outside of a given range of the current system time (which should be stable enough to be trusted not to deviate more than a few seconds per day). This is actually implemented in the standard ntpd as the "panic" flag and set to 1000s by default -- by advising validators to set it to a much smaller value, could we mitigate this attack or at least force the attacker to introduce their skew "slowly" which would make it potentially detectable long before the attack can be executed?
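
The effect of a tighter threshold can be sketched as follows. This models only the threshold comparison, not ntpd's actual behaviour when the limit is exceeded, and it assumes each accepted correction can move the clock by at most the threshold. The function names and the 10-second value are hypothetical; 384 s is one epoch (32 slots of 12 s).

```python
import math

def accept_offset(measured_offset, panic_threshold):
    """True if a measured clock offset (seconds) is small enough to be applied."""
    return abs(measured_offset) <= panic_threshold

def min_updates_to_skew(target_skew, panic_threshold):
    """Fewest accepted corrections needed to reach target_skew, assuming each
    correction can move the clock by at most panic_threshold seconds."""
    return math.ceil(target_skew / panic_threshold)

EPOCH_SECONDS = 384  # 32 slots of 12 s

# With the 1000 s default, an epoch-sized jump passes in a single step.
one_shot_default = accept_offset(EPOCH_SECONDS, panic_threshold=1000)
# With a hypothetical 10 s threshold it is rejected ...
one_shot_tight = accept_offset(EPOCH_SECONDS, panic_threshold=10)
# ... and the attacker needs many smaller steps instead.
updates_needed = min_updates_to_skew(EPOCH_SECONDS, panic_threshold=10)
```

Under these assumptions a smaller threshold does not stop the attack, but it forces the skew to be introduced over many updates, which leaves more time to notice it.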

ericsson49 commented 4 years ago

Severe clock disparity, around an epoch duration (384 sec), can lead to a liveness violation, i.e. blocks cannot be justified/finalized. Safety can in theory be compromised too, due to the 'inactivity leak': if validators are forced to be inactive, they will be losing their balances. I analyzed it in more detail in the document. Vitalik Buterin also discussed it earlier in his post.

Actually, since the p2p-interface spec requires blocking/delaying early messages, any clock disparity > 500ms can impede message flow. If the disparity is not severe, e.g. around a slot duration or so, the beacon chain should be able to tolerate it; however, it can be used to facilitate other attacks. Robustness, i.e. performance under attack, will definitely be affected (e.g. validators getting smaller or no rewards).
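
A rough sketch of why a lagging clock impedes message flow is given below. The constant names follow the consensus specs and the values (12 s slots, 500 ms disparity) are the ones discussed here, but the helper functions and the genesis time of 0 are hypothetical simplifications, not the spec's gossip validation rules.

```python
SECONDS_PER_SLOT = 12
MAXIMUM_GOSSIP_CLOCK_DISPARITY = 0.5  # seconds (500 ms)

def slot_start(genesis_time, slot):
    """Wall-clock time at which `slot` begins."""
    return genesis_time + slot * SECONDS_PER_SLOT

def is_early(genesis_time, slot, local_time):
    """True if, by the receiver's clock, the slot has not started yet,
    even after allowing for the permitted gossip clock disparity."""
    return slot_start(genesis_time, slot) > local_time + MAXIMUM_GOSSIP_CLOCK_DISPARITY

# A message for slot 10 is sent exactly on time (true time 120.0 s after genesis).
# A receiver whose clock lags by 0.3 s accepts it immediately.
small_lag = is_early(0, 10, local_time=119.7)
# A receiver whose clock lags by 2 s treats it as early and delays it.
large_lag = is_early(0, 10, local_time=118.0)
```

In this model, any receiver whose clock has been pushed back by more than the allowed disparity will hold back messages that honest peers sent on time.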

So even low disparities look like a problem, and a low 'panic' threshold value won't help to prevent all attacks. However, lowering the 'panic' threshold to a fraction of an epoch probably makes sense, to filter out the most severe consequences, which prevent inclusion of attestations in blocks. As far as I understand, the ntpd 'panic' value of 1000s is needed for initial clock synchronization. After the clock is synchronized, one can lower it, because clock drift of more than several seconds a day is indeed unlikely. We've also considered a robust clock calibration approach: one can filter out big jumps in time, treating them as outliers, and calibrate clocks afterwards. The NTP standard contains an optional algorithm to do that, as far as I understand. So a custom solution should be able to limit the attacker's ability to introduce skew to a very low level, like a minute a day, maybe even less, since calibrated clocks should be much more predictable.
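
The outlier-filtering idea can be sketched as below. This is not the optional NTP algorithm mentioned above; it is an assumed median-based filter, and the function name, sample values, and `max_jump` parameter are hypothetical.

```python
from statistics import median

def robust_offset(samples, max_jump):
    """Estimate the clock offset (seconds) from repeated measurements,
    discarding samples that jump more than `max_jump` away from the median."""
    mid = median(samples)
    kept = [s for s in samples if abs(s - mid) <= max_jump]
    # Fall back to the median itself if every sample was discarded.
    return sum(kept) / len(kept) if kept else mid

# Hypothetical measurements: four consistent samples and one 45 s jump.
samples = [0.010, 0.012, 0.008, 45.0, 0.011]
estimate = robust_offset(samples, max_jump=1.0)
```

The 45-second jump is treated as an outlier, so the estimate stays at roughly 10 ms instead of being dragged to several seconds by a plain average.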

In general, I think that relying on administrative mitigations is not enough. Some mechanism to detect such problems should be introduced. And if there is a detector, then that's half of a clock synchronization protocol :). Byzantine-fault-tolerant clock synchronization protocols exist; my main concern is that they can be expensive and of limited accuracy (given message delays in a p2p graph). However, one can combine them with robust clock calibration and techniques to estimate network delays more accurately (e.g. filtering out network delay measurements which are outliers, i.e. too big).

ericsson49 commented 4 years ago

Presumably they would concern liveness but not safety?

It's an interesting theoretical question whether safety can be compromised. Accountable safety as stated by the Casper FFG paper cannot, since if validators are forced to be inactive, it is still formally the validators' fault.

But the beacon chain protocol contains other sub-protocols, so practical safety can be violated: otherwise honest validators may be forced to be inactive by a time-level attack, and after their balances fall due to the 'inactivity leak', an adversary which also has enough validators of its own can take control.

So a pure time attack should not cause safety violations, but if the safety property is adjusted to account for the above, then, in theory, a combined time + malicious validator attack can break the adjusted safety property.

I'm not sure I will have time to properly formalize the above, as I think such an attack would be extremely difficult to carry out in reality: validators will definitely respond with counter-measures when their balances start falling.

ericsson49 commented 4 years ago

I've written a draft proposal about how the problem can be solved: https://hackmd.io/GnJ_Cf4FSZW-BZImH8KF1w. I still need to work more on BFT clock sync protocols (that will be a separate document) and analyze the proposed solution more rigorously.

dapplion commented 12 months ago

After the Medalla incident in mid-2020, the consensus (that I've seen) is that we should use existing clock synchronization strategies and not roll our own.