Step back to FindNeighbors instead of FINDNODE

zilm13 commented 4 years ago

Hi, @fjl

I made some research with simulation including case in neighbors lookup we were talking about: in reply to query FINDNODE(d) return nodes not only from d, but, for example, from buckets with smaller index, if there are less than 16 ENRs in d bucket.

Research is here: https://hackmd.io/@zilm/BykU7RRGL Simulator made in Kotlin is here: https://github.com/zilm13/discv5/tree/161315190a647552aec64e800c13e92aa89a5282

My findings is that returning nodes from <= buckets doesn't work: while being better than current V5 method with young nodes, it becomes worse in a mature network, where most nodes have top buckets full. So I think that it's not a good idea.

But my research shows that overall V5's FINDNODE method always uses more traffic and is slower than V4's FindNeighbors. Traffic is not an issue for Discovery as protocol usage is negligible compared to blocks and other, but lookup time is an issue, especially when we are going to use it in topics search. So, I think it's better to step back here and use method with peerId/hash input.

But, if we stick with FINDNODE, you should add to specification some changes you already did in Geth:

limit number of FINDNODE requests to 3 per every node lookup
ask developers to verify distances for every node in NODES reply

It will help other developers to implement V5 better.

fjl commented 4 years ago

Thank you for this very comprehensive report. I think your report shows that the idea to use the bucket index in FINDNODE is not the best choice. We knew it's not optimal, but your text shows we should definitely improve it. I don't want to switch back to the discovery v4 semantics because it is much easier to attack the FindNeighbors message.

One thing I couldn't figure out is whether you are limiting the FindNodesStrict strategy to three requests per node. It doesn't look like it in the code. Maybe we should simulate that too.

Something else that might be a misunderstanding in the Java/Kotlin implementation of the protocol:

Original Kademlia and both V4 and V5 Discovery implementations require testing whether the head peer is alive when adding new peers to any full bucket[21].

In the discovery v5 spec, we explicitly recommend to use asynchronous liveness checks and a replacement cache.

zilm13 commented 4 years ago

I don't want to switch back to the discovery v4 semantics because it is much easier to attack the FindNeighbors message.

The issue is that current V5 semantics could be attacked in the same way. So, what ideas do you have?

One thing I couldn't figure out is whether you are limiting the FindNodesStrict strategy to three requests per node

I've not limited it, because I did simulator from spec, not from Geth code. I checked your code later. And, the set of nodes called Old network provides the similar outcome you get with such limit, because the worse node in that network have 3 top buckets full, and you always have more than 16 nodes in any table. Yeah, there is a small chance that you fall in a small index, but it's really small. I could test it explicitly, with copying your rule but it will be very similar to the results of Mature network simulation.

Something else that might be a misunderstanding in the Java/Kotlin implementation of the protocol

Yeah I remember this, but simulator should be simplified a bit. And I didn't have down nodes of any kind in this simulation.

fjl commented 4 years ago

The issue is that current V5 semantics could be attacked in the same way. So, what ideas do you have?

If the FINDNODE message contains the lookup target directly (like in discv4), an attacker node can return nodes which are very close to the target and make the lookup terminate early with a chosen result. This attack is not possible with the current FINDNODE message in discv5.

fjl commented 4 years ago

(sorry pressed wrong button)

zilm13 commented 4 years ago

@fjl oh, it's another point I'd like to change. In

The lookup initiator starts by picking α closest nodes to the target it knows of from the local table. The initiator then sends FINDNODE requests to those nodes. α is an implementation-defined concurrency parameter, typically 3. As NEIGHBORS responses are received, the initiator resends FINDNODE to nodes it has learned about from previous queries. Of the k nodes the initiator has heard of closest to the target, it picks α that it has not yet queried and sends FINDNODE to them. The lookup terminates when the initiator has queried and gotten responses from the k closest nodes it has seen.

To improve the resilience of lookups against adversarial nodes, the algorithm may be adapted to perform network traversal on multiple disjoint paths. Not only does this approach benefit security, it also improves effectiveness because more nodes are visited during a single lookup. The initial k closest nodes are partioned into multiple independent 'path' buckets, and concurrent FINDNODE requests executed as described above, with one difference: results discovered on one path are not reused on another, i.e. each path attempts to reach the closest nodes to the lookup target independently without reusing intermediate results found on another path. Note that it is still necessary to track previously asked nodes across all paths to keep the paths disjoint.

Why do we keep first part? We already have changed original Kademlia and are not obliged to follow it in every part. First method is proved to be easily attacked by number of researchers (for example, Ingmar Baumgart and Sebastian Mies "S/Kademlia: A Practicable Approach Towards Secure Key-Based Routing"). The second option, if we use 3 disjoint paths will not cost anything (though already mentioned article recommends 8 bucket 8 paths/ 16 bucket 4 paths setup). I'd require disjoint paths in spec.

fjl commented 4 years ago

What do you want me to write in the spec? I am trying to make the spec understandable even for people without a lot of knowledge of Kademlia. That's why I think its good to explain the simple lookup first.

zilm13 commented 4 years ago

No, no, I just ask why to allow joint search from first paragraph (which came from Kademlia), as it's very vulnerable. We could require developers to use 3 disjoint paths instead. So not may, but require in second paragraph. As original method is vulnerable.

And I got your idea about not sharing nodeId as protection. I think it's still vulnerable. You could return peers with uniform distribution (something alike of course), so few will be 100% hit on the next step etc. I will try to simulate this on Monday, to see whether it's viable.

update: I guess it will be something like 5*16 (where 5 is number of hops) or 6*16 real network nodes compared to 16 in V4 for the same attack. And no increase in PoW requirements (nodeId generation). 5-6x increase in ip address requirements sounds good, but I still think that amount of additional requests in normal network work is too big penalty for this improvement.

fjl commented 4 years ago

Let's try to think about it in another way: the lookup algorithm can have these abstract properties:

correctness: does it find the closest nodes at all? does it terminate?
performance: how many hops does it take to terminate?
total size: how many unique nodes, including intermediate nodes, did the lookup visit?

One thing to realize is that for discovery this last property (total size) is more important than performance. We use lookup for these three things:

As a way to walk the DHT in a randomized way. The goal here is to visit many unique nodes.
To find our own neighbors for bucket refresh.
To find the newest ENR of a specific other node for the 'resolve' operation. 'resolve' is almost never used. It's mostly for debugging.

Out of these three, (1) happens most often, but if you think about it (1) doesn't even need the correctness property. We need correctness for (2) and (3). We need performance for (3) only. This is why I think it's OK not to optimize the lookup for performance.

Maybe we should try to make another kind of simulation to check what the total number of nodes visited during lookup is with those different kinds of FINDNODE implementations/strategies. This would be very good to know.

fjl commented 4 years ago

I totally agree about the spec changes you requested so far:

There needs to be a limit for FINDNODE requests to a specific node during lookup.
Implementations must verify NODES distance.
Implementations "should" (not "may") do lookup over disjoint paths or something like it.

I will add these changes in the next spec update.

I don't want to change the FINDNODE parameter/semantics yet. We can always add another version in a future protocol update. Right now I'm happy you are researching this topic. We still have some time left to do experiments before the discv5 network is really launched, and topics are not implemented yet. I want to try and make the topics work before changing FINDNODE again.

fjl commented 4 years ago

Maybe it would help with the traffic to add a kind of multi-distance version of FINDNODE, like FINDNODE [d1, d2, ...] where the server fills the response buffer with nodes at d1, then d2, ... up to the response limit.

zilm13 commented 4 years ago

@fjl how about let's call it between us Find Neighbors Near The Distance(d) so we get best from both worlds. Function returns k (16) nodes, we don't need any additional requests. And we don't disclose search hash.

update: But it if we decide that we don't need any nodes except d, d+1, d-1 (because everything else is too far), your last suggestion looks cleaner. Sounds reasonable

fjl commented 4 years ago

I proposed the multi-distance request because response validation is still possible with that. For FIND-NODES-NEAR [d] no validation is possible on the client side.

zilm13 commented 4 years ago

@fjl yeah, I like it, we get 1 query and it has clear interface

ethereum / devp2p

Step back to FindNeighbors instead of FINDNODE #141