MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0

Decentralised NAT Signalling #365

Closed tegefaulkes closed 9 months ago

tegefaulkes commented 2 years ago

Requirements of this design

  1. Maintaining active connections between nodes to form a 'strongly connected' network.
  2. Decentralised connection forming and NAT punching.
  3. Decentralised message relaying.
  4. Possibly caching known connection chains to speed up connection forming.

Additional context

Specification

We need to be able to hole punch a connection to the target node in a decentralised way. This can happen in stages.

  1. Holding connections to form a strongly connected graph.
  2. Discovery of a connected path.
  3. Forming a direct connection using a connected path.

Holding connections

To ensure that a connection path is available we need an environment where a chain of connections to the target is likely to exist. To do this each node needs to hold active connections to other nodes. The distribution of these connections needs to ensure that the overall network is strongly connected. Bonus points for minimising the number of hops.

Using the Kademlia structure we can maintain active connections to the N closest nodes, but we can also hold connections to other nodes across the network. The closest nodes should maintain strong connections locally in the network, while the extra connections will form shortcuts between parts of the network that are not closely related. Note I mean close in the Kademlia sense of closeness.
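As a rough illustration of such a policy, here is a minimal TypeScript sketch (all names are illustrative, not the actual Polykey API):

```ts
// Sketch: hold the N closest nodes plus one shortcut per bucket.
// All names here are illustrative, not the actual Polykey API.
type NodeId = bigint;

// Kademlia closeness is XOR distance.
function distance(a: NodeId, b: NodeId): bigint {
  return a ^ b;
}

// Bucket index is the position of the highest differing bit.
function bucketIndex(self: NodeId, other: NodeId): number {
  return distance(self, other).toString(2).length - 1;
}

function selectConnectionTargets(
  self: NodeId,
  known: Array<NodeId>,
  n: number,
): Set<NodeId> {
  const targets = new Set<NodeId>();
  // Strong local connectivity: the N closest nodes.
  const sorted = [...known].sort((a, b) =>
    distance(self, a) < distance(self, b) ? -1 : 1,
  );
  for (const nodeId of sorted.slice(0, n)) targets.add(nodeId);
  // Shortcuts: one connection per bucket, bridging parts of the
  // network that are not closely related.
  const perBucket = new Map<number, NodeId>();
  for (const nodeId of known) {
    const i = bucketIndex(self, nodeId);
    if (!perBucket.has(i)) perBucket.set(i, nodeId);
  }
  for (const nodeId of perBucket.values()) targets.add(nodeId);
  return targets;
}
```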

Discovery

The discovery process is very similar to the Kademlia FIND_NODE process. The difference here is that we only search over the active connections (Proxy connections). Using this process we should be able to find a chain of connections that reaches our target node.

Generally the idea is that we are walking the graph of active connections to find a chain of connections that leads to our target. Naively this would be a pretty expensive and exhaustive search but utilising the Kademlia structure we can ensure that we only walk closer to our target at each step of the search.
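A greedy version of this walk, reusing the helpers from the sketch above, might look like the following (the `rpc` callback, which asks a connected node for its active connections, is a hypothetical placeholder):

```ts
// Sketch: walk the graph of active connections, always stepping to the
// neighbour closest (by XOR) to the target.
async function findChain(
  target: NodeId,
  activeConnections: Array<NodeId>,
  rpc: (node: NodeId) => Promise<Array<NodeId>>,
): Promise<Array<NodeId>> {
  const closestTo = (nodes: Array<NodeId>): NodeId =>
    nodes.reduce((a, b) => (distance(a, target) < distance(b, target) ? a : b));
  const chain: Array<NodeId> = [];
  let current = closestTo(activeConnections);
  while (current !== target) {
    chain.push(current);
    const next = closestTo(await rpc(current));
    // The Kademlia structure should let us step strictly closer each time;
    // if we cannot, this route is a dead end and another chain must be found.
    if (distance(next, target) >= distance(current, target)) {
      throw new Error('Dead end: no closer active connection');
    }
    current = next;
  }
  chain.push(target);
  return chain;
}
```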

Forming direct connections

Given a chain of connections A>B>C>D we should be able to form a direct connection A>D. We can form incrementally closer connections by using each step of the chain as a signaller to punch a connection to the next part of the chain. I will refer to this increasingly closer connection as a Feeler.

With the chain A>B>C>D we can start by requesting B to signal a hole punch between A and C. If the hole punch is successful then we now have a connection chain of A>C>D. We can then repeat the step, using C as the new signaller, to form a connection between A and D, thus completing the desired direct connection.


 ┌─┐     ┌─┐     ┌─┐     ┌─┐    This is the existing connection chain.
 │A├─────►B├─────►C├─────►D│    We start out by asking Node B to signal
 └─┘     └─┘     └─┘     └─┘    hole punching to node C thus creating a
                                connection between A and C.

 ┌─┐     ┌─┐     ┌─┐     ┌─┐
 │A├─────►B├─────►C├─────►D│    With an active connection to C we can use
 └┬┘     └─┘     └▲┘     └─┘    C to act as a signaller to form a connection
  │    Feeler     │             between A and D.
  └───────────────┘

 ┌─┐     ┌─┐     ┌─┐     ┌─┐    We now have a direct connection through
 │A├─────►B├─────►C├─────►D│    the NAT between A and D.
 └┬┘     └─┘     └─┘     └▲┘
  │         Feeler        │
  └───────────────────────┘

At each step the 'signalling' node is directly connected to both nodes and thus is aware of the 'NAT-ified' ports and addresses. It can easily coordinate a hole-punch between the two nodes. So long as we can find a connection chain we can form a connection between two nodes through NATs without a centralised signaller.
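A sketch of this chain-collapsing loop, continuing from the sketches above (`signalHolePunch` and `connect` are hypothetical primitives standing in for whatever the Proxy will expose):

```ts
// Sketch: collapse a chain [A, B, C, D] (A is ourselves) into a direct
// connection A>D, using each hop as the signaller for the next hole punch.
async function collapseChain(
  chain: Array<NodeId>,
  signalHolePunch: (signaller: NodeId, to: NodeId) => Promise<void>,
  connect: (to: NodeId) => Promise<void>,
): Promise<void> {
  // Invariant: we always hold a direct connection to `signaller`.
  let signaller = chain[1];
  for (const next of chain.slice(2)) {
    // Ask the current signaller to coordinate a hole punch between us and
    // `next`; on success the "Feeler" advances one hop along the chain.
    await signalHolePunch(signaller, next);
    await connect(next);
    signaller = next;
  }
  // We now hold a direct connection to the last node in the chain.
}
```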

If we fail to form a more direct connection then an alternative chain will need to be found. Given a sufficiently connected graph, there should be multiple routes to the target. If for some reason a direct connection can't be formed, this process should ideally connect us with a node that IS holding an active connection to the target node and can act as a relay for it. It's possible that not all nodes will want to act as a data relay, so we may need a way to negotiate with the target node to connect to a relay node that will relay data.

Full process

We can combine the discovery and connection forming to create increasingly closer connections to our target. The process will look something like this.

  1. Get a list of active connections sorted by distance to the target.
  2. Pick N out of that list in order of closeness to the target.
  3. Form a direct connection with the newly found nodes using the previous connections as signallers.
  4. Repeat from step 1.

Eventually this should form a direct connection to our target. At any point in time during this process we can concurrently have alpha Feelers walking closer to the target.

If we had existing discovery knowledge about the network, such as known connection chains that lead to the target, then that can be used to greatly speed up the process. If we had knowledge of an existing connection chain that leads to the target then we could just do the connection forming process without doing any discovery. Likewise, if we had a partial chain that we know leads closer to our target then we can use that as a starting point and reduce how much we need to search.

Additional Resources

Sub-Issues & Sub-PRs created

  1. 618

  2. 537

tegefaulkes commented 2 years ago

There are a few things that come to mind that need to be addressed for this.

  1. Separate hole punching out of NodeConnection/Manager and just have the Proxy handle it.
  2. The Proxy needs to maintain a minimum of active connections as a superset of the NodeConnections to maintain network connectivity.
  3. Our Kademlia implementation needs to be modified slightly to support discovery over active connections.
  4. Implement connection forming across a chain. What do we call this? 'Chain collapsing'? 'Chain reducing'?
tegefaulkes commented 2 years ago

This will need a bit of discussion to refine the idea some more but I think we have a good basis here. For a start I think some refactoring work needs to be done for hole-punching and related messaging.

CMCDragonkai commented 2 years ago

A diagram here would help understand what the problem is.

Compare:

  1. Using gossip/kademlia-driven casting to flood the network with the relay message.
  2. Using 1-hop point to point hole punching, and thus combining the "resolving query" with the "relay query".
CMCDragonkai commented 2 years ago

Also consider the difference between closest node connections vs closest node entries in the NodeGraph, and the difference between NodeConnection and the Proxy's ConnectionForward.

tegefaulkes commented 2 years ago

To maintain network connectivity we need to maintain a bunch of open connections for each node. This raises some questions.

  1. How many concurrent connections can we have before we're using too many resources?
  2. How many connections does each node need to maintain strong connectivity of the network? Is there a point where the network 'percolates'?
  3. If an existing connection is required to start the hole-punch process then is there any value in storing information in the NodeGraph for nodes we're not currently connected to?
  4. If only existing connections are useful is there any point in having the NodeGraph be persistent in the DB?

The main one: is there any point in distinguishing whether entries in the NodeGraph are active connections or not? What can we do with the host/port information without having an active connection to the node? If there is no use for this information when we're not connected to the node then shouldn't we only store information about nodes we're actively connected to?

tegefaulkes commented 2 years ago

The other approach we considered was to use Kademlia to gossip/broadcast a hole-punch relay message to the target we are trying to connect to. We would pick N nodes out of the nodes we were connected to that are the closest to the target node and ask them to relay a hole-punch message. If each node receiving this message relayed it to the N of its connected nodes that were closest to the target node, then we would eventually get the message to the target node.

This method has some disadvantages.

  1. No control over when the message reaches the target node, thus we can't coordinate timing on the hole punching very well.
  2. This method has a multiplying effect on the message, creating more messages at each stage with each message pointing towards the target, possibly DDoSing the target.
  3. We would still need some way to get the proper port information that we need to punch through the NAT.
tegefaulkes commented 2 years ago

It may not be useful to store the address:port information of NodeIds that are not currently connected in the NodeGraph. But what would be extremely useful is storing previously seen connection chains. If during the discovery process we can obtain information about previously seen connection chains to our target then we can greatly speed up the process.

Keeping the previously seen address:port can also be useful if the target node can be connected to directly, in either of these cases:

  1. target has a public facing IP address.
  2. target's ports were manually forwarded.
  3. target automatically negotiated port forwarding using PCP, NAT-PMP, UPnP-IGD or others?
tegefaulkes commented 2 years ago

There is a Node.js library for negotiating port mappings with the NAT. This will be very useful for avoiding hole punching in the first place, as well as reducing the number of keep-alive packets we may have to send.

https://github.com/freedomjs/freedom-port-control

CMCDragonkai commented 2 years ago

Persistent connections that are needed to maintain "NAT reachability" will need to be maintained at the NCM level, not at the proxy connections. Because once we have composed a proxy connection, when we terminate the GRPC client, we cannot be sure that GRPC server will handle this correctly without also receiving a TCP FIN packet. At the same time, this avoids adding additional complexity to the connection management at the proxy connection level.

Note that "NAT reachability" is the ability for a node to be reachable behind a NAT even while idle.

CMCDragonkai commented 2 years ago

Assigning @emmacasolin to this issue, to be resolved after #357 is merged. It's a good continuation of that PR. Please review this issue's details, and we can review it on Monday next sprint.

emmacasolin commented 2 years ago

Something like Dijkstra's algorithm might be useful here for forming increasingly closer connections to the target node, however, we would need to modify the definition of "closeness" to better meet our needs.

Rather than defining the "closest" node (to the target node) using solely the Kademlia way, we could incorporate a node's held proxy connections when deciding which node to use as a signaller. The "closest" node can be defined as the node that we have an open connection to that has an open connection to the closest (Kademlia definition) node to the target node. If we also maintain that every node must at least have an open connection to its k closest nodes then, provided k >= 2, we are guaranteed to be able to find the "shortest" path to the target node, since the next connection we make will always get us closer to the target. The connections made during this process could potentially be cached to speed up connection formation in the future.
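A minimal sketch of that selection rule, reusing the `NodeId`/`distance` helpers from the spec sketch above (the map of each connection's own connections is assumed to be known, e.g. gathered over RPC):

```ts
// Sketch: choose as signaller the node we hold an open connection to whose
// own open connections reach closest (by XOR) to the target.
function pickSignaller(
  target: NodeId,
  ourConnections: Map<NodeId, Array<NodeId>>, // candidate -> its connections
): NodeId | undefined {
  let best: NodeId | undefined;
  let bestDistance: bigint | undefined;
  for (const [candidate, theirConnections] of ourConnections) {
    for (const reachable of theirConnections) {
      const d = distance(reachable, target);
      if (bestDistance === undefined || d < bestDistance) {
        bestDistance = d;
        best = candidate;
      }
    }
  }
  return best;
}
```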

Take for example this graph of nodes, where k = 2. The lines are active connections, and some nodes have more than k active connections. If all nodes have exactly k active proxy connections, then using the closest Kademlia-wise node to the target we're connected to will take the same number of connections as taking active connections into account. If any nodes have more than k active proxy connections, however, then using only the Kademlia definition can be either the same or slower.

[image: example node graph with k = 2, lines showing active connections]

In this example, Node 10 wants to connect to Node 2. If it chooses the next closest node to the target using the Kademlia definition of closest then it will choose 11 as a signalling node, followed by 0, then 1, then finally the target node 2. In this situation three new proxy connections would need to be established.

If it were instead to choose the signalling node as the node with an active connection to the next closest node to the target using the Kademlia definition then it will choose 9 as a signalling node, followed by 1 and then the target node 2, needing only 2 new proxy connections to be established.

tegefaulkes commented 2 years ago

Looks good. This was generally the idea I had about discovery but re-reading the spec I see that it's pretty vague on this.

If I recall my thinking correctly, we would've asked all of our connected nodes for their X closest nodes to the target. Out of that larger set we would select the closest nodes and make connections to them. I think that's in line with your modified closest definition above.

Keep in mind that in terms of optimisation I think it's more important to minimise the number of alternate routes we explore than the number of hops we take to the target.

CMCDragonkai commented 2 years ago

It's a good idea to do a simulation of this in a simple script. You'll have to start off with some cases:

  1. You may know only the NodeID, or you may know the NodeID and NodeAddress. If you don't know the NodeAddress, you can combine the resolving operation with signalling requests somehow.
  2. You should be doing "ICE", which means attempting the direct connection and the signalling simultaneously; whichever one finishes first wins, and the other operation is cancelled. Assume that we can do this.
  3. The list of active connections you have will always be a subset of your entire node graph contact list. However it may not be optimal to only use the list of active connections, because one of the inactive contacts may be the one that can be contacted to do the signalling. There are likely some tunable parameters/heuristics needed to make optimal use of both the entire contact list and just your active connections (preferring your active connections over inactive contacts, obviously).
  4. There may not be a globally optimal solution, so a "good enough" heuristics is sufficient.
emmacasolin commented 2 years ago

If we're only maintaining active connections to the k closest nodes, isn't it possible that sub-groups of close nodes could become "detached" from the rest of the network? Only maintaining connections to the closest nodes seems like it would encourage connections to be clumped together, meaning that clumps of nodes may be only connected to each other and be unreachable to other nodes outside of the "clump".

CMCDragonkai commented 2 years ago

Yea, so we may need to also maintain connections to far-off nodes too to balance it out.

tegefaulkes commented 2 years ago

It bears looking into. But we do want strong 'local' connectivity. I was thinking about maintaining some 'regional' connections spread across all buckets to act as shortcuts during discovery.

I don't think we can end up with disconnected groups if we just rely on the closest nodes, but that is something that should be checked.

CMCDragonkai commented 2 years ago

Imagine k is 20: if every node in a set of 20 close nodes connects to every other node in that set, you have a cluster of 20 close nodes with no outside connections.

CMCDragonkai commented 2 years ago

Actually I don't know if that's the case. @emmacasolin can you use https://github.com/MatrixAI/js-polykey/issues/158#issuecomment-1066035434 (script: https://github.com/MatrixAI/js-polykey/pull/326#issuecomment-1065905225) and work out a visualisation/simulation.

If the minimum number of connections is 1, then it should be possible to have disconnected pairs.

But if the minimum number of connections is greater than 1, I imagine there's less likelihood for that to be the case, even if all nodes have the same complete node graph. But I'm not sure if this is the case, can you verify this?

Also remember that connections are directional. So we have a directed graph, because connections are client to server.

Also once another node has made a directional connection to you, you should be able to make the reverse directional connection without hole punching.

Active proxy connections may be divided between outgoing and incoming. These are separate.

CMCDragonkai commented 2 years ago

You could use graphology, ngraph or Cytoscape. https://github.com/anvaka/ngraph. Related is https://github.com/matrixai/js-polykey/issues/367

CMCDragonkai commented 2 years ago

For future reference, most "visual" aspects of the JS ecosystem are outside of Node.js conventions. You're now in the territory of frontend JS development, which requires different context. Here are some resources:

Some nodejs packages purport to provide terminal rendering. These are often quite niche, and more advanced visualisations often don't work or are just toys. Simpler visualisations are possible though with https://github.com/vadimdemedes/ink and https://github.com/kroitor/asciichart.

Some nodejs packages support desktop GUIs like imgui and other graphics-libraries. These tend to be OS-specific, and some are cross-platform. These are more advanced of course, and are intended to create general GUI programs.

Recommendations:

  1. If you are comfortable with web-dev, use the index.html approach; it ultimately leads to our Polykey-Desktop work, which uses Electron as a browser frontend.
  2. ASCII terminal ones can be fun for the pk CLI.
tegefaulkes commented 2 years ago

I've done some data visualisation in the past using the index.js method + d3.js and canvas. You can easily render what a graph looks like this way.

Here is an example of a force directed graph. https://observablehq.com/@d3/force-directed-graph

emmacasolin commented 2 years ago

I've been running some simulations to work out how we can create a fully connected graph in as few connections as possible, and it turns out that the most efficient strategy is for every node to maintain one connection per bucket (from its own perspective).

This essentially comes down to Kademlia's definition of closeness (XOR), which causes nodes in the same bucket to be considered close to each other. For example, these are the XOR distances between node 7 and its surrounding nodes:
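(The embedded image here was lost; since XOR distance is deterministic, the values it likely showed can be reconstructed. Note how nodes 0-6 all land within distance 7 of node 7, while node 8 jumps to distance 15.)

| Node | XOR distance from node 7 |
|------|--------------------------|
| 0    | 7  |
| 1    | 6  |
| 2    | 5  |
| 3    | 4  |
| 4    | 3  |
| 5    | 2  |
| 6    | 1  |
| 8    | 15 |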

Since XOR is symmetric, this also means that we're going to be maintaining twice as many connections as we need (since, if Node A is connected to Node B, Node B can easily connect to Node A regardless of there being NAT on either side).

After some trial and error, I've found out the following:

  1. If every node holds x connections, and each of these connections is to a node in a different bucket (and only then), the number of nodes per clump (of connected nodes that are disconnected from any other nodes) will be 2^x. For example, if every node holds four connections, and these connections are to the 1st closest, 2nd closest, 4th closest, and 8th closest nodes to itself, every clump will have 2^4 = 16 nodes.
  2. For any other configuration of connections (whether all of the connections are to nodes in the same bucket, or only two connections are to nodes in the same bucket), if every node holds x connections, the number of nodes per clump will be 2^(x-1). For example, if every node holds four connections, and these connections are to the four closest nodes (1st, 2nd, 3rd, and 4th closest), every clump will have 2^(4-1) = 2^3 = 8 nodes.

Keep in mind that this is only true if every possible node id is in use. Additionally, since we only need connections to go one way, not every node needs to hold x outgoing connections for the above properties to hold, so long as outgoing connections + incoming connections >= x for every node. So taking the example from point (1) above, we wouldn't need 32*4 = 128 total connections; just 64 is technically sufficient.

This is still a lot of connections though, especially considering the size of our address space. To meet the conditions above we would need to have 256 connections per node (summing incoming and outgoing connections). So it's worth doing some more simulations to look into what happens when there are fewer than the maximum number of possible nodes, which is the most likely situation at any given moment.
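These clump sizes are cheap to re-check. Here is a minimal union-find simulation sketch (illustrative only, not the actual simulation script used):

```ts
// Sketch: every node connects to node ^ offset for each XOR offset (the node
// at XOR distance `offset` is exactly node ^ offset), then we count connected
// components (clumps), treating connections as undirected per the symmetry
// argument above.
function clumpSizes(bits: number, offsets: Array<number>): Array<number> {
  const n = 1 << bits; // assumes every possible node id is in use
  const parent = Array.from({ length: n }, (_, i) => i);
  const find = (x: number): number =>
    parent[x] === x ? x : (parent[x] = find(parent[x]));
  for (let node = 0; node < n; node++) {
    for (const offset of offsets) {
      parent[find(node)] = find(node ^ offset);
    }
  }
  const sizes = new Map<number, number>();
  for (let node = 0; node < n; node++) {
    const root = find(node);
    sizes.set(root, (sizes.get(root) ?? 0) + 1);
  }
  return [...sizes.values()];
}

// Point (1): connections to the 1st, 2nd, 4th and 8th closest nodes.
console.log(clumpSizes(8, [1, 2, 4, 8])); // clumps of 2^4 = 16
// Point (2): connections to the four closest nodes.
console.log(clumpSizes(8, [1, 2, 3, 4])); // clumps of 2^(4-1) = 8
```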

tegefaulkes commented 2 years ago

Here is a useful description of what happens when nodes buckets are mapped from one reference NodeId to another. https://github.com/MatrixAI/js-polykey/issues/212#issuecomment-1082551563

Useful to see how one node sees the nodeGraph relative to another node.

tegefaulkes commented 2 years ago

A summary of this is:


         ┌──────────────────┬────────────┬─────────────────────┐
 Node A  │255 . . . . . N+1 │   bucket N │ N-1  .  .  .  .  0  │
         └────────┬─────────┴─────┬──────┴────┬────────────────┘
                  │               │           │
                  │               │  ┌────────┘
                  │               │  │
                  │               └──┼───────────────┐
                  │                  │               │
         ┌────────▼─────────┬────────▼───┬───────────▼─────────┐
 Node B  │255 . . . . . N+1 │   bucket N │ N-1  .  .  .  .  0  │
         └──────────────────┴────────────┴─────────────────────┘

  So from Node B's perspective, everything in bucket A(N)
  is in buckets B(N-1:0), and everything in A(N-1:0)
  is in B(N). Bucket A(255:N+1) = B(255:N+1).
  Note that bucket A(N) contains Node B in this scenario.
tegefaulkes commented 2 years ago

I have a better understanding of why the clustering is happening now. Without going too much into the details, each bucket forms its own cluster. For example, bucket 0 would cause every node to form a pair with one other node, bucket 1 would cause a node to pair with 2 other nodes, bucket 2 to pair with 4 other nodes, etc.

If you form connections just within a single bucket then you would form strongly connected clusters like what we're seeing above. For example, in a network of 8, if you only connected within bucket 0 then you see clusters like 0-1, 2-3, 4-5, 6-7. Bucket 1 would cluster like 0-1-2-3, 4-5-6-7, and bucket 2 would form a cluster of 0-1-2-3-4-5-6-7, the whole network. So the closer buckets form smaller and smaller clusters, while the furthest bucket has the potential of connecting the whole network.

Notice how we are subdividing the network with each bucket? Bucket 2 is the whole network, bucket 1 forms 2 clusters, and bucket 0 forms 4 clusters. You can also view it the other way: going from bucket 0 to 1 to 2, each bucket forms pairs between the clusters formed in the previous bucket. This is important since only the furthest bucket, bucket 2, can form the connections required to connect the whole network together.

This aligns with what we've observed. If we only pick the closest nodes then we are forced into forming tight clusters of nodes. Likewise, if we form a connection to every bucket then we are allowing connections between every possible cluster and thus connecting the whole network, but just selecting from the furthest bucket will naturally do this for us. This is why we saw:

  1. Clustering when connecting to the closest nodes.
  2. Good connectivity when connecting to the furthest nodes.
  3. Best connectivity when connecting at random.

So far I've likely explained this very badly so please ask some questions. This is also in desperate need of a diagram to punctuate the point.

I have some more notes to add but let me think through them first.

tegefaulkes commented 2 years ago

So we need to form good connectivity of the network. We can do a decent job of this by only selecting from the furthest known nodes, but this has a caveat: it would completely ignore how Kademlia finds nodes effectively in the first place. You see, the point of this clustering is to ensure that the closer you are to a node the more likely you are to have information about it. When you only store 20 nodes per bucket, you are very unlikely to have information about a node in bucket 255 but almost certain to have information about nodes in the closest buckets.

To have discovery work properly we need to replicate this with our connections. If we only maintain connections to the furthest nodes then Kademlia will have a hard time finding its target. Not impossible, but it feels like it would have to search almost randomly.

In practice this likely looks like:

  1. We need to maintain connections with the furthest bucket to ensure connectivity with the wider network. Otherwise we end up with the clustering.
  2. We need to maintain connections spread across all buckets to ensure that Kademlia can find reasonably close nodes at each step.
  3. Maintain some connections with the closest nodes to ensure almost certain connectivity when we are getting close to our target in the search.

We do need to experiment with the distribution and number of connections we can maintain. If we can just maintain a connection to every bucket then it kinda just solves the problem for us.

CMCDragonkai commented 2 years ago

When we tried the farthest bucket with bottom-K it still resulted in clustering. Random selection had the best scalability.

tegefaulkes commented 2 years ago

The farthest bucket allows connections across the whole network. Ensuring full connectivity of the network at that point is a percolation problem.

CMCDragonkai commented 2 years ago

Just wanted to clarify that #182 is about decentralised relaying. Whereas this issue is about decentralised signalling. Signalling is currently only doable through centralised seed nodes such as testnet.polykey.io. Decentralised signalling has to be achieved first before decentralised relaying. Prior art in relaying would be DERP relays used by wireguard and traditional TURN relays.

Both concepts are an augmentation of the ICE protocol (interactive connectivity establishment).

CMCDragonkai commented 2 years ago

@tegefaulkes brought up the fact that this is necessary if we have a seed node cluster. If 2 client nodes are connected to separate seed nodes, the seed nodes have to be connected to each other to be able to perform decentralised NAT signalling, as signalling requires you to have an active connection. So this is related to #403.

I wonder though, if our "exit" IP from the seed nodes will all be the same? I imagine it won't be, because they all have different public IP addresses.

If we move to public and private subnet separation, then the NAT gateway would in fact mean that all seed nodes have the same public IP too, so it would be possible for 2 seed nodes to maintain connections to an agent. However this means multiple NodeIds to a given IP address... forward and reverse.

CMCDragonkai commented 1 year ago

@tegefaulkes

One of the things we realised during our manual testing is that our signaling system is not scalable atm.

In order for the seed node to signal for hole punching, it must be connected to that node.

This implies that all (non-seed) nodes have to connect to ALL seed nodes. So if we have 10 seed nodes, then every new node connects to all 10 seed nodes.

This is because, assuming the non-seed node (let's call this the client node) is behind a port-restricted NAT, it would not accept packets on its public IP and port from a new source.

So if this client node is connected to seed node A, but not seed node B, then seed node B would not be able to send a message to the client node.

This is not scalable, because we aren't partitioning the workload of signalling amongst the seed nodes. Every seed node is duplicating the work of all the other seed nodes.

Furthermore, during the ICE procedure, a client node won't know which seed node has a connection to the target node. Of course, if we assume that the target node is also connected to the seed nodes, then we could just choose 1 seed node to do the signalling. However if we contact all the seed nodes as well, we end up asking all of them to do the signalling. So for N seed nodes, there will be N signalling messages. These extra signalling messages are unnecessary and just cause useless congestion.

There are some solutions available to us though.

We can make use of the "quorum reads/writes" idea. This is used in distributed leaderless systems for replication.

Essentially, for a given n nodes, the set of writers and the set of readers must overlap, meaning w + r > n.

For example suppose we have 11 seed nodes. Then what is the minimal number of connections that every client node has to make such that any other client node that asks to do the signalling will succeed?

It will be something like (n + 1)/2. So if every client node connects to a random set of 6 seed nodes, then when a client node tries to do ICE it can just ask its own set of 6 nodes, and it is guaranteed that these 6 nodes will overlap with the 6 seed nodes that the target node is also connected to.
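As a quick check of the overlap guarantee with the numbers above: w + r = 6 + 6 = 12 > 11 = n, so any two sets of 6 seed nodes must share at least w + r - n = 1 seed node, meaning some common signaller always exists (assuming all those connections are alive).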

The seed nodes can have a signalling policy built in, where it doesn't bother trying to signal if it doesn't have an active connection to the target node. However this policy may be a bit too strict. Not sure.

This idea is known as a "strict quorum". There's an alternative called "sloppy quorum", which I'll examine later.

While this is better than having every client node connect to all seed nodes, it's still a large number; it's roughly more than half of all seed nodes.

Consider if it was perfectly partitioned: 2 nodes each do 50% of the total work. But in this strict quorum, with 5 nodes, each node is doing 60% of the total work, with a lot of additional wasted work. I calculated this using the binomial coefficient: with 5 nodes, w and r are 3; there are C(5,3) = 10 possible quorums, and C(4,3) = 4 of them exclude any given node, so each node appears in 10 - 4 = 6 of the 10 quorums, i.e. 60%.

We should be able to improve this further if:

  1. Sloppy quorum could be used.
  2. Since the NodeIds are already in an ordered structure, we could take advantage of the kademlia structure to intelligently signal the nodes.

For 2., I'm not sure, because any given node may wish to connect to any other node, so it wouldn't make sense to not connect to an overlapping node. It just seems that every node has to overlap with every other node in their connection to a common seed node, assuming signalling is necessary.

Perhaps if the seed nodes are able to forward signalling requests that would also help.

amydevs commented 1 year ago

> It may not be useful to store the address:port information of NodeIds that are not currently connected in the NodeGraph. But what would be extremely useful is storing previously seen connection chains. If during the discovery process we can obtain information about previously seen connection chains to our target then we can greatly speed up the process.
>
> Keeping the previously seen address:port can also be useful if the target node can be connected to directly, in either of these cases:
>
> 1. target has a public facing IP address.
> 2. target's ports were manually forwarded.
> 3. target automatically negotiated port forwarding using [PCP](https://www.rfc-editor.org/rfc/rfc6887), [NAT-PMP](https://www.rfc-editor.org/rfc/rfc6886), [UPnP-IGD](https://openconnectivity.org/developer/specifications/upnp-resources/upnp/internet-gateway-device-igd-v-2-0/) or others?

C lib for UPnP that works on POSIX-compliant systems: http://miniupnp.free.fr/. Could be useful.

CMCDragonkai commented 10 months ago

So we are going to need this for Beta CLI launch.

I'm exploring more of this idea here https://chat.openai.com/share/de1b9d93-755b-4fb5-91e8-1288fa7c8300. I need someone to expand this conversation to include what we do with the MDNS discovered nodes, since they are not part of the node graph atm.

It seems a multi-hop signalling system is needed to find the closest nodes to connect from. However, I want this to be as efficient as possible; can the operation to hole punch the closest nodes be done in a similar way to how we find the node information? We want to avoid amplification here.

@amydevs comments regarding MDNS required.

@tegefaulkes any new things coming from the nodes domain that has been refactored recently?

CMCDragonkai commented 10 months ago

Also, #599 is relevant to this discussion, since the mainnet/testnet page will need to acquire information from the seed node cluster, and if nodes are connecting to only m/n of the seed cluster then this may impact the reconciliation of all of the information.

tegefaulkes commented 10 months ago

Information from MDNS is not relevant to this process. Address information within a single machine or a LAN network is not useful for punching in from an external network.

Besides that, what has changed in the nodes domain since writing this up?

  1. Multi connection logic.
  2. Complete network stack overhaul.
  3. NAT signalling and hole punching better resemble what this spec was expecting, where it's an atomic operation between a source, target and signalling node.
CMCDragonkai commented 10 months ago

I need to make the operation generic from the perspective of A trying to connect to B. It's important that it's efficient, even if internally it's doing multi-hop hole punching.

tegefaulkes commented 10 months ago

Depends on what you mean by multi-hop hole punching.

This spec describes a method of finding a common node to signal with by doing a Kademlia find operation over active connections in the network. We can optimise this in some ways to minimise the amount of searching that needs to be done. Distributing information about which public signalling nodes a given node is contactable through can help a lot. Tracking which nodes are public and using them as starting points for a search can help as well.

But doing something like sending a signalling message across multiple hops, I'm very resistant to. Not that it's not an option. But how do we send a message like that very quickly across multiple hops? How do we discover the connection chain between two nodes remotely without the amplification problem?

CMCDragonkai commented 10 months ago

Multi-hop means using a bridge node at each step to help facilitate a connection.

CMCDragonkai commented 10 months ago

Ok, I just re-read everything in this issue. This is actually a very complex problem that has a lot of different kinds of solutions with various trade-offs.

I'm going to try to summarise the situation right now.

The root of this problem is the existence of NAT. Even with IPv6, NAT is likely to continue being a problem. This is because even without a NAT gateway, there may be firewalls that simply default-deny unknown packets. And this is likely to be true on mobile networks regardless of any technology evolution (ignoring the fact that hole punching itself would not work here, but solving this problem is necessary for efficient decentralized relaying too).

Currently we have a fixed set of seed nodes that is deployed into AWS. Let's take testnet.polykey.com as an example. Details on this is being discussed in #599.

Currently all PK nodes upon launching with --network testnet will connect to all well-known seed nodes. The seed node connections are considered special, and they are not subject to the node connection TTLs that all other node connections are subject to. That means they do not expire even if there's no RPC-level activity on them (requires confirmation on https://github.com/MatrixAI/js-rpc/issues/52).

This ensures that all nodes have a port-mapping established.

The network stack has been completely replaced with QUIC. Furthermore, we now only have 1 single QUICSocket using 1 single UDP port. All connections to other nodes are always multiplexed into this single QUIC socket and thus single UDP port.

Suppose our seed cluster has 2 seed nodes. That means right now all nodes launched will have 2 QUIC connections, but 1 QUIC socket, and thus 1 port mapping in their NAT (assuming NAT exists for them).

During a recent testnet 6 test, we were able to connect 2 separate nodes both with hole-punched NAT between Sydney and Austin. This was done with the help of the seed nodes facilitating the signaling operation. Atm, there's a bug with both nodes being able to hole punch each other #605, but there's on-going progress on that from #609 (@tegefaulkes confirm please).

This works because all nodes are connected to the same seed nodes. Which means when node A wants to connect to node B, it can basically ask any or all of the seed nodes S1 and S2 to forward on the signaling message.

In relation to https://github.com/MatrixAI/Polykey/issues/365#issuecomment-1308564815, this means there's a strict quorum, well it's completely overlapping basically.

Now why does an overlap matter? The reason is simple: the seed nodes all have different IPs. In a prior testnet integration #403 we realised that we are not going down the route of a "distributed and load balanced" seed node cluster architecture, as this goes against the principle of decentralisation. Therefore, because the seed nodes have separate IPs, during port mapping at the NAT layer, if node B only connected to S2 while node A connected to S1, then node B's NAT would not accept packets from S1, as its IP and port are not part of the port-mapping structure in B's NAT. Basically they would be dropped as unknown packets.

From a NAT port mapping perspective, it is important to remember that these mappings are temporary, and may disappear after 20 seconds of zero packet activity. Primarily we can assume that they map a particular local IP and local port inside the LAN to an external IP and external port. Because we use the same QUIC socket for a given node, the local IP and local port are always the same, but the external IP and external port will vary for every external node it is connecting to.

So node A and node B must share a common seed node, in this issue's parlance called the "signaling node", simply because the common signaling node's IP and port must be in both A's and B's NAT port mappings so that both A and B accept packets from it.

This is the core problem of this issue.

Now decentralized signaling means achieving 2 things:

  1. Scalability of the seed nodes
  2. Scalability of the entire network

These 2 are somewhat interrelated. Enabling the seed node architecture to scale: by scalability we mean that for every new seed node we deploy, do we get superlinear, linear or sublinear returns to overall network capacity (capacity being measured by the multi-constrained limits of how big a practical, realistic network can be).

The current architecture does not scale, because every node is potentially doing 100% of the work of all signaling operations and 100% of all connections (which is to say, if there was 1000 nodes launched, there would be 1000 connections to every single seed node).

The ICE protocol in each node is asking all the seed nodes concurrently to perform the signaling operation. Although technically it could also just ask 1 out of all of them, with the assumption that all seed nodes are connected. @tegefaulkes can you confirm where in the source code this is decided upon?

Choosing only 1 out of all seed nodes to do this reduces the workload somewhat; however, all nodes still have to connect to the common set of seed nodes in its entirety.

As per https://github.com/MatrixAI/Polykey/issues/365#issuecomment-1308564815, it is possible to work around this a bit by introducing strict or sloppy quorums, thus enabling at least one node to be the common seed node. However this introduces the problem of knowing which node is the actual common seed node to perform the signaling role. And such a solution would only work in the context of the blessed seed node cluster.

As an aside, #599 is also discussing the change in the DNS record architecture to allow for there to be a public dashboard of the network. Which actually fits into our designs for private network bootstrapping, and also means there will need to be a change from hardcoding trusted seed node IDs to instead relying on a root certificate for every node which would require #154. Which enables dynamic seed node autoscaling.

Now earlier in this issue, we imagined that getting every node to connect to its closest nodes is a decentralized co-ordination algorithm, as it means we can expect every node to have connected to its closest nodes, thus placing a NAT port mapping for its closest nodes, and therefore being able to re-use Kademlia closeness to "find" the potential signaling node to use to connect to a particular target node.

However we later realised, that this kind of decentralized co-ordination would not form good connectivity across the entire network graph. https://github.com/MatrixAI/Polykey/issues/365#issuecomment-1096346552 and #372.

So even if we were always actively connected to our closest nodes (where active means regular packet activity, and therefore the NAT port mapping is maintained), if node A was not in the same cluster as node B, then none of the potential signaling nodes would be reachable. This problem is transitive: if the potential signaling nodes are not reachable, then none of the potential signaling nodes of the signaling nodes will be reachable... etc.

Empirical simulation shows that we could have good connectivity with random node connections, but there's discussion about how the reason this works is primarily due to having connections to the farthest bucket. https://github.com/MatrixAI/Polykey/issues/365#issuecomment-1098710348

@tegefaulkes, can you explain what the difference is between spreading random active connections across the buckets, versus just random active connections across the entire graph. My impression is that they are roughly the same.

You say:

> We need to maintain connections with the furthest bucket to ensure connectivity with the wider network. Otherwise we end up with the clustering.

This conflicts with empirical summary from the simulation saying that random connections work fine too.

Is your objection due to the fact that setting up random connections doesn't work with kademlia atm, since we always try to find the closest 20 nodes first? Also remember that a node record is only added to the node graph upon a successful connection, meaning that we would likely connect to up to 20 nodes to fulfil the bucket size, and this is likely to be spread through the first few buckets.

Then are you saying that we add additional methods into our node graph to also find the farthest 20 nodes too?

I'm just confused as to whether your comments came after the results showing random connections maintained connectivity with least complexity, and therefore is really arguing that while random works fine, structurally we need to fit into kademlia's way of finding nodes to connect to anyway, thus we now need to find closest nodes, farthest nodes... or just give me 20 random nodes across the entire graph? This is in reference to https://github.com/MatrixAI/Polykey/issues/365#issuecomment-1098724503 since you posted that on the same date as https://github.com/MatrixAI/Polykey/pull/372#issuecomment-1098738373.

...MORE COMMENTARY COMING

tegefaulkes commented 10 months ago

Replying to parts of the above comment...

"During a recent testnet 6 test, we were able to connect 2 separate nodes both with hole-punched NAT between Sydney and Austin. This was done with the help of the seed nodes facilitating the signaling operation. Atm, there's a bug with both nodes being able to hole punch each other https://github.com/MatrixAI/Polykey/issues/605, but there's on-going progress on that from https://github.com/MatrixAI/Polykey/pull/609 (@tegefaulkes confirm please)."

Yes, I'm nearly done with #609. #605 is planned but I think addressing memory problems with #598 is going to be done first.

"The ICE protocol in each node is asking all the seed nodes concurrently to perform the signaling operation. Although technically it could also just ask 1 out of all of them, with the assumption that all seed nodes are connected. @tegefaulkes can you confirm where in the source code this is decided upon?"

Hole punching is initiated with NodeConnectionManager.initiateHolePunch in src/nodes/NodeConnectionManager.

" @tegefaulkes, can you explain what the difference is between spreading random active connections across the buckets, versus just random active connections across the entire graph. My impression is that they are roughly the same. "

For connectivity we want a random spread of connection across the network. For searching to work with Kademlia, the 'closer' two nodes are, the more likely they should have a connection. But if you only held connections to your 'closest' nodes then the network would be fragmented into clusters. So we need a mix of random connections and close ones.

CMCDragonkai commented 10 months ago

Continuing from the previous commentary, some discussions with @tegefaulkes indicates that:

  1. NCM manages connections and determines whether to idle them out. This basically means we need a way of identifying which connections should be subject to the idle timeout, and which connections are not.
  2. NM is what understands the node graph structure in terms of semantics. When deciding how best to find the signaling node chains to the target node, this would be done here.

Now global connectivity means we need to at least have random connections, but we must have connections to the farthest bucket too.

This implies that random selection by itself shouldn't be enough; we should have policy-directed randomness.

In order to do this, the NM and NG must provide a way to also find the farthest bucket.

So instead of only finding the closest nodes, we also want to find the farthest nodes and a random spread of other nodes to maintain active connections to.

From NG's perspective, finding farthest nodes or selecting random nodes in the node graph will be easy. The NG buckets are just prefix indexed, and we can always select nodes from the farthest bucket.

The seed nodes will receive connections from all over the Node ID space, so they are likely to have a fairly uniform split because they always receive contact first.

For the NCM to get the farthest bucket, it's just a matter of calling:

nodeGraph.getBucket(255);

As a side note, I noticed that the NG has the KeyRing dependency injected. This works because the NG can ask for its own Node ID for certain operations. But it seems a bit tightly coupled atm. We talked about this earlier, but we might consider a push-flow abstraction. The only issue is that push-flow won't be consistent, whereas we might want the whole node to be updated in lock-step when the Node ID changes. I think however a reactive property might work better. Basically the NG only cares about the current node ID and has to react to changes to the current node ID.

There's also another method called NG.getClosestNodes. This allows you to pass in a node ID, and it will then try to find the closest node records to that node ID in the local node graph.

Now the issue is that using nodeGraph.getBucket(255) will just get you the nodes in the farthest bucket relative to the current node's own node ID. If new nodes enter the network and are asking the seed nodes what their farthest nodes are, this won't work; we need something like NG.getClosestNodes, or perhaps an NG.getFarthestNodes.

I'm not sure how that would work in relation to random selections across the network. If we always connect closest and farthest, then why do we need random selection in between?

The first step, at the very least, is to be able to query the NG in more sophisticated ways. Subsequently we need to expose this information over the network. Then I am still confused as to whether kademlia will even help us find the intermediate signaling node necessary. Since closest won't help, how would we know which of the nodes that are closest or farthest away can act as a signaling node?

Maybe another selection strategy is required? If closest causes disconnected clustering, why not always pick the farthest? And use that as our signaling system?

According to https://github.com/MatrixAI/Polykey/pull/372#issuecomment-1097766562, the bottom K strategy did work but wasn't as efficient as random K at maintaining connectivity with minimum number of active connections.

I think there may be a trade-off between the probability of finding a sufficient signaling chain to reach a target node and the number of active connections to be maintained for full connectivity.

Because if everything was connected to everything, then any path may lead me to the target. But if there are fewer active connections, then there are fewer paths to my target, and thus a better way of identifying which path to take is important.

CMCDragonkai commented 10 months ago

Some additional notes.

The kademlia XOR property is subject to triangle inequality.

[images lost; they likely showed the XOR metric and its triangle inequality, i.e. d(A,B) = A XOR B satisfies d(A,B) <= d(A,C) + d(C,B)]

https://stackoverflow.com/a/25756549

This means:

Even if A to B is far away, A to B would be shorter than or equal to the distance of A to C plus C to B. So if C is close to B, then C would also have to be far away from A, because the sum of the 2 distances has to be greater than or equal to A to B.

In other words:

If B is the target node, and C is a close node to B, and B is far away from A, then C is also far away from A, potentially even farther away actually.

CMCDragonkai commented 10 months ago

This triangle inequality property is quite important to giving a geometric intuition to how this thing will work.

If B is the target, then we have to search backwards.

What we are searching for first is C. C is the node that has an active connection to B.

And if we are preferring active connections to be selected from close nodes, then C would relatively be far from A as well.

Upon finding C, we could attempt a connection to that first - because we have to in order to initiate the signal phases.

Then of course in order to connect to C, we have to find C', the intermediate node that we can use to help us connect to C.

Now there's a problem here: none of this means that C is any closer to A, the originating node, because the distance between A and C could be equal to A to B.

This kind of means, that if we keep trying to find C', and C'', each time we may just be finding another node that is just as far away, and potentially farther away in XOR distance to A.

Eventually we would exhaust everything that is close to B, C, C', C''... etc (assuming that the C is always the closest node to B). In fact, if we are asking kademlia what are the closest nodes to B, then it would likely give us a list of C, C', C''... etc.

We're not actually "closer" to getting the node that is close to us.

CMCDragonkai commented 10 months ago

Ok on another note, the procedure is roughly right now:

  1. Find a node ID
  2. Look it up locally and enter it into a priority queue
  3. Each time, ask another node to find the same node ID
  4. This gives us IP + port combinations
  5. We then use ICE to connect, which is hardcoded to contact all seed nodes atm

The modification to ICE is to not simply always contact all seed nodes, but instead to do this more intelligently.

We could use provenance here.

One idea is that whoever knows about this Node ID and was able to tell you about the IP + port can only have done so if they had previously connected to that Node ID, because records do not go into the node graph until a successful connection has been made.

Therefore one makes the observation that if a node does not even have the relevant record in its node graph, then it is impossible for it to be a valid signalling node in the context of hole punching.

Conversely, if a node has a relevant record, it could be a valid signalling node if it still has that active connection to that node.

The reason it could and not will is because:

  1. That node may no longer have a connection to the target Node ID.
  2. That node may have learned about the Node ID through the refresh bucket mechanism - which adds a layer of random gossip to kademlia, which forces nodes to share contacts, however contacts are only retained if a successful connection has been made.

So the 2 reasons are really just 1 reason, and that 1 reason becomes more lopsided over time. That is, the longer a node has been running in the network, the more likely there will be nodes that had former connections to it. (By how much though? The refresh bucket system seems like it wouldn't necessarily always pick a node ID that exists.)

As a side note, the termination condition of the find node mechanism is not verified:

[image: excerpt of the find node termination condition]

It appears that we need to have a few conditions to ensure that we do not end up contacting the entire network. Right now it appears the refresh bucket mechanism would result in every node in the network slowly flooding the entire network by trying to connect to everybody. This needs to be prevented.

CMCDragonkai commented 10 months ago

Provenance may be relevant here: if the NodeGraph kept track of who told it about this NodeId's information prior to connecting to it, this may help trace back to those who would have a persistent connection to the target node.

On the other hand, this may just always lead back our seed nodes, and that isn't particularly great either.

CMCDragonkai commented 10 months ago

To summarise the situation right now.

We have to modify ICE where it selects the node to do the signalling.

At that specific point in time we have a few choices to figure out here:

  1. Trying to find the closest nodes to the target node won't really get you a chain of active connections by itself.
  2. But finding the closest nodes is one of the ways in which you essentially get to know the IP + port of a target node. Remember, at this point we may not even know the IP + port of the target node.

Ok so here's an idea.

Let's make nodes maintain connections to close nodes, but also have random connections to far away nodes.

So the first step is to resolve the Node ID, that means doing a find node operation which leads us to close nodes, that are likely to have active connections to the target node.

Second step which is simultaneous to the first step, is to ask our own active connections (which some are close and some are far), what their active connections are.

So basically this lets us understand who is connected to us, and who they are connected to, as well as who is connected to the target node.

This search process will eventually lead to a bridge point where there is a node that is in the set intersection, and is the missing link that can bridge the 2 networks together.

There is a high probability that all nodes are only separated by 6 degrees.

CMCDragonkai commented 10 months ago

Actually the 6 degrees of separation isn't proved.

We still need to work out the probabilistic longest chain.

CMCDragonkai commented 10 months ago

So at 6 random connections, the small world model says that the number of intermediate hops is log_2(N)/log_2(K). That means at 60,000 nodes, it will take about 6 hops.
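Plugging in the numbers: log_2(60000) / log_2(6) ≈ 15.87 / 2.58 ≈ 6.1, hence roughly 6 hops.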

CMCDragonkai commented 10 months ago

Going to try this implementation and see how it goes:

  1. During the find nodes procedure the return value will now contain both the closest nodes and active connections; this could be a separate list, a separate RPC call, or additional metadata attached to the close nodes.
  2. When you're trying to connect to any particular node, you have to look at your local node graph first anyway, and when doing ICE to them, you have to try a direct connection; if you need to do the signalling you have to rely on your active connections anyway. Without active connections, you would use the seed nodes.
  3. This means you now have a more complex multi-priority queue, where heuristics can lead you to trying different things. You can try nodes that are close via direct connections, via active connections, or you can even try other nodes that are farther away but for which you have signalling available.

No matter what, just connecting to nodes to do the find nodes (which is intended to just resolve the node ID to the IP + port) requires connecting to other nodes anyway. So ICE has more options to try.

If you have 6 active connections, you can ask those connections what their active connections are; basically, who they can help you connect to. You can then make connections to them, and ask them what their active connections are. This is somewhat close to a flooding operation, where you are flooding through the active connection list. Actually, is this a maze traversal problem? Kind of like the micromouse challenge?

It seems the idea is that the find nodes operation could be used along with flood fill of active connections.
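A naive sketch of that flood fill, reusing the hypothetical `rpc` helper and `NodeId` type from the spec sketches above (a real version would need limits to avoid the amplification problem raised earlier):

```ts
// Sketch: breadth-first flood over active connections, looking for a node
// that is actively connected to the target and can therefore signal for us.
async function floodForSignaller(
  target: NodeId,
  start: Array<NodeId>, // our own active connections
  rpc: (node: NodeId) => Promise<Array<NodeId>>,
  maxHops: number = 6, // small-world estimate from the comment above
): Promise<NodeId | undefined> {
  let frontier = [...start];
  const seen = new Set<NodeId>(frontier);
  for (let hop = 0; hop < maxHops; hop++) {
    const next: Array<NodeId> = [];
    for (const node of frontier) {
      const theirConnections = await rpc(node);
      if (theirConnections.includes(target)) return node; // can signal for us
      for (const candidate of theirConnections) {
        if (!seen.has(candidate)) {
          seen.add(candidate);
          next.push(candidate);
        }
      }
    }
    frontier = next;
  }
  return undefined; // no signaller found within maxHops
}
```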

CMCDragonkai commented 9 months ago

@tegefaulkes write up a summary comment about what has been done in #618, and how we may proceed to improve it.

Also we now begin empirical tests.

We need to create some EC2 machines and set up the NAT firewalls for all these machines. We can take code from the NAT simulations in PK CLI, which has been disabled, and re-use it to set up the nftables/iptables across these EC2 machines. We can do it at scale with Pulumi, and start stress testing the network. Observability can be improved with more events/metrics exposed to the dashboards #599.