ethersphere / swarm

Swarm: Censorship resistant storage and communication infrastructure for a truly sovereign digital society
https://swarm.ethereum.org/
GNU Lesser General Public License v3.0

Kademlia Upgrade #1535

Open nonsense opened 5 years ago

nonsense commented 5 years ago

I think we should discuss how and when we are going to tackle a refactor of Kademlia. We already have more than five known issues that we want to address and that the current implementation does not support:

  1. Kademlia suggests peers when a node retrieves remote chunks without knowing how utilised those peers are, which results in suboptimal recommendations when we have a set of N peers equally distant from a given chunk - https://github.com/ethersphere/swarm/issues/1533

  2. The number of connections per bin is not adequate - sometimes it is 2, sometimes it is 20 - we need a more deterministic way to build up a Kademlia table - https://github.com/ethersphere/swarm/issues/1436

  3. Currently a node pull syncs with all peers in a given bin. With push sync we might want to disable syncing of all bins < depth (or not?) and only sync with the peers within our depth (>= depth). If we decide to keep pull sync on lower bins (0, 1, etc. < depth), we should definitely not sync with all our peers in bin 0, only with a few. Basically, there needs to be a more explicit distinction between peers: some should be available for retrieve requests and some for syncing. Right now the Kademlia implementation has a single container of connections, so we should think about how we want to design this.

  4. Light nodes - they need to have connections with other peers and a Kademlia table so that they can issue retrieve requests properly, but ideally they should not appear in the Kademlia table of full nodes, as we don't want Kademlia to suggest them for syncing or for other capabilities that they don't have. However, it makes sense for light nodes to share their view of the network with full nodes, so there seems to be a benefit in them partially running the hive protocol?

  5. Kademlia connectivity state saving and restoring should be more deterministic - https://github.com/ethersphere/swarm/issues/1396. If we restart our node, we should prefer nodes that we were recently connected to, so that we don't incur syncing costs. (FYI, our smoke tests suffer from this: if you just restart a deployment, nodes connect to new peers and start historic syncing.)

  6. Visibility over Kademlia (some connections and known peers are hidden) and over the usage of peers can be improved - https://github.com/ethersphere/swarm/issues/1403 - currently we don't have a good dashboard à la torrent client, where we can see how many chunks we have sent to or received from a peer and how many are in flight. It would be nice to have this so that we can increase the throughput of Swarm in general.

  7. Move loading and storing of Kademlia known and connected peers outside of the hive protocol?
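To make item 1 concrete, here is a minimal sketch of utilisation-aware peer suggestion. It is not the Swarm implementation: the `Peer` type, the `InFlight` counter and the `suggestPeer` function are hypothetical names, and it assumes that peers sharing the same proximity order (number of leading bits in common with the chunk address) count as "equally distant".

```go
package main

import "math/bits"

// Peer is a hypothetical stand-in for a connected peer.
// InFlight is an assumed utilisation metric: requests currently outstanding.
type Peer struct {
	Addr     []byte
	InFlight int
}

// proximity returns the number of leading bits that a and b have in common.
func proximity(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if x := a[i] ^ b[i]; x != 0 {
			return i*8 + bits.LeadingZeros8(x)
		}
	}
	return n * 8
}

// suggestPeer returns the peer with the highest proximity to the chunk.
// Ties are broken in favour of the peer with the fewest in-flight requests.
// The boolean is false when no peers are available.
func suggestPeer(chunk []byte, peers []Peer) (Peer, bool) {
	best, bestPO := -1, -1
	for i, p := range peers {
		po := proximity(chunk, p.Addr)
		if po > bestPO || (po == bestPO && p.InFlight < peers[best].InFlight) {
			best, bestPO = i, po
		}
	}
	if best < 0 {
		return Peer{}, false
	}
	return peers[best], true
}
```

With this tie-break, two peers in the same bin relative to a chunk are no longer interchangeable: the less loaded one is suggested first. A real implementation would also need to keep `InFlight` up to date as requests complete or time out.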

I suggest we discuss these soon and decide how and when to tackle them.

nolash commented 5 years ago

Pertaining to item 1: https://github.com/ethersphere/SWIPs/pull/28

nolash commented 5 years ago

3 ... If we still decide to keep pull sync on lower bins (0, 1, etc. < depth), we should definitely not sync with all our peers within bin 0, but only a few.

@nonsense would it be an idea to keep the current saturation metrics for syncing? That is, pull sync with 2 peers in each bin.

Whom to select (or replace) should be up to a component dedicated to analyzing connection stability. We should probably rely on the claim in the original Kademlia paper that the peers that have been connected the longest also tend to be more stable in the future.

Since Push Sync seems to have certain weaknesses in delivery guarantees, I believe keeping pull sync provides important redundancy.

nolash commented 4 years ago

Update 25.10.2019:

  1. Resolved by https://github.com/ethersphere/swarm/pull/1774
  2. Partially addressed by https://github.com/ethersphere/swarm/pull/1833 and https://github.com/ethersphere/swarm/pull/1869
  3. Not yet addressed
  4. Not yet addressed
  5. Resolved by https://github.com/ethersphere/swarm/pull/1844
  6. Not yet addressed
  7. Not yet addressed