The key difference is that our implementation is flat, while k-bucket uses a binary tree structure.
This question/answer explains the differences between kademlia implementations and what kinds of tradeoffs are involved:
Further reading, and the experiment in https://github.com/MatrixAI/js-polykey/pull/326#issuecomment-1065905225, have revealed to me a different bucket indexing order.
So far we've been using bucket 0 as the closest bucket and bucket 255 as the farthest bucket, where bucket 0 would contain only 1 node ID and bucket 255 would contain 50% of all possible node IDs.
However in the kademlia theory, and discussions, the bucket indexing seems to be in reverse instead. This is because of the trie structure used to explain the more advanced kademlia implementations. The trie is also known as a prefix tree, and in this case the prefixes are the bit-prefixes of the node IDs.
Which means the literature says that bucket 0 is actually the farthest bucket, and bucket 255 would be the closest bucket. The bucket index then has a direct relationship with how many bits are shared between all the node IDs in the bucket.
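To make the two conventions concrete, here is a minimal sketch (illustrative only, not the actual NodeGraph code) of computing a bucket index from the number of leading bits a candidate node ID shares with the own node ID:

```ts
// Node IDs are assumed to be fixed-length byte arrays (e.g. 32 bytes = 256 bits).

// Number of leading bits shared between two node IDs.
function sharedPrefixLength(a: Uint8Array, b: Uint8Array): number {
  let bits = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] ^ b[i];
    if (diff === 0) {
      bits += 8;
      continue;
    }
    // Count leading zero bits of the first differing byte.
    for (let bit = 7; bit >= 0; bit--) {
      if ((diff >> bit) & 1) return bits;
      bits++;
    }
  }
  return bits; // identical IDs (the own ID is never bucketed)
}

// Literature convention: bucket index equals the prefix length shared with
// the own node ID, so bucket 0 is the farthest and bucket 255 the closest.
function bucketIndexLiterature(ownId: Uint8Array, nodeId: Uint8Array): number {
  return sharedPrefixLength(ownId, nodeId);
}

// Our current convention is the reverse: bucket 255 is the farthest and
// bucket 0 is the closest (it can only contain 1 possible node ID).
function bucketIndexOurs(ownId: Uint8Array, nodeId: Uint8Array): number {
  const bitLength = ownId.length * 8;
  return bitLength - 1 - sharedPrefixLength(ownId, nodeId);
}
```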
Consider this image of the keyspace as a binary trie, with the 3-bit node IDs at the leaves and the buckets drawn relative to the own node ID 110:
The largest bucket is the first bucket on the left. You can say this is bucket 1 using 1-based indexing. In this bucket, ALL node IDs must share the 1-bit prefix, which is 0.

The first (1st) bucket can have:

000
001
010
011 - not shown above because 011 wasn't participating in the network yet

And you can see that they all share the 0 bit.
The second (2nd) bucket can have:

100
101

And in this case all of them must share the 2-bit prefix 10_.
Finally the last bucket shares all 3 bits:

111

And thus it can only have 1 node ID in it. The own-node ID is not put into a bucket, but it is shown on the graph above to indicate its bitwise/numeric position among all node IDs.
This means the bucket index (1-based indexing) is actually similar to the CIDR notation used in IP subnets. In this case /3 is the last and closest bucket, and therefore allows no variability in the NodeId, whereas in the first case /1 fixes the first bit. One way to notate a bucket is therefore 000/1, 100/2 and finally 111/3. A /0 would mean all possible node IDs.
This all makes sense and it means it should be possible to store k-buckets using a binary prefix tree. And no matter how large the bit size of node IDs is, as long as it is >1, the total number of possible node IDs is always an even number.
The first bit therefore determines the first halving of the keyspace. For the 3-bit node ID, we have 8 possible node IDs, and the first 0 prefix contains underneath itself half of all node IDs.

```
            0
          /   \
        0       1
       / \     / \
      0   1   0   1
     000 001 010 011
```
It also turns out that doing this ensures that buckets always cover contiguous ranges of node IDs. In reality not every node ID in a range will be present, because not all possible node IDs will be participating in the network.
In the wiki page it says:
Nodes that can go in the nth list must have a differing nth bit from the node's ID; the first n-1 bits of the candidate ID must match those of the node's ID.
This only makes sense if nth means 1-based indexing, starting with 1 being the farthest and largest bucket. So the 1st bucket has the 1st bit different, and the n-1 bits are the shared prefix. I still think this is a bit confusing to be honest, because 1 - 1 = 0 could mean 0 bits of shared prefix, or more correctly the shared prefix up to and including the 0th bit. The latter appears to be the case.
For the answer https://stackoverflow.com/a/51161732/582917: when it says the ith bucket, if this is 0-based indexing, then the shared prefix is up to and including the ith bit. If this is 1-based indexing, then the shared prefix is of length i from the start.
For our implementation this doesn't really matter, whether we say that bucket 255 is the largest or bucket 0 is the largest. However if we want to make our impl match the spec, it's probably a good idea to reverse our bucket indexing. But because of lexicographic indexing, that would mean the farthest nodes would be at the top of the LevelDB, whereas right now the closest nodes are at the top. The LevelDB docs say that reverse: true may be slower than the normal order. So implementation-wise we may keep the bucket indexing we already have, instead of following the spec to the letter. This may be moot once we change to a trie-based structure though.
In LevelDB, because keys are stored lexicographically and it's possible to create sublevels arbitrarily, it seems that instead of storing buckets under a lexicographically packed bucket index integer, one could instead just store the bitwise prefix of the NodeId.

In fact, the NodeId prefixes would then be the buckets. One can look up all the node IDs in a particular bucket by calculating the prefix, and then doing a createReadStream with lt and gt set over the prefix range (you would use specific node IDs in the lt and gt range settings).

Because we are talking about bit-level prefixes, sublevels won't come into play here, since sublevels at the moment can only use strings, and even if they took bytes, they would only work byte-by-byte; you couldn't get a sublevel of bit-prefixes, only byte-prefixes.

But lt and gt should work, because the lexicographic order of the node ID bytes is equal to the big-endian Buffer.compare sort.
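As a rough sketch of what that bucket lookup could look like (assuming a levelup-style database whose keys are the raw NodeId buffers; readBucketRange and LevelLike are illustrative names, not actual js-polykey code):

```ts
import type { Readable } from 'stream';

// Minimal structural type for the part of a levelup-style API used here.
type LevelLike = {
  createReadStream(options: { gte: Buffer; lte: Buffer }): Readable;
};

// Read everything stored in one bucket by scanning a lexicographic key range.
// gte/lte are assumed to be the lowest and highest NodeIds that can possibly
// fall within the bucket.
function readBucketRange(
  db: LevelLike,
  gte: Buffer,
  lte: Buffer,
): Promise<Array<{ nodeId: Buffer; value: unknown }>> {
  return new Promise((resolve, reject) => {
    const results: Array<{ nodeId: Buffer; value: unknown }> = [];
    db.createReadStream({ gte, lte })
      .on('data', ({ key, value }: { key: Buffer; value: unknown }) => {
        results.push({ nodeId: key, value });
      })
      .on('error', reject)
      .on('end', () => resolve(results));
  });
}
```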
I'm looking into https://github.com/tristanls/k-bucket and comparing http://www.scs.stanford.edu/~dm/home/papers/kpos.pdf preprint with the longer paper https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf.
The longer paper goes into more detail regarding bucket splitting, and this appears to be what https://github.com/tristanls/k-bucket implements. Reading through the readme, it's clear that it's an in-memory data structure.
It's a growing binary tree, where the tree paths are the bit-prefixes and thus represents a bitwise trie.
It grows because it first starts out with only 1 bucket. Nodes are entered into this bucket until it hits the bucket limit; that bucket limit is still 20. When this occurs, the bucket is split into 2 sub-buckets: one of them is the near bucket and the other is the far bucket. The split is based on the prefix bit difference. The first split works by moving all node IDs with 0 as the first bit into one of the buckets, and all node IDs with 1 as the first bit into the other bucket. Which half is considered far or near depends on the own-node ID. Subsequently, new node IDs may be entered into the far bucket or the near bucket. But the far bucket is no longer split; instead it operates on a recycling policy for old node IDs, and there is a parameter of 3 nodes to be contacted before deciding to remove old node IDs. The near bucket continues receiving new node IDs, but once it hits 20, it splits again. Each time it splits, the bit-prefix it splits on advances, so the second split is on the second bit.
The splits continue until the number of buckets equals the bit-length of the own-node ID.
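A rough sketch of that splitting behaviour (simplified and illustrative, not tristanls/k-bucket's actual code):

```ts
// A leaf bucket holds up to 20 contacts; when the leaf that covers the own
// node ID fills up, it is split on the next bit of the prefix.

type Contact = { id: Uint8Array };

type Leaf = { contacts: Contact[]; dontSplit: boolean };
type Branch = { left: TrieNode; right: TrieNode }; // left = bit 0, right = bit 1
type TrieNode = Leaf | Branch;

// Read the bit at `index` (0 = most significant bit) of a node ID.
function bitAt(id: Uint8Array, index: number): number {
  return (id[index >> 3] >> (7 - (index % 8))) & 1;
}

// Split a full leaf at bit position `depth` into two leaves.
// Only the half that still covers `ownId` may be split again later;
// the far half is frozen and falls back to the ping/evict policy.
function split(leaf: Leaf, depth: number, ownId: Uint8Array): Branch {
  const left: Leaf = { contacts: [], dontSplit: false };
  const right: Leaf = { contacts: [], dontSplit: false };
  for (const contact of leaf.contacts) {
    (bitAt(contact.id, depth) === 0 ? left : right).contacts.push(contact);
  }
  if (bitAt(ownId, depth) === 0) {
    right.dontSplit = true;
  } else {
    left.dontSplit = true;
  }
  return { left, right };
}
```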
This implementation cannot work inside LevelDB, because LevelDB is a flat structure in which we store key-values. However as explained above, the lexicographic encoding seems to imply we can just store node IDs directly and rely on range querying, such as the lt and gt settings, to acquire bucket ranges of node IDs.
The concepts of bucket splitting are even unnecessary for us.
According to the longer paper, it is the bucket whose range covers the own-node ID which gets split. This means the far bucket is the bucket not containing the own-node ID, and the near bucket is the bucket containing the own-node ID.
In the k-bucket trie implementation, the closest method does in fact sort the acquired node IDs by distance at the end: https://github.com/tristanls/k-bucket/blob/3aa5b4f1dacb835752995a25409ab319d2070b9e/index.js#L223-L227
```js
return contacts
  .map(a => [this.distance(a.id, id), a]) // pair each contact with its XOR distance to the target id
  .sort((a, b) => a[0] - b[0])            // sort ascending by distance
  .slice(0, n)                            // keep only the n closest
  .map(a => a[1])                         // drop the distances, returning just the contacts
```
This means even after getting all the nodes in the relevant bucket, it still has to do a distance sort. This is a similar result to what happened in our experiment https://github.com/MatrixAI/js-polykey/issues/212#issuecomment-1065905417, which means we would need to sort by distance at the very end anyway before returning. The trie prefix keys are not enough to determine the actual distance sort.
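For our case, where node IDs are fixed-length buffers too large to XOR as plain numbers, that final sort could look something like this (a minimal sketch; the function names are illustrative):

```ts
// Compare two node IDs by their XOR distance to a target node ID,
// byte by byte (big-endian), without materialising the distance as a number.
function compareDistance(a: Buffer, b: Buffer, target: Buffer): number {
  for (let i = 0; i < target.length; i++) {
    const da = a[i] ^ target[i];
    const db = b[i] ^ target[i];
    if (da !== db) return da - db;
  }
  return 0;
}

// Sort candidate node IDs by closeness to an arbitrary target and take the k closest.
function closestNodeIds(candidates: Buffer[], target: Buffer, k: number): Buffer[] {
  return [...candidates]
    .sort((a, b) => compareDistance(a, b, target))
    .slice(0, k);
}
```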
At the same time we did index lastUpdated, but this would not be suitable for distance, since the closest-node queries are based on any node ID, not just the own node ID, so there is no single distance indexing that can be used.
The performance is not that bad because buckets have fixed sizes even in the trie implementation.
Attempting to implement the trie into the leveldb hits a problem.
The basic idea is that instead of putting the bucket index into the leveldb as a sublevel, we just store the entire NodeId and rely on the prefix lookup to find buckets.
Now to look up all the nodes in the bucket, you have to do a prefix search. This means using the gte and lt settings.

An algorithm needs to turn the own NodeId and the desired bucket index into the gte and lte settings.
Now according to the distance equation:

2^i <= D < 2^(i+1)

This means the smallest distance that could exist for a node in that bucket would be 2^i.

And because XOR is its own inverse, you can do:

own NodeId XOR 2^i = the NodeId that has the lowest distance in the bucket

You would then take the top |i - 256| bits of that result (the shared prefix plus the flipped bit) and pad the remaining bits with 0 to get the node ID at the lowest lexicographic point of the bucket's range. This would give us the GTE NodeId.

For the LTE NodeId, you do the same with the maximum distance 2^(i+1) - 1, which means padding with 1 bits instead.
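To make that concrete, here is a sketch of the bounds calculation (illustrative only, assuming NodeIds are fixed-length big-endian Buffers, e.g. 32 bytes for 256 bits; bucketRange is a hypothetical name, not the NodeGraph implementation):

```ts
// Turn (own NodeId, bucket index i) into the gte/lte NodeIds for the bucket
// where 2^i <= distance < 2^(i+1).
function bucketRange(
  ownNodeId: Buffer,
  i: number,
): { gte: Buffer; lte: Buffer } {
  const bitLength = ownNodeId.length * 8;
  // Position of the differing bit, counted from the most significant bit.
  const flipBit = bitLength - 1 - i;
  const byteIndex = flipBit >> 3;
  const bitMask = 1 << (7 - (flipBit % 8));
  // Start from the own NodeId and flip the differing bit
  // (equivalent to own NodeId XOR 2^i).
  const gte = Buffer.from(ownNodeId);
  gte[byteIndex] ^= bitMask;
  const lte = Buffer.from(gte);
  // Pad everything after the flipped bit: 0s for the lower bound,
  // 1s for the upper bound (max distance 2^(i+1) - 1).
  const lowMask = bitMask - 1; // bits below the flipped bit within its byte
  gte[byteIndex] &= ~lowMask & 0xff;
  lte[byteIndex] |= lowMask;
  // Remaining whole bytes after the flipped bit's byte.
  for (let j = byteIndex + 1; j < ownNodeId.length; j++) {
    gte[j] = 0x00;
    lte[j] = 0xff;
  }
  return { gte, lte };
}
```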
Now this is getting quite complicated, and it seems not worth it. The existing implementation, which just stores the bucket indexes directly in the sublevel key, seems much simpler. This system wouldn't change the fact that we still need the index sublevel to index by the last updated time; it just makes the resetBuckets function simpler, because you don't have to remap the bucket indexes for the buckets DB. But because the index DB still needs the bucket index, not much complexity is saved.
In any case this exercise was interesting for understanding the structure.
And therefore this issue can actually be closed, because we have no need of doing any kind of bucket splitting, since we don't really have a tree structure at all.
What we learned here is that we do need to sort by distance at the very end. Neither flat-list nor trie indexing ensures that NodeIds can be fetched already ordered by distance, and it couldn't work even with a SQL database, because distance is not a fixed property: it depends on what the "source node ID" is, and our source node ID may be different depending on who's asking.
This means for #212 @tegefaulkes, you must sort the buckets every time you are acquiring all the nodes in them, before buffering them into the closest list. In k-bucket, they just do a full concatenation and then a slice/truncation at each step. But it can be done a bit more efficiently I reckon.
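Something along these lines (illustrative only; getClosestNodes, getBucket and startBucket are hypothetical names, not the js-polykey API):

```ts
// Gather candidates from buckets, scanning outward from the bucket the target
// would fall into, then sort by XOR distance to the target at the very end.
function getClosestNodes(
  targetNodeId: Buffer,
  k: number,
  numBuckets: number,
  startBucket: number,
  getBucket: (i: number) => Buffer[],
): Buffer[] {
  const candidates: Buffer[] = [];
  candidates.push(...getBucket(startBucket));
  for (let offset = 1; candidates.length < k && offset < numBuckets; offset++) {
    if (startBucket - offset >= 0) candidates.push(...getBucket(startBucket - offset));
    if (startBucket + offset < numBuckets) candidates.push(...getBucket(startBucket + offset));
  }
  // Buckets alone do not give distance order; the final sort is mandatory.
  const byDistanceToTarget = (a: Buffer, b: Buffer): number => {
    for (let i = 0; i < targetNodeId.length; i++) {
      const da = a[i] ^ targetNodeId[i];
      const db = b[i] ^ targetNodeId[i];
      if (da !== db) return da - db;
    }
    return 0;
  };
  return candidates.sort(byDistanceToTarget).slice(0, k);
}
```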
I think your conclusion about bucket 0 being the farthest is incorrect. Note that in the diagram the numbers at the bottom are the IDs of each node, not the distance. Also note that the diagram shows bucketing relative to the ID 110. So the smallest bucket, bucket 0, contains the node 111, which has the distance 001, which aligns with our current implementation. Note that the largest bucket on the left contains IDs that have a distance of 1XX.
Quote reply didn't work properly. I'm referring to this comment. https://github.com/MatrixAI/js-polykey/issues/158#issuecomment-1066035434
Doesn't impact the implementation as is. I concluded that the trie structure is irrelevant to us, and we have no need of the more advanced kademlia details. We can proceed with our NodeGraph design as it is in the testnetprep branch.
Since this is irrelevant now I think we should close this issue and/or remove it from #378.
It can be auto-closed when #378 merges.
This was part of our package.json - https://github.com/tristanls/k-bucket.
It may not be used due to our kademlia being persistent. But it may be a valuable source for understanding how to make our own kademlia implementation more performant.
For now I'm removing k-bucket from package.json.