Tribler / py-ipv8

Python implementation of Tribler's IPv8 p2p-networking layer
GNU Lesser General Public License v3.0
231 stars 47 forks source link

Optimize Network set lookups #935

Closed qstokkink closed 3 years ago

qstokkink commented 3 years ago

We have several instances of lookups by set hash in the Network class. These all look like this example:

https://github.com/Tribler/py-ipv8/blob/e060a37a0ddad5f3c061305bb56ebb107c494340/ipv8/peerdiscovery/network.py#L202-L212

Even though for-loops are fast, we can (and probably should) optimize these lookups. Note that set objects don't have a find() or get() method. I'll document the three fastest lookups I found here.


Each experiment uses 100000 repetitions with a set of 300 elements (all integers in [0, 300)). All code fragments operate of a given set (container), to find an object (search) that has a field (a, search.a = randint(0, 299) for each repetition) to match to the field of the set elements, if no match is found then the result (found) should be None.

for loop (current)

found = None
for o in container:
    if o.a == search.a:
        found = o
        break
Total time1.6288s
Time per search16.288μs
Speedup vs. for-loop1.0000

set operators

found = container - (container ^ {search})
found = found.pop() if found else None
Total time0.7954s
Time per search7.954μs
Speedup vs. for-loop2.0478

Note: {search} & container or container & {search} will not work. Python will give you the element from the smallest set back - our "fake" search object.

map hijack

def find(x):
    if x == search:
        find.out = x
find.out = None
map(find, container)
found = find.out
Total time0.0517s
Time per search0.517μs
Speedup vs. for-loop31.5023

So far, the "map hijack" is the fastest. If you have a faster lookup method, please let me know 😃

kozlovsky commented 3 years ago

If I understand the task correctly, nothing in Python should beat the O(1) speed of a dict lookup:

class Network(object):
    def __init__(self):
        self.verifyed_peers = set()
        self.peers_by_key = dict()
        self.peers_by_address = dict()
        self.peers_by_service = defaultdict(set)
        ...

    def add_verified_peer(self, peer):
        with self.graph_lock:
            ...
            self.verified_peers.add(peer)
            self.peers_by_key[peer.public_key.key_to_bin()] = peer
            for address in peer.addresses:
                self.peers_by_address[address] = peer
            for service_id in peer.services:
                self.peers_by_service[service_id].add(peer)

    def remove_peer(self, peer):
        with self.graph_lock:
            for service_id in peer.services:
                self.peers_by_service[service_id].discard(peer)
            for address in peer.addresses:
                self.peers_by_address.pop(address)
            self.peers_by_key.pop(peer.public_key.key_to_bin())
            self.peers.discard(peer)

    def get_verifyed_by_public_key_bin(self, public_key_bin):
        with self.graph_lock:
            return self.peers_by_key.get(public_key_bin)

    def get_verifyed_by_address(self, address):
        with self.graph_lock:
            return self.peers_by_address.get(address)

    def get_peers_for_service(self, service_id):
        with self.graph_lock:
            return self.peers_by_serivce.get(service_id).copy()
qstokkink commented 3 years ago

@kozlovsky I must disqualify that answer as you as you deviated from the question by changing the set datastructure. But, in the interest of science, I changed the container to a dict. The results may be surprising (clickbait intended).

Here's your report:


dict lookup

Cheating notice: container is now a dict instead of a set. Answer does not qualify for the Snickers bar.

found = container[search]
Total time0.0577s
Time per search0.577μs
Speedup vs. for-loop24.9324

This surprised me. So, I ran this a few times, but the "map hijack" from OP is still faster1 than the dict lookup for the same input.

[1] There is some variance. There are runs where speed is comparable and there are runs when "map hijack" is more than 3 times faster than a dict lookup. Overall, the fastest result for the map was 0.141μs and the fastest for the dict lookup is 0.181μs, the slowest lookup for the map was 0.602μs and the slowest dict lookup was 0.701μs.