An alternative algorithm for calculating torrent popularity

Right now, we have an implemented algorithm for gossiping torrent popularity and want to:

evaluate its results;
get an idea of the statistics of Tribler usage "in the wild" based on these results.

In my opinion, it is not an easy task to do. We use this algorithm as a telescope to look at the world around us through it and have no idea if this telescope is working or if it is broken:

The algorithm may have conceptual problems, for example, the slow speed of information propagation, due to which the results of its operation may not show the real state of the network;
Also, the implementation of the algorithm may contain bugs that distort the results.

As a possible solution, I propose to implement an alternative algorithm that uses a different strategy, and then compare the results of two algorithms.

it is not about the new algorithm being better. The point is that by gaining popularity in two independent ways, we can see a more objective picture. As @synctext said, the researcher's community have worked on these ideas for tens of years. So it is unlikely I can come up with a better algorithm, which is robust, scalable, bias-free, and also protected from possible attacks. But I just want to have a second telescope to estimate the results of the first one.

In the following explanation, I will call the current algorithm as PC1 (Popularity Community 1), and the new proposed algorithm as PC2 (Popularity Community 2).

First, a few words about the current algorithm, PC1. Looking into its implementation, I noticed the following:

PC1 relies heavily on the seeders count received from torrent trackers. In my understanding, we want Tribler to be independent of any external services and self-sufficient, so it may be wrong to build popularity results on the info received from trackers as the primary source of data.
The second source of information is the BEP33 data. It is better, but it is the result of some external algorithm, implemented by a libtorrent plugin. We probably don't know how reliable the data is, and so cannot use it as the main source of data as well.
The third source of data is the results obtained from DHT. If I understand correctly, it may report some subset of seeders and not the full count.

And now, to my main observation: the current algorithm cannot combine data obtained from the different sources. If we have two trackers specified for the same torrent, and these trackers return the different number of seeders (like 1000 and 1500), we cannot sum these numbers up, as some seeders may connect to both torrents, and we will count them twice. The same is with the results from other sources.

So, the current algorithm, when getting results from the different data sources, just takes max of the received numbers. And when these number has gossiped to another peer, it does not combine with the previous data that peer has but replace them entirely.

The alternative algorithm I want to propose works in a totally different way:

it counts the real number of unique torrent's seeders to whom we (or some other node) were able to connect to. This number is not a total swarm size but represents real peers and, in that sense, looks more reliable.
it can combine unique seeders connected to different peers. So, when combining these numbers, we can estimate the total swarm size pretty well.
when all Tribler nodes for any reason hesitate to connect to obtain real seeders for some torrent, this torrent will not appear in the popularity result list (which is probably a good thing?), even if the torrent's tracker reports some number of seeders.
using the gathered data, it is possible to count a total number of peers that seed at least one torrent and use this number as an estimate of the overall system liveness.
the algorithm has inherent protection from one type of attack: for some torrent, an attacker can't propagate zero number of seeders, as the numbers of unique seeders obtained from different nodes sum up, and it is not possible to decrease that counter's value.

To implement this algorithm, we can use the HyperLogLog counter of seeders for each infohash (Initially, I wanted to use bloom filters, but @ichorid suggested using HyperLogLog counters, and I really liked that idea). The HyperLogLog counter has the following characteristics:

it can count distinct values. If we put the same hashed seeder id into it several times, the counter's value will not increase;
while HyperLogLog does not keep the exact number of unique elements, it can calculate a probabilistic approximation of the total element's count with high precision;
it is possible to merge two HyperLogLog counters, and so we can combine the number of unique seeders obtained from different nodes, as well as the total number of unique seeders across all torrents;
it is pretty compact (much smaller than a bloom filter with a similar capacity and precision).

Now to the (simplified) description of the PC2 algorithm:

Each Tribler node with PC2 community stores in memory the dictionary of the following form: {infohash: seeders_counter}, where seeders_counter is the HyperLogLog counter of unique seeders for that infohash value. In this dictionary, we store only infohashes for which the numbers of seeders is greater than zero, so no dead torrents here;
Also, there is a helper structure to find random infohash from this dictionary efficiently.
When our node has connections to some torrent seeders, it regularly put seeder's hashes into the HyperLogLog counter of that torrent. As only distinct values are counted, the counter value will grow only when new seeders appear.
At regular intervals, the algorithm selects two random infohashes from the dictionary and decides which counter value to propagate to the neighbor peers. In most cases (like 90%), it chooses a counter whose estimated value is bigger, but with some small probability (10%), it selects another counter. After that, the node sends this pair of (infohash, serialized_seeder_counter) to one random peer.
the peer, after receiving this data, merges the received counter with the previous counter it had for the specified infohash value. Or, in case of a new infohash, creates a new entry with received HyperLogLog counter value.

Following this algorithm, eventually, each Tribler node will contain a total count of unique seeders for each live torrent. But, as a HyperLogLog counter cannot decrease, we need to clear the obsolete information somehow. For doing this, we can split data into time buckets (for example, 1-hour intervals), and gather information for each time interval separately. In this case, the data structure will look as {time_interval: {infohash: seeders_counter}}, and the packet transferred to peers will contain time_interval as well. For the user in the GUI, we can display an interpolated value between previous and current time intervals.

When the node shuts down for an extended interval (say, for a week), we want to display the popularity information to the user after he relaunches the node. For doing this, before shutting the system down, we can store estimated seeders count into the TorrentState entity and use it as a previous value for interpolation when the node launches again.

Now about possible attacks. There are two types of attacks that look especially relevant here. The first is the attack when some node wants to suppress the spreading information about some popular torrent. As I already said, if there is at least one uncompromised path between good nodes, the evil node will not be able to suppress information spread, as the same information will be translated through a different path.

The second attack, when some node wants to inflate seeders count for some torrent, is harder to prevent using the proposed algorithm (but it is a more seldom type of attack, as it seems). It is possible to "pollute" counter with a big number of random hashes, to give the impression that there are a lot of seeders. As a counter-measure, it is possible to add the second HyperLogLog counter, which counts "disappointment" of the node, which attempts to connect to a supposedly popular torrent only to see the low number of seeders. Then we can have some heuristic about what to do with suspicious torrents.

But as I said, right now, my proposal is to have the second popularity algorithm, PC2, just to estimate the data obtained from the current algorithm PC1. It is not necessary to display PC2 results to end-users, so at this moment, nobody will have an incentive to attack it.

So what do you think, guys, is implementing an alternative algorithm a good way to understand the results we have from the current algorithm?

Experimentally evaluating different ideas, I fully support.

One note on the networking side though: we use a single socket for all packets. Speeding up the information propagation in any network overlay is almost trivial by simply sending more messages. However, the more packets you send and receive in one network overlay, the slower the others become. Practically speaking, the more popularity information you send, the slower your anonymous downloads will be. Therefore, my tip: look carefully at your message complexity and throughput.

Looks good. It will be a pain in the :butterfly: to integrate the data from all three types of sources (DHT, tracker, BEP33) because of the asynchronous concurrency hell. See Happy Eyeballs and Trio presentation.

I like the idea of implementing two (or more) alternative algorithms and comparing them later.

I see Tribler as the experimental network, so that means no problems with traffic sharing among "concurent" components.

@kozlovsky 👍

We use this algorithm as a telescope to look at the world around us through it and have no idea if this telescope is working or if it is broken

Great comment and solid in-depth analysis. My speed-reading yesterday did surely not do it justice, sorry for that. You have broken the record of swapping in our algorithms, normally it takes numerous weeks before new developers to the team grasp our complexities.

Everything we do and decide should be driven by data captured from our live network, our "Data-Driven Mehtodology". As proposed yesterday, lets spend 1 week (for now) understanding the quality of our measurement data (tracker,BEP33,DHT). Could be a classical “garbage in, garbage out” problem. We need to establish the ground truth by joining swarms with Libtorrent. Also, we made a design decision to abstract away multiple source and even the source of information itself. That could have been a mistake.

My advise would be probably make a non-compatible PC2 where the source info and every tracker is explicitly communicated. But lets first measure to see whats happening, OK?

We approached this too informally and never specified the requirements for PopularityCommunity. Something like for first versions. For PC5 we could go with 98% accuracy for 100k of the largest swarms.

think about attacks, but trustchain and reputation functions are not ready yet for integration
Information coverage: capture 98% of all popular swarms (beyond 100 seeders)
Information quality:
- swarm size can be outdated and 2 weeks old
- end-user only cares if this is going to be a decent speedy download or "shi*ty" slow
- 0 seeders, dead swarm, show to user as :skull: :skull_and_crossbones: :skull: swarm
- 1 or 2 seeders, zombie swarm, half dead :zombie:
- 3 or 5 seeders; this is going to take a day at least :sun_behind_rain_cloud:
- 6 or 20 seeders; expect medium download speed :sun_behind_small_cloud:
- 20 or more seeders, top download speeds :sun_with_face:

With 2 improvement iterations (could be 2 x 2 weeks for PC2 and PC3) we hopefully have solid confidence in our deployed code. After that we can make optimizations like HyperLogLog and the 1984 Flajolet–Martin algorithm for PC4. Lets not do premature optimisations and first fix the obvious improvements/mistakes in PC1. So please think what you want for PC2, after measuring and comparing gossip data to ground truth (e.g. joining swarm). We could deploy that in a new Tribler release in 2 weeks.

Your proposed design is very suitable for PC4!

This looks like a good plan. So, as the first step, we can do the following experiment:

gather popularity community data during some specified time (say, an hour).
For the next hour, check a random sample of the gathered torrents with an apparent number of seeders greater than zero, and gather the actual numbers returned from the torrent trackers, BEP33, and DHT queries
for each of three data sources, make a plot, where each torrent is displayed as a point on the graph where X is the swarm size from the popularity community, and Y is the swarm size checked locally.

This way, it should be possible to see how data from the popularity community corresponds with actual data from the network. We will be able to see if the speed of information spreading through the current version of the popularity community is good enough, or the data received from it are mostly obsolete. We would be able to see if the checked torrent is actually in the same category (☠️, 🧟, 🌦️, 🌤️, 🌞) as it was gossiped to us by the popularity community.

But these experiments will not show to us: are there any torrents that are actually very popular but don't spread through the popularity community for some reason.

In PC2, we can keep separate values gathered from the different data sources and provide some way to remotely influence the frequency of the information transmission and weights, which affect the probability with which the algorithm decides which torrent info should be transferred to peers. This way, it would be possible to experiment with different algorithm parameters without releasing a new version of Tribler each time a new experiment is performed.

Indeed, this is a very good idea to go "bottom-up" and check individual stats received by different methods, and aggregate the data later. It is much better aligned with the decentralization philosophy. Also, in this way we can truly parallelize data gathering.

Regarding the "very popluar but not shown" torrents, these are "unknown unknowns" and I would not worry about them. Catching these will require passive listening to DHT traffic, akin to what some "download-revealing" websites do.

For PC2, I don't think we should be able to influence how other nodes calculate their PC stats, because it is too dangerous and easy to exploit. Instead, we can implement something like I did with RemoteQueryCommunity: a generic PC-related query to get popularity info from users' DBs. This will allow us to crawl the network, get the base stats, and then model different ways to calculate the popularity rating locally, in the lab.

This suggestion has led to the creation of #7586 and further discussion (if needed) can take place there.

Tribler / tribler

An alternative algorithm for calculating torrent popularity #5589