Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

An alternative algorithm for calculating torrent popularity #5589

Closed: kozlovsky closed this issue 2 months ago

kozlovsky commented 4 years ago

Right now, we have an implemented algorithm for gossiping torrent popularity, and we want to understand how well it actually works.

In my opinion, this is not an easy task. We use this algorithm as a telescope to look at the world around us, yet we have no idea whether this telescope is working or broken.

As a possible solution, I propose implementing an alternative algorithm that uses a different strategy, and then comparing the results of the two algorithms.

In the following explanation, I will call the current algorithm PC1 (Popularity Community 1) and the proposed new algorithm PC2 (Popularity Community 2).

First, a few words about the current algorithm, PC1. Looking into its implementation, I noticed the following:

And now, to my main observation: the current algorithm cannot combine data obtained from different sources. If two trackers are specified for the same torrent and they return different numbers of seeders (say, 1000 and 1500), we cannot simply sum these numbers, as some seeders may be connected to both trackers and would be counted twice. The same applies to results from the other sources.

So, when the current algorithm gets results from different data sources, it just takes the maximum of the received numbers. And when this number is gossiped to another peer, it is not combined with the data that peer already has, but replaces it entirely.
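To make this combine rule concrete, here is a one-function sketch of it (the function name is mine, not PC1's actual code):

```python
def pc1_combine(counts_from_sources: list[int]) -> int:
    # PC1 cannot deduplicate seeders across sources, so summing the values
    # could double-count a seeder seen by several trackers. It therefore
    # keeps only the largest reported value.
    return max(counts_from_sources)

# Two trackers report 1000 and 1500 seeders for the same torrent;
# PC1 reports max(1000, 1500) = 1500, never 2500.
assert pc1_combine([1000, 1500]) == 1500
```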

The alternative algorithm I want to propose works in a totally different way.

To implement this algorithm, we can use a HyperLogLog counter of seeders for each infohash (initially, I wanted to use Bloom filters, but @ichorid suggested using HyperLogLog counters, and I really liked that idea). A HyperLogLog counter has the following characteristics:

* it estimates the number of unique elements in a set using a small, fixed amount of memory;
* two counters can be merged into a counter of the union of the two sets, so a seeder reported by several sources is still counted only once;
* it can only grow and never decreases;
* the estimation error is a few percent and depends on the counter size.
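To make the merge property concrete, here is a minimal, self-contained HyperLogLog sketch in Python (illustrative only; the class and method names are mine, not Tribler's):

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog counter (an illustrative sketch, not Tribler code)."""

    def __init__(self, p: int = 12):
        self.p = p                # 2**p registers; relative error ~ 1.04 / sqrt(2**p)
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: bytes) -> None:
        x = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")
        idx = x >> (64 - self.p)                       # first p bits select a register
        rest = x & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Element-wise max of registers yields the counter of the set union,
        # so a seeder observed via several sources is still counted once.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:         # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)

a, b = HyperLogLog(), HyperLogLog()
for peer in (b"peer1", b"peer2", b"peer3"):
    a.add(peer)
for peer in (b"peer2", b"peer3", b"peer4"):
    b.add(peer)
a.merge(b)
print(a.count())  # ~4 unique seeders: the two peers seen by both are not double-counted
```

(Bias corrections beyond the small-range one are omitted here for brevity.)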

Now to the (simplified) description of the PC2 algorithm: each node keeps a HyperLogLog counter per infohash and adds every seeder it observes (from trackers, DHT, and BEP33) to the corresponding counter; it periodically sends its counters to other peers, and every received counter is merged into the local one.

Following this algorithm, eventually each Tribler node will contain a total count of unique seeders for each live torrent. But, as a HyperLogLog counter cannot decrease, we need some way to clear out obsolete information. To do this, we can split the data into time buckets (for example, 1-hour intervals) and gather the information for each time interval separately. In this case, the data structure will look like {time_interval: {infohash: seeders_counter}}, and the packet transferred to peers will contain the time_interval as well. In the GUI, we can display to the user an interpolated value between the previous and current time intervals.
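A minimal sketch of this bucketed structure, the gossip merge, and the interpolated GUI value, reusing the HyperLogLog class sketched above (the function names are mine, not an actual API):

```python
import time
from collections import defaultdict

BUCKET_SECONDS = 3600  # 1-hour time buckets, as proposed above

# {time_interval: {infohash: seeders_counter}}
counters: dict[int, dict[bytes, HyperLogLog]] = defaultdict(dict)

def current_bucket(now: float | None = None) -> int:
    return int((time.time() if now is None else now) // BUCKET_SECONDS)

def observe_seeder(infohash: bytes, peer_id: bytes) -> None:
    # Record a locally observed seeder into the current bucket.
    bucket = counters[current_bucket()]
    bucket.setdefault(infohash, HyperLogLog()).add(peer_id)

def on_gossip(time_interval: int, infohash: bytes, remote: HyperLogLog) -> None:
    # Merge a counter received from a peer into our bucket for the same
    # interval; merging never double-counts seeders seen by both nodes.
    bucket = counters[time_interval]
    bucket.setdefault(infohash, HyperLogLog()).merge(remote)

def displayed_seeders(infohash: bytes, now: float | None = None) -> float:
    # GUI value: interpolate between the previous (complete) interval and
    # the current (still-filling) one, weighted by elapsed bucket time.
    now = time.time() if now is None else now
    cur = current_bucket(now)
    frac = (now % BUCKET_SECONDS) / BUCKET_SECONDS
    prev_count = counters[cur - 1].get(infohash, HyperLogLog()).count()
    cur_count = counters[cur].get(infohash, HyperLogLog()).count()
    return (1 - frac) * prev_count + frac * cur_count
```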

When the node shuts down for an extended interval (say, for a week), we still want to display popularity information to the user after the node is relaunched. To do this, before shutting the system down, we can store the estimated seeder count in the TorrentState entity and use it as the previous value for interpolation when the node launches again.
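A short sketch of that shutdown hook, building on the bucketed structure above (the store_torrent_state call is hypothetical, standing in for whatever the TorrentState entity actually exposes):

```python
def on_shutdown(db) -> None:
    # Persist the latest estimates so they can seed the interpolation
    # as the "previous value" after the node restarts.
    for infohash, counter in counters[current_bucket()].items():
        # hypothetical persistence call, not Tribler's actual API
        db.store_torrent_state(infohash, estimated_seeders=counter.count())
```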

Now about possible attacks. Two types of attacks look especially relevant here. The first is an attack in which a node tries to suppress the spread of information about a popular torrent. As I already said, as long as there is at least one uncompromised path between good nodes, an evil node cannot suppress the spread: the same information will simply travel along a different path.

The second attack, in which a node tries to inflate the seeder count for some torrent, is harder to prevent with the proposed algorithm (though it seems to be a rarer type of attack). It is possible to "pollute" a counter with a large number of random hashes to give the impression that there are a lot of seeders. As a countermeasure, it is possible to add a second HyperLogLog counter that counts the "disappointment" of nodes that attempt to connect to a supposedly popular torrent only to find a low number of seeders. Then we can apply some heuristic to decide what to do with suspicious torrents.
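A hedged sketch of such a counter pair and one possible heuristic (the class name and the threshold are invented for illustration):

```python
class TorrentPopularity:
    def __init__(self):
        self.seeders = HyperLogLog()         # observed/gossiped unique seeders
        self.disappointment = HyperLogLog()  # nodes that tried the torrent and found few seeders

    def report_disappointment(self, node_id: bytes) -> None:
        self.disappointment.add(node_id)

    def is_suspicious(self, threshold: float = 0.5) -> bool:
        # Heuristic: if a large fraction of the claimed audience reports
        # disappointment, the seeder counter was probably polluted.
        seeders = self.seeders.count()
        return seeders > 0 and self.disappointment.count() / seeders > threshold
```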

But as I said, right now my proposal is to use the second popularity algorithm, PC2, only to evaluate the data obtained from the current algorithm, PC1. It is not necessary to display PC2 results to end users, so for the moment nobody will have an incentive to attack it.

So what do you think, guys: is implementing an alternative algorithm a good way to understand the results we get from the current algorithm?

qstokkink commented 4 years ago

I fully support experimentally evaluating different ideas.

One note on the networking side though: we use a single socket for all packets. Speeding up the information propagation in any network overlay is almost trivial by simply sending more messages. However, the more packets you send and receive in one network overlay, the slower the others become. Practically speaking, the more popularity information you send, the slower your anonymous downloads will be. Therefore, my tip: look carefully at your message complexity and throughput.

ichorid commented 4 years ago

Looks good. It will be a pain in the :butterfly: to integrate the data from all three types of sources (DHT, tracker, BEP33) because of the asynchronous concurrency hell. See Happy Eyeballs and the Trio presentation.

drew2a commented 4 years ago

I like the idea of implementing two (or more) alternative algorithms and comparing them later.

I see Tribler as an experimental network, which means there should be no problem with traffic sharing among "concurrent" components.

@kozlovsky 👍

synctext commented 4 years ago

We use this algorithm as a telescope to look at the world around us through it and have no idea if this telescope is working or if it is broken

Great comment and solid in-depth analysis. My speed-reading yesterday surely did not do it justice, sorry for that. You have broken the record for swapping out one of our algorithms; it normally takes new developers on the team numerous weeks to grasp our complexities.

Everything we do and decide should be driven by data captured from our live network, our "Data-Driven Methodology". As proposed yesterday, let's spend 1 week (for now) understanding the quality of our measurement data (tracker, BEP33, DHT). This could be a classical "garbage in, garbage out" problem. We need to establish the ground truth by joining swarms with Libtorrent. Also, we made a design decision to abstract away the multiple sources, and even the origin of the information itself. That could have been a mistake.

My advice would probably be to make a non-compatible PC2 in which the source info and every tracker are explicitly communicated. But let's first measure to see what's happening, OK?

We approached this too informally and never specified the requirements for PopularityCommunity; we should have something like that for the first versions. For PC5, we could go with 98% accuracy for the 100k largest swarms.

With 2 improvement iterations (could be 2 x 2 weeks for PC2 and PC3) we will hopefully have solid confidence in our deployed code. After that, we can make optimizations like HyperLogLog and the 1985 Flajolet–Martin algorithm for PC4. Let's not do premature optimization and first fix the obvious improvements/mistakes in PC1. So please think about what you want for PC2 after measuring and comparing gossip data to the ground truth (e.g. joining the swarm). We could deploy that in a new Tribler release in 2 weeks.

Your proposed design is very suitable for PC4!

kozlovsky commented 4 years ago

This looks like a good plan. So, as the first step, we can do the following experiment:

  1. Gather popularity community data during some specified time (say, an hour).
  2. For the next hour, check a random sample of the gathered torrents with an apparent number of seeders greater than zero, and record the actual numbers returned by the torrent trackers, BEP33, and DHT queries.
  3. For each of the three data sources, make a plot in which each torrent is displayed as a point: X is the swarm size from the popularity community, and Y is the swarm size checked locally.

This way, it should be possible to see how the data from the popularity community corresponds to the actual data from the network. We will be able to see whether information spreads through the current version of the popularity community fast enough, or whether the data received from it is mostly obsolete. We will also be able to see whether a checked torrent is actually in the same category (☠️, 🧟, 🌦️, 🌤️, 🌞) as was gossiped to us by the popularity community.
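As an illustration of step 3, a sketch of the per-source scatter plots with matplotlib (the numbers are made-up placeholders, not real measurements):

```python
import matplotlib.pyplot as plt

# Hypothetical measurements: per data source, a list of
# (swarm size gossiped by the popularity community, swarm size checked locally).
results = {
    "tracker": [(1500, 1400), (40, 35), (10, 2)],
    "DHT":     [(1200, 900), (50, 60), (8, 1)],
    "BEP33":   [(1000, 300), (30, 25), (5, 4)],
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
for ax, (source, points) in zip(axes, results.items()):
    xs, ys = zip(*points)
    ax.scatter(xs, ys)
    lim = max(max(xs), max(ys))
    ax.plot([0, lim], [0, lim], "--")  # diagonal: gossiped value == checked value
    ax.set_title(source)
    ax.set_xlabel("swarm size from popularity community")
    ax.set_ylabel("swarm size checked locally")
fig.tight_layout()
plt.show()
```

Points close to the diagonal would indicate that gossiped swarm sizes match locally checked ones; points far below it would indicate obsolete or inflated gossip.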

But these experiments will not show us whether there are torrents that are actually very popular but, for some reason, do not spread through the popularity community.

In PC2, we can keep separate values gathered from the different data sources and provide some way to remotely adjust the frequency of information transmission and the weights that affect the probability with which the algorithm decides which torrent info to transfer to peers. This way, it would be possible to experiment with different algorithm parameters without releasing a new version of Tribler for each new experiment.
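For instance, the remotely tunable part could be as small as this (a sketch; all names and defaults are invented for illustration):

```python
from dataclasses import dataclass
import random

@dataclass
class TorrentInfo:
    infohash: bytes
    seeders: int
    freshness: float  # e.g. 1.0 for just-checked, decaying towards 0

@dataclass
class GossipParams:
    # These values could be updated remotely during an experiment.
    send_interval: float = 5.0     # seconds between gossip messages
    weight_seeders: float = 1.0    # favour torrents with many seeders
    weight_freshness: float = 1.0  # favour recently checked torrents

def pick_torrent_to_gossip(torrents: list[TorrentInfo], params: GossipParams) -> TorrentInfo:
    # Weighted random choice: the weights set the probability with which
    # each torrent's info is selected for transmission to peers.
    weights = [params.weight_seeders * t.seeders + params.weight_freshness * t.freshness
               for t in torrents]
    return random.choices(torrents, weights=weights, k=1)[0]
```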

ichorid commented 4 years ago

Indeed, it is a very good idea to go "bottom-up": check the individual stats received by the different methods and aggregate the data later. This is much better aligned with the decentralization philosophy. Also, this way we can truly parallelize data gathering.

Regarding the "very popular but not shown" torrents: these are "unknown unknowns", and I would not worry about them. Catching them would require passive listening to DHT traffic, akin to what some "download-revealing" websites do.

For PC2, I don't think we should be able to influence how other nodes calculate their PC stats, because that is too dangerous and easy to exploit. Instead, we can implement something like what I did with RemoteQueryCommunity: a generic PC-related query to get popularity info from users' DBs. This will allow us to crawl the network, get the base stats, and then model different ways to calculate the popularity rating locally, in the lab.

qstokkink commented 2 months ago

This suggestion has led to the creation of #7586 and further discussion (if needed) can take place there.