Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

content popularity community: performance evaluation #3868

Open synctext opened 6 years ago

synctext commented 6 years ago
For context, the long-term megalomaniac objectives (update Sep 2022):

| Layer | Description |
| --- | --- |
| User experience | perfect search in 500 ms and asynchronously updated :heavy_check_mark: |
| Relevance ranking | balance keyword matching and swarm health |
| Remote search | trustworthy peer which has the swarm info by random probability |
| Popularity community | distribute the swarm sizes |
| Torrent checking | (screenshot) |
  1. After completing the above, the next item: add tagging and update relevance ranking. Towards perfect metadata.
  2. De-duplication of search results.
  3. Also find non-exact matches: search for Linux, find items tagged Linux; the biggest Ubuntu swarm is shown first.
  4. Added to that is adversarial information retrieval for our Web3 search science, after the above is deployed and tagging is added: cryptographic protection of the above info. Signed data needs to have overlap with your web-of-trust, an unsolved hard problem.
  5. Personalised search.
  6. 3+ years ahead: row bundling.

@arvidn indicated: tracking popularity is known to be a hard problem.

I spent some time on this (or a similar) problem at BitTorrent many years ago. We eventually gave
up once we realized how hard the problem was. (specifically, we tried to pass around, via gossip,
which swarms are the most popular. Since the full set of torrents is too large to pass around,
we ended up with feedback loops because the ones that were considered popular early on got
disproportional reach).

Anyway, one interesting aspect that we were aiming for was to create a "weighted" popularity,
based on what your peers in the swarms you participated in thought was popular. in a sense,
"what is popular in your cohort".

We deployed the first version into Tribler in #3649, after prior Master's thesis research in #2783. However, we lack documentation or a specification of the deployed protocol.

Key research questions:

Concrete graphs from a single crawl:

Implementation of on_torrent_health_response(self, source_address, data). ToDo @xoriole: document the deployed algorithm in 20+ lines (swarm check algorithm, pub/sub, hash selection algorithm, handshakes, search integration, etc.).

xoriole commented 6 years ago

Popularity Community Introduction

The Popularity community is a dedicated community to disseminate popular/live content across the network. The content could be anything, e.g. the health of a torrent, a list of popular torrents, or even search results. Dissemination of the content follows the publish-subscribe model. Each peer in the community is both a publisher and a subscriber. A peer subscribes to a set of neighboring peers to receive their content updates, while it publishes its own content updates to the peers subscribing to it. (pub-sub diagram)

Every peer maintains a list of subscribing and publishing peers with whom it exchanges content. All content from non-subscribed publishers is simply refused. The selection of peers to subscribe to, or to publish to, greatly influences the dissemination of content, both genuine and spam. Therefore, we select peers based on a simple trust score. The trust score indicates the number of times we have interacted with a node, as measured by the number of mutual Trustchain blocks. The higher the trust score, the better the chance of being selected (as publisher or subscriber).
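
For illustration only (not the deployed code): a minimal sketch of trust-score-weighted peer selection, assuming a hypothetical `trust_scores` mapping from a peer to its number of mutual Trustchain blocks.

```python
import random


def select_peers(candidates, trust_scores, count):
    """Weighted random selection without replacement.

    candidates: list of peer objects; trust_scores: dict peer -> number of
    mutual Trustchain blocks. A higher trust score gives a higher chance of
    being picked as publisher or subscriber.
    """
    pool = list(candidates)
    weights = [trust_scores.get(peer, 0) + 1 for peer in pool]  # +1: unknown peers stay eligible
    selected = []
    for _ in range(min(count, len(pool))):
        pick = random.choices(pool, weights=weights, k=1)[0]
        index = pool.index(pick)
        pool.pop(index)
        weights.pop(index)
        selected.append(pick)
    return selected
```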

Research questions ...

synctext commented 5 years ago

ToDo: describe the simplified top-N algorithm that is more lightweight (no pub/sub): as-simple-as-possible gossip. Measure and plot the 4 graphs listed above.

synctext commented 4 years ago

Bumping this issue. The key selling point of Tribler 7.6 is a maturing popularity community (good enough for the coming 2 years) and superior keyword search using relevance ranking. Goal: 100k swarm tracking.

This has priority over channel improvements. Our process is to bump each critical feature to a superior design and move to the next. A key lesson within distributed systems is: you can't get it perfect the first time (unless you have 20 years of failure experience). Iteration and relentlessly improving deployed code are key.

After we close this performance evaluation issue we can build upon it. We need to know how well it performs and tweak it for 100k swarm tracking. We can do a first version of real-time relevance ranking. Read our 2010 work for background: Improving P2P keyword search by combining .torrent metadata and user preference in a semantic overlay

Repeating key research questions from above (@ichorid):

Concrete graphs from a single crawl:

synctext commented 4 years ago

See also #4256 for BEP33 measurements&discussion

synctext commented 4 years ago

Please check out @grimadas tool for crawling+analysing Trustchain and enhance this for the popularity community: https://github.com/Tribler/trustchain_etl

synctext commented 4 years ago

Hopefully we can soon add the health of the ContentPopularity Community to our overall dashboard.

xoriole commented 4 years ago


Currently, a peer shares the 5 most popular and 5 random torrents it has checked with its connected neighbors. Since a peer starts sharing from the beginning, it is not always the case that popular torrents are shared. This results in sharing torrents that don't have enough seeders (see the SEEDERS_ZERO count), which does not contribute much to the sharing of popular torrents. Two things that could improve the sharing of popular torrents (a rough sketch of the current selection policy follows the plot link below):

  1. not sharing zero seeder torrents
  2. increasing the initial buffer time before sharing is started

https://jenkins-ci.tribler.org/job/Test_tribler_popularity/plot/
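
A rough sketch of the selection policy described above (5 most popular plus 5 random), with the zero-seeder filter from improvement 1 included; `checked_torrents` is a hypothetical list of `(infohash, seeders, leechers)` tuples from the local torrent checker, not the actual Tribler data structure.

```python
import random


def select_torrents_to_gossip(checked_torrents, popular_count=5, random_count=5):
    """Pick the most popular checked torrents plus a few random ones.

    checked_torrents: iterable of (infohash, seeders, leechers) tuples.
    """
    # Improvement 1 from the list above: drop zero-seeder torrents.
    alive = [t for t in checked_torrents if t[1] > 0]
    by_seeders = sorted(alive, key=lambda t: t[1], reverse=True)
    popular = by_seeders[:popular_count]
    rest = by_seeders[popular_count:]
    return popular + random.sample(rest, min(random_count, len(rest)))
```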

devos50 commented 4 years ago

Nice work! I assume that this experiment is using the live overlay?

As a piece of advice, I would first try to keep the mechanism simple for now, while analyzing the data from the raw network (as you did right now). Extending the mechanism with (arbitrary) rules might lead to biased results, which I learned the hard way when designing the matchmaking mechanism in our decentralized market. Sharing the 5 popular and 5 random torrents might look like a naive sharing policy, but it might be a solid starting point to get at least a basic popularity gossip system up and running.

Also, we have a DAS5 experiment where popularity scores are gossiped around (which might actually be broken after some channel changes). This might be helpful to test specific changes to the algorithm before deploying them 👍 .

xoriole commented 4 years ago

@devos50 Yes, it is using live overlay.

Also, we have a DAS5 experiment where popularity scores are gossiped around (which might actually be broken after some channel changes). This might be helpful to test specific changes to the algorithm before deploying them.

Yes, good point. I'll create experiments to test the specific changes.

synctext commented 4 years ago

Thnx @xoriole! We now have our first deployment measurement infrastructure, impressive.

Can we (@kozlovsky @drew2a @xoriole) come up with a dashboard graph to quantify how far we are to our Key Performance Indicator: the goal of tracking 100k swarms? To kickstart the brainstorm:

increasing the initial buffer time before sharing is started

As @devos50 indicated, this sort of tuning is best reserved for last. You want to have an unbiased view of your raw data for as long as possible. Viewing raw data improves accurate understanding. {Very unscientific: we design this gossip stuff with intuition. If we had 100+ million users, people would be interested in our design principles.}

Repeating long-term key research questions from above (@ichorid):

ichorid commented 4 years ago
  1. not sharing zero seeder torrents

For every popular torrent, there are a thousand dead ones. Therefore, information about what is alive is much more precious and scarce than information about what is dead. It would be much more efficient to only share torrents that are well seeded.

Though, the biggest questions are:

ichorid commented 4 years ago
  • What is the resource consumption?
  • 3065 Fix for DHT spam using additional deployed service infrastructure

It would be very nice if we could find (or develop) some Python-based Mainline DHT implementation, to precisely control the DHT packet parameters.

  • How can we attack or defend this IPv8 community?
| :crossed_swords: attack | :shield: defence |
| --- | --- |
| spam stuff around | pull-based gossip |
| fake data | cross-check data with others |
| biased torrent selection | pseudo-random infohash selection (e.g. only send infohashes sharing some number of last bytes) |
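
To make the last defence concrete, a toy sketch of pseudo-random infohash selection: a peer only relays infohashes whose trailing byte falls in a bucket derived from the current time window, so a spammer cannot freely pick which of its infohashes get gossiped. The window/bucket scheme here is an assumption for illustration, not a specified protocol.

```python
import time


def current_bucket(window_seconds: int = 3600, buckets: int = 16) -> int:
    """Bucket index that all honest peers agree on for this time window."""
    return int(time.time() // window_seconds) % buckets


def allowed_to_gossip(infohash: bytes, window_seconds: int = 3600, buckets: int = 16) -> bool:
    """Relay an infohash only if its last byte maps to the current bucket."""
    return infohash[-1] % buckets == current_bucket(window_seconds, buckets)
```
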
drew2a commented 4 years ago

We still have no measurements for the "popularity community".

I will describe a few experiments below. Maybe they will be helpful for developing a more scientific approach.

Metric 1: how fast a new user can get the list of popular torrents

Experiment 1.1

Given a network of 100 nodes. Each node has a list of 100K popular torrents. We add a new node to the network.

Question: how long will it take to deliver a 100K list to the new node?

Experiment 1.2:

The same as 1.1, but 1K nodes.

Metric 2: how fast an empty network will collect 100K popular torrents

Experiment 2.1

Given a network of 100 nodes. Each node has an empty list of popular torrents.

Question: how long will it take to get a 100K list on each node?

Additional data: bandwidth consumption over time. Additional data: list filling over time.

Experiment 2.2:

The same as 2.1, but 1K nodes.

Metric 3: how heterogeneous the lists are

Experiment 3

Given: the network after experiment 2.1. We calculate the common part of all popular lists (the common part of the "100k torrent list" on each node) [hereinafter CommonList].

Question: what is the size of the CommonList (in percent)?

Metric 4: quality of a popular list

Experiment 4

Given: the network after experiment 2.1. We calculate the CommonList. We somehow obtain a reference list (maybe a static "human-made" list) [hereinafter ReferenceList]. We compare the CommonList to the ReferenceList.

Question: which percentage of the CommonList and the ReferenceList is the same?
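
Metrics 3 and 4 reduce to set intersections; a minimal sketch, assuming each node's popular list is given as a collection of infohashes:

```python
def common_list(node_lists):
    """Metric 3: intersection of all nodes' popular-torrent lists."""
    sets = [set(lst) for lst in node_lists]
    common = set.intersection(*sets)
    avg_size = sum(len(s) for s in sets) / len(sets)
    overlap_pct = 100.0 * len(common) / avg_size if avg_size else 0.0
    return common, overlap_pct


def reference_overlap_pct(common, reference_list):
    """Metric 4: share of the ReferenceList that is also in the CommonList."""
    reference = set(reference_list)
    return 100.0 * len(common & reference) / len(reference) if reference else 0.0
```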

devos50 commented 4 years ago

@drew2a good suggestions, thanks!

Note that many of these experiments can be performed in isolation on our (nation-wide) compute cluster, the DAS5. We have the necessary tools in Gumby to easily create new overlays on this cluster and to connect peers with each other. Gumby also allows plotting of resource consumption. Hopefully, we will soon have a few additional servers operational to conduct experiments.

We already have a basic DAS5 experiment that starts a few Tribler instances and shares popularity vectors. This would be a good starting point for anyone that quickly wants to evaluate the effectiveness of popularity gossip strategies. However, I believe that this issue concerns a performance evaluation of our live network.

ichorid commented 4 years ago

"Experiment 1.1" is basically the experiment @devos50 made for the first implementation of GigaChannels two year ago. The answer is: "very fast, but gets increasingly slower as the list grows". Also, I doubt there are over 10k popular torrents at any moment in the whole BitTorrent network. Some (dated) insights into the BitTorrent network can be found in @synctext 's seminal works.

There are about 20 popular trackers, each one serving about 0.5-2M torrents. Less than 1% of torrents in any category are "alive". See torrents.csv project. Overall, torrent popularity is pretty transient.

ichorid commented 4 years ago

"Experiment 2" depends on two factors:

AFAIK, one can't simply go to a BitTorrent node and ask it for its list of seeded infohashes (:eye: :ok_hand:). Instead, one must know the infohash and ask the client about it. Maybe there is some BEP extension that implements querying clients for lists of infohashes, but that would be a great privacy hole (and thus highly unlikely to be accepted by the BitTorrent community).

Also, the DHT has some flood protection that has already taken a toll on our developers.

synctext commented 4 years ago

Great fan of these concrete experiments to collect hard data. We need performance data from emulation. Hence the dashboard idea.

First priority for the coming sprint weeks: build a backwards-compatible PopularityCommunity, fix all known bugs in there, and try to boost performance. Preferably performance is improved even with the existing deployed community.

We thus do integration testing, compatibility testing, regression testing, and performance analysis into the Multi-Aspect Sprint Cycles :-) Big step forward: https://github.com/xoriole/tribler/blob/popularity-helper/src/tribler-core/run_popularity_helper.py

ichorid commented 4 years ago

From this paper

We trace the popularity of those objects by counting the number of requests they receive per week for the entire eight months of our measurement study. Fig. 4 shows that popular objects gain popularity in a relatively short timescale reaching their peak in about 5–10 weeks. The popularity of those objects drops dramatically after that. As the figures show, we observe as much as a sixfold decrease in popularity in a matter of 5–10 weeks.


According to that paper, the content popularity follows Mandelbrot-Zipf distribution.

Unfortunately, I (almost) completely forgot my calculus course, so I can't integrate anymore (except for the simplest stuff). Now, if we had a mathematician who could integrate the Mandelbrot-Zipf distribution and fit the total number of entries to the already known BitTorrent network stats (see my post above about 40M torrents)... Then we could predict the peak swarm size and tune our experiments accordingly...
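
For reference, the Mandelbrot-Zipf rank-frequency form from that paper is p(i) ∝ 1 / (i + q)^α. No integration is needed for a first estimate; a small sketch that normalizes it numerically and reports how much of the total demand the head of the distribution would capture (the parameters below are placeholders, not fitted to BitTorrent data):

```python
def mandelbrot_zipf(n_ranks, alpha, q):
    """Normalized Mandelbrot-Zipf probabilities for ranks 1..n_ranks."""
    raw = [1.0 / (rank + q) ** alpha for rank in range(1, n_ranks + 1)]
    total = sum(raw)
    return [value / total for value in raw]


# Placeholder parameters: share of all requests captured by the top 1000 ranks
# if popularity over 100k tracked swarms followed this law.
probs = mandelbrot_zipf(n_ranks=100_000, alpha=0.8, q=20.0)
print(f"top-1000 share: {sum(probs[:1000]):.1%}")
```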

@alexander-stannat ?

xoriole commented 4 years ago

Results of the popularity community from a continuous 1-hour execution. Plots available in Jenkins.

ichorid commented 4 years ago

we ended up with feedback loops because the ones that were considered popular early on got disproportional reach

I've come up with a simple algorithm to solve the feedback problem. The basic idea is to emulate how news spreads through human society:

  1. share more important thoughts more often.
  2. over time, the urge to share a thought diminishes ("become bored of the idea").
  3. if you hear other people repeating your idea, reduce the urge to share it (the "old news" effect).
  4. if someone's claim contradicts your thoughts to the point it would change your behaviour, check the fact yourself. If the check fails, reject the claim and notify the claimer.

This list of rules guarantees that information about popular torrents will propagate quickly (1), but will not dominate the gossip (3) and will die out naturally (2). Also, it prevents the spreading of "fake news" (4). A minimal sketch is shown below.
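
A minimal sketch of rules 1-3 (rule 4, the fact check, would invoke the torrent checker and is left out). Each item carries an "urge" that starts proportional to its popularity, decays over time, and drops further whenever the same claim is heard back from another peer; each gossip round sends the highest-urge items. This is an illustration of the idea, not deployed code.

```python
import time


class GossipItem:
    def __init__(self, infohash, seeders):
        self.infohash = infohash
        self.seeders = seeders
        self.urge = float(seeders)          # rule 1: more popular => shared more often
        self._last_decay = time.time()

    def decay(self, half_life=3600.0):
        """Rule 2: the urge to share a claim diminishes over time."""
        elapsed = time.time() - self._last_decay
        self.urge *= 0.5 ** (elapsed / half_life)
        self._last_decay = time.time()

    def heard_back(self):
        """Rule 3: hearing others repeat the claim makes it 'old news'."""
        self.urge *= 0.5


def items_to_gossip(items, count=10):
    """Pick the items we currently feel the strongest urge to share."""
    for item in items:
        item.decay()
    return sorted(items, key=lambda i: i.urge, reverse=True)[:count]
```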

EDIT: basically, already invented in "Top-k Item Identification on Dynamic and Distributed Datasets"

synctext commented 3 years ago

Just bumping this issue in importance. We need to fix this community. "Donate my VPN bandwidth to Tribler", that would solve matters. With many IPv4 addresses we can crawl torrents and even join them to check the ground truth. Crawl the contents of channels you joined and gossip.

synctext commented 3 years ago

We started working on this on Feb 6, 2017; see issue #2783. That is over 4 years and 2 months! The ambition level is now reduced. This works only if:

Something is starting to work within channels :heavy_check_mark: (screenshot)

xoriole commented 3 years ago

The graph below shows the number of times different popular torrents were received by a single Tribler peer over a period of 21 hours via the Popularity community. There are over 23k torrents, but the graph shows only the top 1000. It is a long tail.

Shared torrents distribution - message (in 21 hours)

Top 10 shared torrents: (screenshot)

The graph below shows the number of times different peers shared popular torrents with the observer Tribler peer over the same time period. Shared torrents distribution - node (in 21 hours)

The graph below shows the torrent distribution and difference in the seeder count across all messages. Shared torrents distribution w_ seeder diff%

Out of 23k torrents, the majority of the torrents shared have a zero (or low) seeder count. (Note that these are not dead torrents.) There are hundreds of torrents shared thousands of times with the same health (seeder) count by several peers. This information can likely be used to determine the trustworthiness of the received torrent health information and/or of the sender peer.

Points of discussion:

  1. If several peers share the same health information over time, can this health information be trusted? If yes, what could be acceptable criteria? (A hedged sketch of such a corroboration check follows this list.)
  2. The most popular torrent health information was received over 5k times from 160 peers in 21 hours (almost every 15 seconds). This sharing can be made less aggressive and more inclusive, so that more torrents are included instead of repeating the same torrents.
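
As a hedged sketch of discussion point 1: accept a health claim only once a minimum number of distinct peers have reported roughly the same seeder count within a time window. All thresholds below are arbitrary placeholders, not measured values.

```python
import time
from collections import defaultdict


class HealthCorroborator:
    """Trust a (infohash, seeders) claim only after several peers agree on it."""

    def __init__(self, min_peers=3, window=3600.0, tolerance=0.2):
        self.min_peers = min_peers
        self.window = window              # seconds a report stays valid
        self.tolerance = tolerance        # allowed relative deviation in seeders
        self.reports = defaultdict(list)  # infohash -> [(peer_id, seeders, timestamp)]

    def add_report(self, infohash, peer_id, seeders):
        now = time.time()
        fresh = [r for r in self.reports[infohash] if now - r[2] < self.window]
        fresh.append((peer_id, seeders, now))
        self.reports[infohash] = fresh

    def is_trusted(self, infohash, seeders):
        agreeing = {peer for peer, reported, _ in self.reports[infohash]
                    if abs(reported - seeders) <= self.tolerance * max(seeders, 1)}
        return len(agreeing) >= self.min_peers
```
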
synctext commented 3 years ago

Great work! As discussed yesterday: please do not make any radical changes and no Bloom filters. There is a systematic bias towards repeating the same torrents. That leads to duplicate information. Please do not use any new ideas. Just remove the bias for big swarms, deploy, and measure the resulting improvement. This field is ancient (but all this prior work ignored security and is therefore of limited use; with Trustchain we moved beyond this). Things like "peer sampling" without a web of trust leave your system defenceless against spam or Sybil attacks. Feel free to take the time to understand much of the prior work discussed in these 205 slides. Note especially the naive security assumption on slide 100: they test with 2% attackers in the overlay for a "secure peer sampling" paper. http://sbrc2010.inf.ufrgs.br/resources/presentations/tutorial/tutorial-montresor.pdf

<rant>The key idea is to keep things as simple as possible. Don't needlessly complicate things. This is very weird, but optimisation usually leads to complexity. For gossip protocols, when you add complexity you're doing it wrong. You need to think differently: randomness and some repetition create robustness, resilience, and strength. Obviously, identical messages repeating thousands of times are wrong. It is quite complicated to create simple systems. That is why science has so far failed to make re-usable gossip tooling. Tribler is designed to pioneer such simple, proven building blocks: not by starting out with a generic tool, but by first making something that works for a million people and evolving it in the years to come.</rant>

xoriole commented 3 years ago

The graphs below show the observation of the Popularity Community by a single Tribler peer over a period of ~21 hours. Experiment 1: current behavior of the community (v7.9.0-RC1). Experiment 2: updated to use the random_walk and remove_peers strategies so that the observer peer can find more peers in the network.

Experiment 1 graphs are the same as the previous comment in this issue


Experiment 1: Shared torrents distribution - message (in 21 hours)

Experiment 2: Shared torrents distribution - message (in 21 hours)


Experiment 1: Shared torrents distribution - node (in 21 hours)

Experiment 2: Shared torrents distribution - node (in 21 hours)


Experiment 1: Shared torrents distribution w/ seeder diff%

Experiment 2: Shared torrents distribution w/ seeder diff%


Observations:

ichorid commented 3 years ago

The effect of the change in experiment 2 is limited to the observer node only; it would be interesting to see the network behavior when more nodes participate with the changes included. This will require a separate network experiment.

I suggest converting Popularity Community to pull-based gossip, RQC-style. That will allow for much easier experimentation and spam-resistance.

Alternatively, stop sending popular torrents altogether and just send random torrents instead. That should flatten the curve.

synctext commented 3 years ago

stop sending popular torrents altogether and just send random torrents instead.

Great idea. Please try to get this deployed for the next release. No changes to any other parts of this community. Let's see how that compares.

xoriole commented 3 years ago

Observations from experiment 3: current behavior of v7.10-exp1. Important changes:

Graph: x-axis represents the unique torrents received by the observer node

Experiment 3: Shared torrents distribution - message (in 21 hours)

A few popular torrents were shared a large number of times. Instead of flattening the curve, we obtained a sharper peak at 15k (compared to 5k, 6k in earlier experiments)


Experiment 3: Shared torrents distribution - node (in 21 hours). The number of peers discovered is high, which is expected considering the change in strategy.


Experiment 3: Shared torrents distribution w/ seeder diff%

Torrent distribution considering the difference in seeder values is consistent with the previous experiment.

ichorid commented 3 years ago

A few popular torrents were shared a large number of times. Instead of flattening the curve, we obtained a sharper peak at 15k (compared to 5k, 6k in earlier experiments)

Are you running your experiments on the main Tribler network? If so, this is expected, because you use push-based gossip, meaning that the only thing that effectively changed for a single host running 7.10 in a sea of 7.9 peers is faster peer discovery. Which, indeed, should sharpen the peak.

xoriole commented 3 years ago

Are you running your experiments on the main Tribler network? If so, this is expected, because you use push-based gossip, meaning that the only thing that effectively changed for a single host running 7.10 in a sea of 7.9 peers is faster peer discovery. Which, indeed, should sharpen the peak.

Yes, it is on the main Tribler network. Seeing almost 3 times the earlier peak was not something I expected. I think you're right: faster discovery and the abundance of v7.9 peers, which send combined popular-and-random torrent messages, is responsible for the spike. It should decline once there are more v7.10 nodes, since the frequency of sharing popular torrents is reduced in v7.10.

As an aside, I propose adding a message type to request/respond with the client version in RemoteQueryCommunity. It'll be useful in experiments to confirm the version distribution.

synctext commented 3 years ago

The Popularity community is starting to work nicely. Tribler 7.10 keyword search finds good swarms. :grinning: :partying_face: :grinning:

Yet another scientific challenge for 2022 is reducing the altruistic peer discovery time. Or fork the BitTorrent protocol and only create swarms with proper altruism (seed for 2 years). We see how few peers in swarms actively upload. For instance, this new swarm reportedly has 375 seeders. It typically takes 60-300 seconds before you find the altruistic seeders: (screenshot)

synctext commented 2 years ago

In a few days the new 7.12 release will be out! :clap: Please post the latest performance analysis of your algorithms here @xoriole.

synctext commented 2 years ago

New measurement results are hopefully posted here by @xoriole for 6 Sep 2022 Dev meeting :crossed_fingers:

xoriole commented 2 years ago

The graph below shows the number of torrents received (unique & total), total messages, and peers discovered per day by the crawler running the Popularity Community in observer mode for 95 days. The crawler runs with an extended discovery booster, which leads to discovering more torrents.

Popularity Community (95 days) - updated

synctext commented 2 years ago

Comments on this measurement:

drew2a commented 2 years ago

@xoriole please correct me if I'm wrong. In case the crawler uses the default DiscoveryBooster, neighborhood_size should be equal to 25 and edge_length should be equal to 25.

https://github.com/Tribler/tribler/blob/912c6f0ab95f30be550067f9778db1df1df18ac9/src/tribler/core/components/ipv8/discovery_booster.py#L53-L56

xoriole commented 2 years ago

Frozen experiment

Seeders (reported and checked)

Same measurement, but now reporting leechers instead of seeders. Absolute number of leechers, using a standard linear scale:

Leechers (reported and checked)

Now we represent the same numbers as percentages in an attempt to normalize the values.

Seeders % = ( reported seeders / checked seeders ) x 100 %
Leechers % = ( reported leechers / checked leechers ) x 100 %
Peers % = ( (reported seeders + reported leechers)  / (checked seeders + checked leechers) ) x 100 %

Peers and seeders (2) Peers and leechers (2)

Observations

synctext commented 2 years ago

Note that remote search results and the popularity community differ in algorithm. BEP33 and central swarms are simply not sufficiently reliable for the robust, attack-resilient, quality product that we as scientists strive for. The problem we are trying to solve is not accurate statistics, but simply telling a "bad swarm" from a "good swarm". We need more experiments around libtorrent join stats.

Next sprint: understand popularity community ground truth?

xoriole commented 2 years ago

Popularity community experiment

The purpose of the experiment is to see how the torrent health information received via the popularity community differs when checked locally by joining the swarm.

From the popularity community, we constantly receive a set of tuples (infohash, seeders, leechers, last_checked) representing the popular torrents with their health (seeders, leechers) information. This health information is supposed to be obtained by the sender by checking the torrents themselves, so the expectation is that the information is relatively accurate and fresh.

In the graph below, we show how the reported (or received) health info and checked health info differ for the 24 popular torrents received via the community.

First considering the seeders. Since the variation in the number of seeders for different torrents is high, a logarithmic scale is used. Sept - Seeders (reported and checked)

Similarly for the leechers, again logarithmic scale is used. Sept - Leechers (reported and checked)

Here the individual torrents are unrelated to each other and can be more or less popular depending on what content they represent, so seeders, leechers, and peers (= seeders + leechers) are represented as a percentage of their reported value in an attempt to normalize them.

Seeders % = ( checked seeders / reported seeders ) x 100 %
Leechers % = ( checked leechers / reported leechers ) x 100 %
Peers % = ( ( checked seeders + checked leechers) / ( reported seeders + reported leechers ) ) x 100 %

Peers and seeders

Peers and leechers

Observations

synctext commented 2 years ago
Writing down our objectives here:

| Layer | Description |
| --- | --- |
| Relevance ranking | It is shown to the user within 500 ms and asynchronously updated |
| Remote search | trustworthy peer which has the swarm info by random probability |
| Popularity community | distribute the swarm sizes |
| Torrent checking | (screenshot) |

  1. Add tagging and update relevance ranking. Towards perfect metadata.
  2. Added to that is adversarial information retrieval for our Web3 search science, after the above is deployed and tagging is added: cryptographic protection of the above info. Signed data needs to have overlap with your web-of-trust, an unsolved hard problem.

Background: getting this all to work is similar to making a distributed Google. Everything needs to work, and needs to work together. Already in 2017 we tried to find the ground truth on the perfectly matching swarm for a query. We have a minimal swarm crawler (2017): "Roughly 15-KByte-ish of cost for sampling a swarm (also receive bytes?). Uses magnet links only. 160 Ubuntu swarms crawled."

  • Documented torrent checking algorithm?
  • Documented popularity community torrent selection and UDP/IPv8 packet format?
  • Readthedocs example: "latest/search_architecture.html"

synctext commented 1 year ago

Initial documentation of deployed Tribler 7.12 algorithms

xoriole commented 1 year ago

Repeating the Popularity community experiment here.

Similar to the experiment done in September, here we show how the reported (or received) health info and checked health info differ for the 24 popular torrents received via the community.

The numbers represented in the graph are count values and the scale used in the graph is logarithmic for better comparison since the variation in the values is large.

A. Based on count

Dec-Seeders (reported and checked)

Dec-Leechers (reported and checked)

B. Normalized in percentages

Seeders % = ( checked seeders / reported seeders ) x 100 %
Leechers % = ( checked leechers / reported leechers ) x 100 %
Peers % = ( ( checked seeders + checked leechers) / ( reported seeders + reported leechers ) ) x 100 %

Dec - Peers and seeders Dec - Peers and leechers


Observations

absolutep commented 1 year ago

Overall, the seeders, leechers, and peers percentages have decreased significantly compared to the September measurement.

I would point out another reason: a lower number of users using Tribler might skew the results, or at least give an erratic response.

I do not know why, but it seems that the userbase has decreased quite a lot.

For newer torrents, I get download/upload speeds of around 20 MBps in qBittorrent (without VPN), but in Tribler I hardly cross a maximum of 4 MBps (without hops).

Is this because of the low number of users, an inability to connect to peers, or cooperative downloading - something I have no technical knowledge of?

synctext commented 1 year ago

@absolutep Interesting thought, thx! We need to measure that and compensate for it.

@xoriole The final goal of this work is to either write or contribute the technical content to a (technical/scientific) paper, like: https://github.com/Tribler/tribler/files/10186800/LTR_Thesis_v1.1.pdf

  • We're very much not ready for machine learning. But for publication results it is strangely easy to mix measurements of a 17-years-deployed system with simplistic Python Jupyter notebooks and machine learning.
  • Key performance indicator: zombies in top-N (1000).
  • Agree with the key point you raised: stepping out of the engineering mindset. Basically we're spreading data nicely and fast, it's only a bit wrong (e.g. 296.44% :joy:).
  • Lesson learned: we started simple, working, and inaccurate. Evolved complexity: we need a filter step and we measure again later in time (e.g. re-measure, re-confirm popularity). Reactive, pro-active, or emergent design. Zero-trust architecture: trust nobody but yourself. We have no idea actually. So just build, deploy, and watch what happens.
  • Actually, we need to know the root cause of failure. Without understanding the reason for the wrong statistics, we're getting nowhere. Can we reproduce the BEP33 error, for instance? Therefore: analysis of 1 month of system dynamics and faults.
  • Scientific related work (small sample from this blog on Google/YouTube): the scientific problem is item ranking. What would be interesting to know: how fast does the frontpage of YouTube change with the most-popular videos? Scientific article by Google: Deep Neural Networks for YouTube Recommendations.

synctext commented 1 year ago

Discussed progress. Next sprint: how good are the popularity statistics with the latest 12.1 Tribler (filtered results, compared to ground truth)? DHT self-attack issue to investigate next?

xoriole commented 1 year ago

Comparing the results from naked libtorrent and Tribler, I found that when the popular torrents received via the popularity community are checked locally, the torrent check often reports them as dead, which is likely not the case. This is because of an issue in the torrent checker (DHT session checker). After BEP33 was removed, the earlier way of getting the health response mostly returns zero seeders and zero or some leechers; in the UI this shows as

drew2a commented 1 year ago

Could this bug (https://github.com/Tribler/tribler/issues/6131) relate to the described issues?

xoriole commented 1 year ago

Could this bug (#6131) relate to the described issues?

Yes, it is the same bug.

drew2a commented 1 year ago

While working on https://github.com/Tribler/tribler/pull/7286 I've found a strange behavior that may shed light on some of the other oddities.

If TorrentChecker performs a check via a tracker, then the returned values always look ok-ish (like 'seeders': 10, 'leechers': 77).

If TorrentChecker performs a check via DHT, then the returned seeders are always equal to 0 (like 'seeders': 0, 'leechers': 56).

Maybe it is the bug that @xoriole describes above.


UPDATED 03.02.22 after verification from @kozlovsky

I also found that literally all automatic checks in TorrentChecker were broken (originally I thought it was just one).

There are three automatic checks: https://github.com/Tribler/tribler/blob/87916f705eb7e52da828a14496b02db8d61ed5e9/src/tribler/core/components/torrent_checker/torrent_checker/torrent_checker.py#L72-L75

The first (check_random_tracker) is broken because it performs the check but doesn't save the results into the DB:

https://github.com/Tribler/tribler/blob/87916f705eb7e52da828a14496b02db8d61ed5e9/src/tribler/core/components/torrent_checker/torrent_checker/torrent_checker.py#L159-L163

The second (check_local_torrents) is broken because it calls an async function in a sync way (which doesn't lead to the execution of the called function).

The third (check_torrents_in_user_channel) is also broken because it calls an async function in a sync way (which doesn't lead to the execution of the called function). A minimal illustration of this bug pattern is shown below.
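
For illustration of the bug pattern (simplified, not the actual Tribler code): calling a coroutine function without awaiting it only creates a coroutine object, so its body never runs; scheduling it as a task, or awaiting it, fixes that.

```python
import asyncio


async def check_local_torrents():
    print("checking local torrents...")  # stands in for the real tracker/DHT check


def broken_caller():
    # BUG: this only creates a coroutine object; the check never executes
    # (Python emits a "coroutine ... was never awaited" warning later).
    check_local_torrents()


def fixed_caller():
    # FIX: schedule the coroutine as a task on the running event loop.
    asyncio.get_running_loop().create_task(check_local_torrents())


async def main():
    broken_caller()        # prints nothing
    fixed_caller()         # prints once the task gets a turn
    await asyncio.sleep(0)


asyncio.run(main())
```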

CC: @kozlovsky

drew2a commented 1 year ago

Also, I'm posting an example of the algorithm for getting the seeders and leechers in case there is more than one source of information available.

  1. TorrentChecker checks the seeders and leechers for an infohash.
  2. TorrentChecker sends a DHT request and a request to a tracker.
  3. TorrentChecker receives two answers, one from DHT and one from the tracker:
    • DHT_response = {"seeders": 10, "leechers": 23}
    • tracker_response = {"seeders": 4, "leechers": 37}
  4. TorrentChecker picks the answer with the maximum seeders value. Therefore the result is:
    • result = {"seeders": 10, "leechers": 23}
  5. TorrentChecker saves this information to the DB (and propagates it through PopularityCommunity later).

Proof: https://github.com/Tribler/tribler/blob/87916f705eb7e52da828a14496b02db8d61ed5e9/src/tribler/core/components/torrent_checker/torrent_checker/torrent_checker.py#L320-L324

Intuitively, this is not the correct algorithm. Maybe we should use the mean function instead of the max.

Something like:

```python
from statistics import mean

DHT_response = {'seeders': 10, 'leechers': 23}
tracker_response = {'seeders': 4, 'leechers': 37}

# Average the two sources per field instead of taking the maximum.
result = {key: mean([DHT_response[key], tracker_response[key]])
          for key in ('seeders', 'leechers')}

print(result)  # {'seeders': 7, 'leechers': 30}
```

Or we might prioritize the sources. Let's say:

  1. Tracker (more important)
  2. DHT (less important)
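
A sketch of that prioritization, using the same hypothetical response dicts: take the tracker answer when it is present and non-trivial, and fall back to the DHT answer otherwise.

```python
def combine_health_responses(tracker_response=None, dht_response=None):
    """Prefer the tracker (source 1); fall back to DHT (source 2)."""
    if tracker_response and tracker_response.get('seeders', 0) > 0:
        return tracker_response
    if dht_response:
        return dht_response
    return tracker_response or {'seeders': 0, 'leechers': 0}


print(combine_health_responses(tracker_response={'seeders': 4, 'leechers': 37},
                               dht_response={'seeders': 10, 'leechers': 23}))
# -> {'seeders': 4, 'leechers': 37}
```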