content popularity community: performance evaluation

synctext commented 6 years ago

For context, the long-term megalomaniac objectives (update Sep 2022):	Layer	Description
User experience	perfect search in 500 ms and asynchronously updated :heavy_check_mark:
Relevance ranking	balance keyword matching and swarm health
Remote search	trustworthy peer which has the swarm info by random probability
Popularity community	distribute the swarm sizes
Torrent checking

After completing the above, next item: Add tagging and update relevance ranking. Towards perfect metadata.
De-duplication of search results.
Also find non-matching info. Search for Linux, find items tagged Linux, Biggest Ubuntu swarm is shown first.
Added to that is adversarial information retrieval for our Web3 search science. After above is deployed and tagging is added. Cryptographic protection of above info. Signed data needs to have overlap with your web-of-trust, unsolved hard problem.
personalised search
3+ years ahead: row bundling

@arvidn indicated: tracking popularity is known to be a hard problem.

I spent some time on this (or a similar) problem at BitTorrent many years ago. We eventually gave
up once we realized how hard the problem was. (specifically, we tried to pass around, via gossip,
which swarms are the most popular. Since the full set of torrents is too large to pass around,
we ended up with feedback loops because the ones that were considered popular early on got
disproportional reach).

Anyway, one interesting aspect that we were aiming for was to create a "weighted" popularity,
based on what your peers in the swarms you participated in thought was popular. in a sense,
"what is popular in your cohort".

We deployed the first version into Tribler #3649 , after prior Master thesis research #2783. However, we lack documentation or specification of the deployed protocol.

Key research questions:

What is the real deployed system behavior?
What is the resource consumption?
What is the accuracy and quality in general of the information?
How can we attack or defend this IPv8 community?

Concrete graphs from a single crawl:

messages and bandwidth in time
hashes discovery and duplicates
distribution of discovered popularity of swarms
conduct swarm popularity check and compare results in real-time
behavior of pub/sub mechanism for popularity feed
dynamics of trust-based pub/sub auto-subscribe

Implementation of on_torrent_health_response(self, source_address, data) ToDo @xoriole : document deployed algorithm in 20+ lines (swarm check algorithm, pub/sub, hash selection algorithm,handshakes, search integration, etc.).

kozlovsky commented 1 year ago

I also have found that literally all automatic checks in TorrentChecker were broken. The first (check_random_tracker) is broken because it performs the check, but didn't save the results into DB:

I suspect you are right, and this check does not store the received results in the DB

The second (check_local_torrents) is broken because it calls an async function in a sync way (which doesn't leads to the execution of the called function): The third (check_torrents_in_user_channel) is also broken because it calls an async function in a sync way (which doesn't leads to the execution of the called function):

I think these checks work properly. The function they call is not actually async:

@task
async def check_torrent_health(self, infohash, timeout=20, scrape_now=False):
    ...

This function appears to be async, but it is actually synchronous. The @task decorator converts an async function to a sync function that starts an async task in the background. This is non-intuitive, and PyCharm IDE does not understand this.

We should probably replace the @task decorator with something that IDE better supports to avoid further confusion.

synctext commented 1 year ago

New PR for torrent checking. After the release is done of Tribler 7.13 we can re-measure the popularity community. ToDo @xoriole. The new code deployment will hopefully fix issues and improve the general health and efficiency of popularity info for each swarm.

synctext commented 1 year ago

The graph below shows the received number of torrents (unique & total), total messages and peers discovered per day by the crawler running Popularity Community in observer mode for 95 days. The crawler is running with an extended discovery booster which leads to discovering more torrents.

Lets focus on the latest 7.13 release and re-measure. Can we find 1 million unique swarm in 50 days? How many long-tail swarms are we finding. Did we achieve the key goal of tracking 100k swarms for the {rough} popularity. If we succeeded we can move on and focus fully on bug fixing, tooling, metadata enrichment, tagging and semantic search.

@egbertbouman Measure for 24 hours the metadata at runtime we get from the network:

unknown hashes and duplicates from PopularityCommunity
- seed/leech of each swarm
- (in later experiment) determine ground truth
unknown hashes and duplicates from FreeForAll in giga-channels (? through inheretance of Gigachannel, https://github.com/Tribler/tribler/blob/main/src/tribler/core/components/metadata_store/remote_query_community/remote_query_community.py)
(???)unknown hashes and duplicates from channels pre-views and sampling ???
Overlap between these sources of metadata?

egbertbouman commented 1 year ago

Just did a simple experiment with Tribler running idle for 24h in order to get some idea about which mechanisms in Tribler are primarily responsible for getting torrent information. It turns out that the vast majority of torrents are discovered through the lists of random torrents gossiped within the PopularityCommunity.

torrent_collecting

When zooming in on the first 10 minutes, we see that a lot of discovered torrents are discovered by the GigaChannelCommunity (preview torrents). The effect of preview torrent quickly fades as preview torrents are only collected onup the discovery of new channels.

torrent_collecting

While examining the database after the experiment had completed, it turned out that a little under 80% of all torrent were "free-for-all" torrents, meaning that they did not have a public_key/signature and were put on the network after a user manually requested the torrentinfo.

ichorid commented 1 year ago

Also, if you'd look at the top channels, they all were updated 2-3 years ago. That means their creators basically dumped a lot of stuff and then forgot about it, lose interest in it. But their creations continue to dominate the top, which is now a bunch of :zombie: :zombie: :zombie:

"Free-for-all" torrents gossip is the most basic form of collective authoring, and it trumped over the channels in real usage (e.g., search and top torrents) as @egbertbouman demonstrated above. This means that without a collective editing feature Channels are useless and misleading.

Overall, I'd say Channels must be removed (especially because of its clumsy "Channel Torrent" engine), and replaced with some more bottom-up system, e.g. collectively edited tags.

synctext commented 1 year ago

:astonished: :astonished: :astonished: 650 torrents are collected by Tribler in the first 20 seconds after startup?

When zooming in on the first 10 minutes, we see that a lot of discovered torrents are discovered by the GigaChannelCommunity (preview torrents).

Now we understand a big source of performance loss. Gigachannels is way too agressive in the first seconds and first minute of startup. No rate control or limit of IO or networking (32 torrents/second). Should be smoother and first rely on RemoteSearch results. Tribler will get unresponsive I believe on slow computers with this aggressive content discovery. Great work on the performance analysis! We should do that with all our features. No fix needed now! Only awareness that we need to monitor our code consistently for performance.

synctext commented 12 months ago

Web3Discover: Trustworthy Content Discovery for Web3

@xoriole would be great to write a scientific arxiv paper on this, beyond traditional developer role. Also contributing to performance evaluation and scientific core. See IPFS example reporting, tooling, and general pubsub work plus "GossipSub: Attack-Resilient Message Propagation in the Filecoin and ETH2.0 Networks".

Problem description, design, Graphs, performance analysis, and vital role for search plus relevance ranking. Timing: finish before Aug 2024 (end contract).

grimadas commented 1 month ago

The problem remains relevant to this day. I see two main issues and possible directions for improvement:

Zero-trust architecture is expensive. If I only trust my own node, I need to check all torrents, which is a costly operation. Solution: Enhance the torrent checker with reputation information. Include the source when you receive information about torrents checked by others. When verifying torrent health, cross-check with other reported sources. If the reported information is significantly inaccurate, downgrade the source's reputation. If the information seems "good enough," slightly increase the source's reputation. This approach enables us to have torrent information enhanced with network health checks, not just checks from our own node. Challenge: Torrent health is always dynamic data. We need a reliable estimation to determine what is significantly inaccurate and what is "good enough."
Biased health checks are not optimal. Currently, many peers perform redundant work by repeatedly checking the same popular torrents. Can we improve this process? Perhaps we could integrate information about who has already checked the torrents and how much we trust them. Simplistic gossip mechanism: Randomly send information about 5 popular, checked torrents. While this could lead to repetition and over-redundancy, we need to demonstrate that this is indeed a problem (using a graph). With baseline metrics, we can identify opportunities for further improvement.

Tribler / tribler

content popularity community: performance evaluation #3868

Web3Discover: Trustworthy Content Discovery for Web3