Open synctext opened 6 years ago
I also found that literally all automatic checks in TorrentChecker were broken. The first (check_random_tracker) is broken because it performs the check but does not save the results to the DB:
I suspect you are right; this check does not store the received results in the DB.
The second (check_local_torrents) is broken because it calls an async function in a sync way (which does not lead to the execution of the called function). The third (check_torrents_in_user_channel) is broken in the same way: it also calls an async function synchronously.
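For context, the failure mode described here, calling a coroutine function without awaiting it, can be reproduced with a minimal standalone sketch (the function below is a stand-in, not Tribler's actual code):

```python
import asyncio

async def check_torrent_health(infohash):
    # Stand-in for the real async health check.
    return {"infohash": infohash, "seeders": 0}

def broken_caller():
    # BUG pattern: calling an async function like a normal function only
    # creates a coroutine object; the function body never executes.
    return check_torrent_health(b"\x00" * 20)

result = broken_caller()
print(asyncio.iscoroutine(result))  # True: nothing actually ran
result.close()  # close the coroutine to avoid a "never awaited" warning
```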
I think these checks work properly. The function they call is not actually async:
```python
@task
async def check_torrent_health(self, infohash, timeout=20, scrape_now=False):
    ...
```
This function appears to be async, but it is actually synchronous. The `@task` decorator converts an async function into a sync function that starts an async task in the background. This is non-intuitive, and the PyCharm IDE does not understand it. We should probably replace the `@task` decorator with something the IDE supports better, to avoid further confusion.
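A decorator with the behaviour described above can be sketched roughly as follows (an illustrative reimplementation, not the actual `@task` used in Tribler):

```python
import asyncio
import functools

def task(func):
    # Sketch of a @task-style decorator (illustrative only): the async def
    # becomes a sync callable that schedules the coroutine as a background
    # task on the running event loop and returns the Task handle.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return asyncio.get_running_loop().create_task(func(*args, **kwargs))
    return wrapper

@task
async def check_torrent_health(infohash, timeout=20, scrape_now=False):
    await asyncio.sleep(0)  # placeholder for the real tracker scrape
    return {"infohash": infohash, "seeders": 0, "leechers": 0}

async def main():
    # The call looks synchronous and returns immediately with a Task handle,
    # which is why the IDE (and readers) can be misled about its nature.
    handle = check_torrent_health(b"\x00" * 20)
    print(isinstance(handle, asyncio.Task))  # True
    await handle

asyncio.run(main())
```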
New PR for torrent checking. After the Tribler 7.13 release is done, we can re-measure the popularity community. ToDo @xoriole. The new code deployment will hopefully fix issues and improve the general health and efficiency of popularity info for each swarm.
The graph below shows the number of torrents received (unique & total), total messages, and peers discovered per day by the crawler running the Popularity Community in observer mode for 95 days. The crawler runs with an extended discovery booster, which leads to discovering more torrents.
Let's focus on the latest 7.13 release and re-measure. Can we find 1 million unique swarms in 50 days? How many long-tail swarms are we finding? Did we achieve the key goal of tracking 100k swarms for their (rough) popularity? If we succeeded, we can move on and focus fully on bug fixing, tooling, metadata enrichment, tagging, and semantic search.
@egbertbouman Measure for 24 hours the metadata we get from the network at runtime:
Just did a simple experiment with Tribler running idle for 24h in order to get some idea about which mechanisms in Tribler are primarily responsible for getting torrent information. It turns out that the vast majority of torrents are discovered through the lists of random torrents gossiped within the PopularityCommunity.
When zooming in on the first 10 minutes, we see that a lot of the discovered torrents come from the GigaChannelCommunity (preview torrents). The effect of preview torrents quickly fades, as they are only collected upon the discovery of new channels.
While examining the database after the experiment had completed, it turned out that a little under 80% of all torrents were "free-for-all" torrents, meaning that they did not have a public_key/signature and were put on the network after a user manually requested the torrent info.
Also, if you look at the top channels, they were all last updated 2-3 years ago. That means their creators basically dumped a lot of stuff, then forgot about it and lost interest. But their creations continue to dominate the top, which is now a bunch of :zombie: :zombie: :zombie:
"Free-for-all" torrents gossip is the most basic form of collective authoring, and it trumped over the channels in real usage (e.g., search and top torrents) as @egbertbouman demonstrated above. This means that without a collective editing feature Channels are useless and misleading.
Overall, I'd say Channels must be removed (especially because of their clumsy "Channel Torrent" engine) and replaced with a more bottom-up system, e.g. collectively edited tags.
:astonished: :astonished: :astonished: 650 torrents are collected by Tribler in the first 20 seconds after startup?
When zooming in on the first 10 minutes, we see that a lot of discovered torrents are discovered by the GigaChannelCommunity (preview torrents).
Now we understand a big source of performance loss. GigaChannels is way too aggressive in the first seconds and first minute after startup: no rate control, no limit on I/O or networking (32 torrents/second). It should be smoother and rely first on RemoteSearch results. I believe Tribler will become unresponsive on slow computers with this aggressive content discovery. Great work on the performance analysis! We should do that with all our features.
No fix needed now! Only awareness that we need to monitor our code consistently for performance.
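For illustration, the kind of smoothing suggested above could be as simple as a token bucket in front of the preview-torrent downloads (class name, rates, and capacities below are hypothetical, not existing Tribler code):

```python
import time

class TokenBucket:
    """Hypothetical rate limiter: allow a short burst at startup, then cap
    the sustained rate (e.g. a few preview-torrent downloads per second)."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens proportionally to elapsed time, then spend one
        # token per permitted download.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should defer the download

# e.g. allow a burst of 10 preview torrents, then 4 per second sustained
bucket = TokenBucket(rate=4, capacity=10)
```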
@xoriole it would be great to write a scientific arXiv paper on this, beyond the traditional developer role, also contributing to performance evaluation and the scientific core. See the IPFS example of reporting, tooling, and general pubsub work, plus "GossipSub: Attack-Resilient Message Propagation in the Filecoin and ETH2.0 Networks". Problem description, design, graphs, performance analysis, and the vital role for search plus relevance ranking. Timing: finish before Aug 2024 (end of contract).
The problem remains relevant to this day. I see two main issues and possible directions for improvement:
**Zero-trust architecture is expensive.** If I only trust my own node, I need to check all torrents, which is a costly operation.

**Solution:** Enhance the torrent checker with reputation information. Include the source when you receive information about torrents checked by others. When verifying torrent health, cross-check with other reported sources. If the reported information is significantly inaccurate, downgrade the source's reputation. If the information seems "good enough," slightly increase the source's reputation. This approach enables us to have torrent information enhanced with network health checks, not just checks from our own node.

**Challenge:** Torrent health is always dynamic data. We need a reliable estimation to determine what is significantly inaccurate and what is "good enough."
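The downgrade/upgrade rule described here could look roughly like this (thresholds, step sizes, and the function name are assumptions for illustration, not a deployed design):

```python
def update_reputation(reputation, reported, observed, tolerance=0.5):
    """Hypothetical reputation update: compare a peer's reported seeder
    count with our own observation; downgrade on large error, nudge up
    on rough agreement. All constants are illustrative."""
    # Relative error, guarding against division by zero.
    error = abs(reported - observed) / max(observed, 1)
    if error > tolerance:
        reputation = max(0.0, reputation - 0.1)   # significantly inaccurate
    else:
        reputation = min(1.0, reputation + 0.01)  # "good enough"
    return reputation

rep = 0.5
rep = update_reputation(rep, reported=100, observed=90)   # close -> small boost
rep = update_reputation(rep, reported=1000, observed=90)  # way off -> penalty
```

The asymmetry (slow gains, fast losses) makes it expensive for a peer to build reputation and cheap for us to discard a lying one; how to pick `tolerance` for inherently dynamic swarm sizes remains the open challenge noted above.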
**Biased health checks are not optimal.** Currently, many peers perform redundant work by repeatedly checking the same popular torrents. Can we improve this process? Perhaps we could integrate information about who has already checked the torrents and how much we trust them.

**Simplistic gossip mechanism:** Randomly send information about 5 popular, checked torrents. While this could lead to repetition and over-redundancy, we need to demonstrate that this is indeed a problem (using a graph). With baseline metrics, we can identify opportunities for further improvement.
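A minimal sketch of that simplistic gossip step, assuming each checked torrent is represented as a dict with a seeders count (all names are illustrative):

```python
import random

def select_gossip_sample(checked_torrents, sample_size=5, top_n=20):
    # Rank our already-checked torrents by popularity, then randomly pick
    # a few from the top slice to send to a neighbour. The randomness
    # spreads which popular torrents get gossiped, but does nothing yet
    # to avoid redundancy across peers -- the baseline to measure.
    top = sorted(checked_torrents, key=lambda t: t["seeders"], reverse=True)[:top_n]
    return random.sample(top, min(sample_size, len(top)))

checked = [{"infohash": i, "seeders": i * 3} for i in range(30)]
sample = select_gossip_sample(checked)
print(len(sample))  # 5
```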
@arvidn indicated: tracking popularity is known to be a hard problem.
We deployed the first version into Tribler in #3649, after prior Master's thesis research in #2783. However, we lack documentation or a specification of the deployed protocol.
Key research questions:
Concrete graphs from a single crawl:
Implementation of `on_torrent_health_response(self, source_address, data)`
ToDo @xoriole: document the deployed algorithm in 20+ lines (swarm check algorithm, pub/sub, hash selection algorithm, handshakes, search integration, etc.).