Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

Revise the `PopularityCommunity` metadata retrieval protocol. #7632

Open drew2a opened 8 months ago

drew2a commented 8 months ago

Despite the protocol's apparent simplicity, PopularityCommunity is quite complex as it derives logic from RemoteQueryCommunity: https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/popularity/community/popularity_community.py#L22-L23

This inheritance was introduced in #5736.

The current algorithm for metadata retrieval is as follows: https://github.com/Tribler/tribler/issues/7398#issuecomment-1721263221

If a peer receives torrent health info for a torrent whose metadata is missing, the Popularity Community subsequently requests the missing metadata.

https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/popularity/community/popularity_community.py#L81-L92
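The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the link: `HealthInfo`, `infohashes_missing_metadata`, and the `known_infohashes` set are hypothetical names that stand in for the real payload and database lookup.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class HealthInfo:
    # Hypothetical stand-in for one received torrent health entry.
    infohash: bytes
    seeders: int
    leechers: int


def infohashes_missing_metadata(
    received: Iterable[HealthInfo], known_infohashes: Set[bytes]
) -> List[bytes]:
    """Return the infohashes that have health info but no local metadata.

    Order of first appearance is kept and duplicates are dropped, so each
    missing torrent would be requested only once.
    """
    missing: List[bytes] = []
    seen: Set[bytes] = set()
    for health in received:
        if health.infohash in known_infohashes or health.infohash in seen:
            continue
        seen.add(health.infohash)
        missing.append(health.infohash)
    return missing
```

In the current design, each returned infohash would then be requested from the sending peer through the machinery inherited from RemoteQueryCommunity, which is the part that needs replacing.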

As RemoteQueryCommunity is going to be removed in 8.0.0, we have to replace the algorithm for metadata retrieval.

drew2a commented 8 months ago

As a starting point for discussion, I propose the following algorithm:

Popularity Community operates in this manner:

  1. Upon introduction requests, send information about popular torrents.
  2. Every 5 seconds, choose a random peer and send a torrent health request.
  3. The chosen peer responds with a list of health information.

(Note: Steps 1, 2, and 3 remain unchanged from the current algorithm)

  4. The requester identifies torrents for which knowledge is missing and sends a series of messages requesting this knowledge:

         from dataclasses import dataclass

         @dataclass
         class RequestKnowledgeMessage:
             infohash: str

  5. The chosen peer responds with a series of messages containing the required knowledge.

https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/knowledge/community/knowledge_payload.py#L42-L45
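To make the proposal easier to discuss, here is a minimal sketch of the random peer selection and of the step that turns a health response into knowledge requests. It assumes peers and infohashes are plain strings; `choose_random_peer`, `build_knowledge_requests`, and the `known` set are hypothetical helpers, and no networking or IPv8 API is involved.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Optional, Sequence, Set


@dataclass
class RequestKnowledgeMessage:
    infohash: str


def choose_random_peer(peers: Sequence[str], rng: random.Random) -> Optional[str]:
    """Pick the peer for the next health request, or None if no peers are known."""
    return rng.choice(peers) if peers else None


def build_knowledge_requests(
    health_infohashes: Iterable[str], known: Set[str]
) -> List[RequestKnowledgeMessage]:
    """Create one request per infohash in the health response that lacks local knowledge."""
    requests: List[RequestKnowledgeMessage] = []
    requested: Set[str] = set()
    for infohash in health_infohashes:
        if infohash not in known and infohash not in requested:
            requested.add(infohash)
            requests.append(RequestKnowledgeMessage(infohash))
    return requests
```

A periodic task (every 5 seconds in the description above) would call `choose_random_peer`, send the health request, and pass the infohashes from the response to `build_knowledge_requests`.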

synctext commented 8 months ago

My idea is to first focus on stability, removing Gigachannels, keeping tags, and then radically alter the architecture. Let's not try to fix things which are not broken currently :thinking:

- Remove 2 out of 3 methods for content discovery
  - promote PopularityCommunity as the only way to discover novel hashes
  - remove channel sampling/pre-view and free-for-all channel mechanism
  - Remove in GUI and core at some time or hide
- Release a stable release with this code
- Add a new message inside the PopularityCommunity
  - backwards compatible with older peers
  - new feature of shadow keys and Libtorrent ground truth on swarm size
  - `Query, swarm-clicked, swarm-not-clicked, swarm-clicked-size-as-seen-by-Libtorrent, date, shadow-signature`
- crawl new info
- Web-of-trust: also the rendezvous will start producing limited crawl data
- New privacy-protected ClickLog-based discovery
- New release which starts to utilise the new "ContentDiscovery" community and one-struct-to-rule-them-all
- Further releases (mixing 4 things all into 1 Tribler hopefully :pray: )
  - collecting further data for the Machine Learning Science part
  - collecting further data for web-of-trust
  - collecting further data for tag-based metadata enrichment (content, trust, and queries)
  - end of gigachannels

drew2a commented 8 months ago

Removing 2 out of 3 Content Discovery Methods

During my effort on Friday to eliminate channel sampling/pre-view and the free-for-all channel mechanism, I encountered some obstacles. Even though my initial attempt wasn't successful, I've gained insights into how this can be achieved and can now provide more detailed estimations.

The removal process should begin on the GUI side. This involves:

Once the GUI components are addressed:

Following these changes, a majority of the channels will be eliminated. Any remnants can either be adapted or removed in future refactoring stages.

From my current understanding, these steps should take about 1 week.

qstokkink commented 8 months ago

* Add a new message inside the PopularityCommunity

  * backwards compatible with older peers
  * new feature of shadow keys and Libtorrent ground truth on swarm size
  * `Query, swarm-clicked, swarm-not-clicked, swarm-clicked-size-as-seen-by-Libtorrent, date, shadow-signature`

For this little part of the master plan I have the following implementation in mind:

Once that all works the last remaining step is to update the search results to also make use of the preference relation instead of pure db-based text search.
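For reference, the field list quoted above could be modelled as a single record along these lines. This is only a sketch: the class name `ClickLogEntry`, the snake_case field names, and every field type are assumptions, since the thread specifies nothing beyond the field list itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ClickLogEntry:
    # All types below are assumptions; the thread only names the fields.
    query: str                             # Query
    swarm_clicked: bytes                   # swarm-clicked (assumed: infohash)
    swarm_not_clicked: Tuple[bytes, ...]   # swarm-not-clicked (assumed: several infohashes)
    swarm_clicked_size: int                # swarm-clicked-size-as-seen-by-Libtorrent
    date: int                              # date (assumed: Unix timestamp)
    shadow_signature: bytes                # shadow-signature


# Hypothetical example values, for illustration only.
entry = ClickLogEntry(
    query="ubuntu iso",
    swarm_clicked=b"\x01" * 20,
    swarm_not_clicked=(b"\x02" * 20, b"\x03" * 20),
    swarm_clicked_size=42,
    date=1700000000,
    shadow_signature=b"",
)
```

A frozen dataclass is used here so that an entry cannot be modified after it has been signed; whether the real message should be immutable is a design choice left open by the thread.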

qstokkink commented 6 months ago

A more detailed design (green blocks include the code to add), capturing some insights since my last post:

[Image: architecture diagram]

Changes:

Disclaimer: this design predates writing even a single line of code; it may change as I implement it.