Revise the `PopularityCommunity` metadata retrieval protocol.

drew2a commented 8 months ago

Despite the protocol's apparent simplicity, PopularityCommunity is quite complex as it derives logic from RemoteQueryCommunity: https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/popularity/community/popularity_community.py#L22-L23

This inheritance was implemented in #5736

The current algorithm for metadata retrieval is as follows: https://github.com/Tribler/tribler/issues/7398#issuecomment-1721263221

If a peer receives torrent health info for a torrent whose metadata is missing, the Popularity Community subsequently requests the missing metadata.

https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/popularity/community/popularity_community.py#L81-L92
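
The behaviour described above can be sketched roughly as follows. This is a simplified illustration and not the code behind the link: `HealthInfo`, `find_missing_metadata` and `known_infohashes` are hypothetical names, and health info is reduced to an `(infohash, seeders, leechers)` tuple.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical, simplified health record: (infohash, seeders, leechers).
HealthInfo = Tuple[bytes, int, int]


def find_missing_metadata(received: Iterable[HealthInfo],
                          known_infohashes: Set[bytes]) -> List[bytes]:
    """Return the infohashes in the received health info that lack local metadata.

    The order of first appearance is kept and duplicates are dropped, so each
    missing torrent would be requested at most once per response.
    """
    missing: List[bytes] = []
    seen: Set[bytes] = set()
    for infohash, _seeders, _leechers in received:
        if infohash not in known_infohashes and infohash not in seen:
            seen.add(infohash)
            missing.append(infohash)
    return missing
```

The returned list is what a peer would then ask the sender for; how that request is transported is exactly the part that currently depends on RemoteQueryCommunity.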

Since RemoteQueryCommunity is going to be removed in 8.0.0, we have to replace the algorithm for metadata retrieval.

drew2a commented 8 months ago

As a starting point for discussion, I propose the following algorithm:

Popularity Community operates in this manner:

1. Upon introduction requests, send information about popular torrents.
2. Every 5 seconds, choose a random peer and send a torrent health request.
3. The chosen peer responds with a list of health information.

(Note: Steps 1, 2, and 3 remain unchanged from the current algorithm)

4. The requester identifies torrents for which knowledge is missing and sends a series of messages requesting this knowledge:

   ```python
   from dataclasses import dataclass


   @dataclass
   class RequestKnowledgeMessage:
       # Infohash of the torrent whose knowledge is requested.
       infohash: str
   ```

5. The chosen peer responds with a series of messages containing the required knowledge.

https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/knowledge/community/knowledge_payload.py#L42-L45
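
To make the message flow of steps 2 to 5 concrete, here is a minimal in-memory sketch. It assumes nothing about the real IPv8 transport: `Peer`, `on_health_request`, `on_knowledge_request` and `run_round` are hypothetical names, health is reduced to a seeder count, and knowledge to a list of tag strings.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RequestKnowledgeMessage:
    infohash: str


@dataclass
class Peer:
    # Hypothetical stand-in for a peer; not a Tribler class.
    health: Dict[str, int] = field(default_factory=dict)           # infohash -> seeders
    knowledge: Dict[str, List[str]] = field(default_factory=dict)  # infohash -> tags

    def on_health_request(self) -> Dict[str, int]:
        # Step 3: respond with the health information this peer holds.
        return dict(self.health)

    def on_knowledge_request(self, msg: RequestKnowledgeMessage) -> List[str]:
        # Step 5: respond with the knowledge held for the requested infohash.
        return list(self.knowledge.get(msg.infohash, []))


def run_round(requester: Peer, peers: List[Peer],
              rng: random.Random) -> List[RequestKnowledgeMessage]:
    """One periodic round; the proposal would run this every 5 seconds."""
    chosen = rng.choice(peers)                    # step 2: pick a random peer
    health = chosen.on_health_request()           # step 3: receive health info
    requester.health.update(health)
    requests = [RequestKnowledgeMessage(infohash)  # step 4: ask for what is missing
                for infohash in health
                if infohash not in requester.knowledge]
    for msg in requests:
        answer = chosen.on_knowledge_request(msg)  # step 5: receive knowledge
        if answer:
            requester.knowledge[msg.infohash] = answer
    return requests
```

In this sketch one request is sent per missing infohash, mirroring the "series of messages" wording; batching several infohashes into one message would be a possible variation.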

synctext commented 8 months ago

My idea is to first focus on stability, removing Gigachannels, keeping tags, and then radically alter the architecture. Let's not try to fix things which are not broken currently :thinking:

- Remove 2 out of 3 methods for content discovery
  - promote PopularityCommunity as the only way to discover novel hashes
  - remove channel sampling/pre-view and free-for-all channel mechanism
  - Remove in GUI and core at some time or hide
  - Release a stable release with this code
- Add a new message inside the PopularityCommunity
  - backwards compatible with older peers
  - new feature of shadow keys and Libtorrent ground truth on swarm size
  - ```Query, swarm-clicked, swarm-not-clicked, swarm-clicked-size-as-seen-by-Libtorrent, date, shadow-signature```
  - crawl new info
  - Web-of-trust: also the rendezvous will start producing limited crawl data
- New privacy-protected ClickLog-based discovery
  - New release which starts to utilise the new "ContentDiscovery" community and one-struct-to-rule-them-all
- Further releases (mixing 4 things all into 1 Tribler hopefully :pray: )
  - collecting further data for the Machine Learning Science part
  - collecting further data for web-of-trust
  - collecting further data for tag-based metadata enrichment (content, trust, and queries)
  - end of gigachannels

drew2a commented 8 months ago

Removing 2 out of 3 Content Discovery Methods

During my effort on Friday to eliminate channel sampling/pre-view and the free-for-all channel mechanism, I encountered some obstacles. Even though my initial attempt wasn't successful, I've gained insights into how this can be achieved and can now provide more detailed estimations.

The removal process should begin on the GUI side. This involves:

- Removing visual elements associated with these features.
- Deleting the corresponding models and widgets.

Once the GUI components are addressed:

1. The Gigachannel Manager can be deleted.
2. The Remote Query Community should be detached from the Metadata Store, along with its metadata.db part.
3. Any parts not related to the search can then be stripped from the Gigachannel Community.
4. Finally, the Gigachannel Community can be renamed to Search Community, reflecting its sole focus on search.

Following these changes, a majority of the channels will be eliminated. Any remnants can either be adapted or removed in future refactoring stages.

From my current understanding, these steps should take about 1 week.

qstokkink commented 8 months ago

> * Add a new message inside the PopularityCommunity
>
>   * backwards compatible with older peers
>   * new feature of shadow keys and Libtorrent ground truth on swarm size
>   * `Query, swarm-clicked, swarm-not-clicked, swarm-clicked-size-as-seen-by-Libtorrent, date, shadow-signature`

For this little part of the master plan I have the following implementation in mind:

1. [GUI] Store the last search query in the search widget and its associated top-X (top-10?? -> needs to fit in a UDP packet) results by infohash.
2. [GUI] Whenever a user starts downloading a new torrent with an infohash in the previously-stored results, wait until the download is at least 50% completed and send the info tuple (see quoted reply above) to the core.
3. [CORE] Store the tuples in a/the database and also store a reverse mapping for each torrent (i.e., for a given infohash X store infohash Y that is likely preferable, or X itself).
4. [CORE] Instead of gossiping random torrents, gossip the more preferable torrent Y for a randomly sampled torrent X (using the O(1) reverse mapping in the db). Here we use the "shadow identity" instead of a user's real identity and/or pick a received record signed by another shadow identity and gossip that.

Once that all works the last remaining step is to update the search results to also make use of the preference relation instead of pure db-based text search.
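
The [CORE] steps above could look roughly like this. It is only a sketch under assumptions: `ClickRecord` and `PreferenceStore` are hypothetical names, the field types are guesses based on the quoted tuple, a plain dict stands in for the database, and when several records disagree the most recent one simply wins. Signature creation and verification are left out entirely.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ClickRecord:
    # Field names follow the quoted tuple; the types are assumptions.
    query: str
    swarm_clicked: bytes
    swarms_not_clicked: Tuple[bytes, ...]
    clicked_swarm_size: int   # swarm size as seen by Libtorrent
    date: int                 # e.g. a Unix timestamp
    shadow_signature: bytes


class PreferenceStore:
    """Hypothetical in-memory stand-in for the database in the [CORE] steps."""

    def __init__(self) -> None:
        self.records: List[ClickRecord] = []
        # Reverse mapping: infohash X -> preferable infohash Y (or X itself).
        self.preferred: Dict[bytes, bytes] = {}

    def add(self, record: ClickRecord) -> None:
        self.records.append(record)
        # The clicked swarm is preferred over itself and over every skipped result.
        self.preferred[record.swarm_clicked] = record.swarm_clicked
        for skipped in record.swarms_not_clicked:
            self.preferred[skipped] = record.swarm_clicked

    def pick_for_gossip(self, rng: random.Random) -> Optional[bytes]:
        """Sample a random torrent X and return its preferable torrent Y."""
        if not self.preferred:
            return None
        sampled = rng.choice(sorted(self.preferred))
        return self.preferred[sampled]  # O(1) dict lookup
```

With a real database the reverse mapping would presumably be an indexed table rather than a dict, but the lookup pattern stays the same.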

qstokkink commented 6 months ago

A more detailed design (green blocks include the code to add), capturing some insights since my last post:

[Image: architecture diagram ("arch")]

Changes:

- Because of our endpoint structure, we don't need to touch GUI code - just the endpoints.
- Because this new functionality interfaces with different components, it needs to be in a component itself (named UserActivityComponent above).
- It is easier to start with the torrent_finished_alert in the first version, instead of waiting for a 50% threshold.

Disclaimer: this is still before writing even a single line of code, so the design may change as I implement it.
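
As a rough illustration of the trigger logic in this design (remember the last search, react when a torrent finishes), consider the sketch below. It is not the UserActivityComponent itself: `UserActivityTracker`, `on_search` and `on_torrent_finished` are hypothetical names, and wiring them to the endpoints and to the Libtorrent alert is left out.

```python
from typing import List, Optional, Tuple


class UserActivityTracker:
    """Hypothetical sketch of the search/finished-torrent bookkeeping."""

    def __init__(self, max_results: int = 10) -> None:
        # Only the top results are kept; the record has to fit in a UDP packet.
        self.max_results = max_results
        self.last_query: Optional[str] = None
        self.last_results: List[bytes] = []

    def on_search(self, query: str, result_infohashes: List[bytes]) -> None:
        """Remember the latest query and its top results."""
        self.last_query = query
        self.last_results = list(result_infohashes)[: self.max_results]

    def on_torrent_finished(
        self, infohash: bytes
    ) -> Optional[Tuple[str, bytes, List[bytes]]]:
        """Return (query, clicked, not_clicked) if the torrent came from the last search."""
        if self.last_query is None or infohash not in self.last_results:
            return None
        not_clicked = [h for h in self.last_results if h != infohash]
        return self.last_query, infohash, not_clicked
```

A torrent that finishes but was not among the remembered results yields `None`, so unrelated downloads produce no record.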

Tribler / tribler

Revise the `PopularityCommunity` metadata retrieval protocol. #7632