Closed drew2a closed 9 months ago
Channels are a way of organizing content on the Tribler network. Because a channel is created by a human (and not autogenerated), its hierarchy of directories, and the placement of torrents within them, makes sense at least to its creator, if not to the people who subscribe to it. That organization reflects the creator's human intelligence and likely took some time and effort.
While moving away from channels to the knowledge database, we should try to incorporate the content organization done by users rather than simply erase it.
Following a discussion with @grimadas, it has become clear that MeritRank cannot be used for spam prevention with the current tags/knowledge.
MeritRank requires peers to interact with each other, whether directly or indirectly. Based on these interactions, peer rank can be calculated, and potential spam can be ignored. As demonstrated by @devos50 in his recent research, Tribler's users are not actively generating tags. Consequently, for the average user, there won't be enough paths in the Knowledge Graph to reach another user and rank their knowledge statements (and, therefore, ignore spam).
Thus, the first problem is that we need to encourage users to interact with each other through metadata or content. It remains an open question as to how we can achieve this.
Another issue is that MeritRank was not designed specifically for spam prevention, and its effectiveness in addressing this particular problem is uncertain. @grimadas has taken the time to contemplate this and suggest the best way for MeritRank to be utilized.
"Evil fairy" :smiling_imp: :fairy: :laughing: : kindly reminds: "Pull requests" for channels
Also, https://github.com/Tribler/tribler/issues/6043 , https://github.com/Tribler/tribler/issues/6217 etc.
> MeritRank requires peers to interact with each other, whether directly or indirectly.
Good point. MeritRank assumes a fully connected interaction graph. We could spend a long time making a good one. Let's do a minimal viable connectivity graph and recycle what we had operational before. Verified neighbors are a great idea, lowest-latency peers when bootstrapping, implement this as random sampling during startup, or build an age verification service out of this.
update even more minimal minimal design: :eyes: :eye: Signed record that I have seen you :eye: :eyes:
Simply a cryptographic signed record of who chatted with you. Nothing more. Beyond trivial. After you install Tribler you ask peers to give you a signed record of when you are born. These are your earliest records of interacting with IPv8 communities. Very usable as fully connected graph and foundation for multidimensional reputation mechanism?
update: proof-of-rendezvous. No need to complicate it with birth, just proof that you once existed, according to this signed witness statement.
First draft of division of responsibilities
:boom: 15-day counter is running for next RC4 deployment {Tribler 7.14} of an approved "Pong-certificate PR". Feature: I hereby irrefutably certify that I have gotten a message from this public IPv8 identity/key. No zero-knowledge proof yet of past-pongs :rofl: :boom:
Who | What? |
---|---|
@InvictusRMC | sub-DAO deployment and measure [thesis chapter], 2D MeritRank, Tribler PRO [thesis chapter]: donate, MeritMiners, proof-of-donation, social recognition, and preemption |
@grimadas | Pong-certificate for permissionless Level 0 proof-of-merit. complete design specification. Deployment of MeritRank, measure, stability, correctness |
@mg98 | Metadata enrichment, tagging, MeritRank integration, and measurements [thesis chapter(s)] |
@drew2a | general development and quality control |
When | What? |
---|---|
June | overall design sketch complete |
July | first running IPv8 community with MeritRank |
Sep | Test deployment of minimal viable MeritRank |
Dec 2023 | Production: MeritRank-based tagging and 2D MeritRank |
Dec 2024 | Tribler PRO |
What is the best way to disseminate knowledge? Is the current mechanism sufficient?
We discussed these questions with @grimadas, @InvictusRMC and Georgy and came to understand that the current algorithm of knowledge dissemination results in unpredictable search quality (below is the draft solution of how we can fix it).
The contemporary full-text search algorithm from the most recent release combines results from local search with remote results of sending non-forwarding requests to `20` random neighbors (total amount of neighbors is `50`). It operates quickly, returning relevant results - although these are often related to dead swarms - and it is effective for the search of popular content.
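The fan-out described above can be sketched roughly as follows. This is a minimal illustration, not Tribler's actual search code: only the numbers 20 and 50 come from the text, while `combined_search`, `make_peer` and the substring matching are hypothetical placeholders.

```python
import random
from typing import Callable, Dict, List

# Numbers taken from the text above; everything else is a placeholder.
TOTAL_NEIGHBORS = 50
REMOTE_FANOUT = 20

SearchFn = Callable[[str], List[str]]


def combined_search(query: str, local_search: SearchFn,
                    neighbors: Dict[str, SearchFn],
                    rng: random.Random) -> List[str]:
    """Merge local hits with hits from a random sample of neighbors."""
    results = list(local_search(query))
    # Pick up to REMOTE_FANOUT neighbors; requests are not forwarded further.
    chosen = rng.sample(sorted(neighbors), min(REMOTE_FANOUT, len(neighbors)))
    for peer_id in chosen:
        results.extend(neighbors[peer_id](query))
    # Deduplicate while keeping first-seen order (dicts preserve insertion order).
    return list(dict.fromkeys(results))


def make_peer(known: List[str]) -> SearchFn:
    """Hypothetical peer that does a case-insensitive substring match."""
    return lambda q: [t for t in known if q.lower() in t.lower()]


neighbors = {f"peer{i:02d}": make_peer(["Ubuntu 22.04 ISO"])
             for i in range(TOTAL_NEIGHBORS)}
local = make_peer(["Ubuntu 20.04 ISO", "Debian 12 ISO"])
hits = combined_search("ubuntu", local, neighbors, random.Random(0))
```

Because only a sample of the neighborhood is asked, a title held by few peers can easily be missed, which matches the observation that this search works best for popular content.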
The logic behind the dissemination of channel metadata is intricate. However, the primary methods for acquiring metadata are the following:
Docs: https://tribler.readthedocs.io/en/latest/metadata_store/channels_architecture.htm
Source: https://github.com/Tribler/tribler/tree/main/src/tribler/core/components/metadata_store/remote_query_community
Hence, many nodes in the network possess metadata about popular content due to reasons (1) and (2). As a result, searches for popular content often yield quite relevant results (@ichorid correct me if I'm wrong).
For the `7.14` version we can use the same search algorithm, but with the Knowledge Graph as the information source.
The current algorithm for knowledge dissemination is based on gossip. Information is randomly distributed across the network amongst all peers. This information may not be accessible to a peer when it conducts a search query, as it might not be in its neighborhood or even within the network itself due to the probability of certain peers being offline.
So, nodes with the knowledge about search topic should be in the neighborhood of each node in the network. To reduce disk usage on user machines and network bandwidth, we can restrict this knowledge to only popular torrents. To keep this popular knowledge updated, these nodes could exchange information about search trends. We will refer to these nodes as 'Light Nodes'. By default, all Tribler nodes are 'Light Nodes'.
All Tribler clients are Light Nodes by default.
Light Nodes store only the minimal information necessary to keep the network responsive. They maintain and compute information about trends, and exchange this data with other nodes.
By 'trends', we refer to popular queries.
API:
search(query: str)->List[QueryResult]
get_trends()->List[Trend]
get_metadata(trends: List[Trend])->List[Statement]
The results returned will be subject to the `MeritRank` algorithm for ranking and sybil filtering.
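To make the Light Node idea more concrete, here is a toy sketch in which trends are simply the most frequent queries a node has seen. The class layout, the `merge_trends` exchange step and the `(query, count)` trend format are assumptions for illustration; the design above does not specify them, and MeritRank ranking is left out entirely.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LightNode:
    """Toy Light Node: a small title store plus a query counter."""
    # title -> infohash; stand-in for the node's limited local knowledge
    titles: Dict[str, str] = field(default_factory=dict)
    query_counts: Counter = field(default_factory=Counter)

    def search(self, query: str) -> List[str]:
        # Record the normalized query so that it contributes to trends.
        normalized = query.strip().lower()
        self.query_counts[normalized] += 1
        return [infohash for title, infohash in self.titles.items()
                if normalized in title.lower()]

    def get_trends(self, limit: int = 3) -> List[Tuple[str, int]]:
        # 'Trends' are the most frequent queries seen so far.
        return self.query_counts.most_common(limit)

    def merge_trends(self, remote: List[Tuple[str, int]]) -> None:
        # Exchange step: add the counts reported by another node.
        for query, count in remote:
            self.query_counts[query] += count


a, b = LightNode({"Ubuntu 22.04": "aa11"}), LightNode()
a.search("ubuntu")
a.search("Ubuntu ")
b.search("debian")
b.merge_trends(a.get_trends())
top_trend = b.get_trends()[0]
```

Summing raw counts from other nodes is the simplest possible merge and is trivially spammable; a real exchange would need the ranking and sybil filtering mentioned above.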
There should be an option for a user to switch the type of their node from Light to Full and vice versa.
Since Light Nodes store only a limited amount of knowledge, they should have access to larger knowledge storage to be able to update their knowledge based on changing trends.
Full Nodes store a comprehensive set of metadata. They are capable of responding to both search and metadata queries.
API:
search(query: str)->List[QueryResult]
get_trends()->List[Trend]
get_metadata(trends: List[Trend])->List[Statement]
The results returned will be subject to the `MeritRank` algorithm for ranking and sybil filtering.
The discovery of Full Nodes will be conducted in the same manner as Exit Node discovery.
As an option, we could implement a rule requiring that at least one Full Node should be within each node's neighborhood to provide search results on less popular content.
Questions:
Thank you for that solid design sketch for 7.14! Can you make it more minimal and easier please?
Let's leave the long-tail and rare content search problem out. First do a minimal transition from channels to tags. Minimal viable MeritRank spam prevention. Introducing complexities such as full nodes and metadata-servers is best left until after we have tags, MeritRank, and the existing PopularityCommunity fully working.
For 7.14 we could add simply "rendezvous certificates". Somebody signed that you exist. Temptation is obviously to make this stronger using CRDT plus grow-only counter: Brief Announcement: The Only Undoable CRDTs are Counters. Is that smart @grimadas ?
@synctext
> Can you make it more minimal and easier, please?
Sure. For 7.14, we can proceed as follows:
Statement(subject=infohash, predicate="TITLE", object=title)
Updated Plan for Upcoming Releases
The end of this release may not present noticeable changes for regular Tribler users when comparing version 7.14 to 7.13. The primary search engine and metadata source will continue to be Channels. However, with the extension of Tribler's search API, we as developers can test the new search functionality without disrupting the user experience.
The MeritRank implementation is difficult to plan at this stage as it's not closely integrated with the Knowledge Graph yet. We've decided that spam prevention and reputation features are not priorities for the upcoming releases. As we gather more information on the MeritRank, I'll update the plan accordingly.
For the `7.14` release, we plan to implement a partial metadata migration. During this phase, only titles will be transferred from the `mds.db` to the `knowledge.db` (the necessity of the health migration is still up for debate). After the migration is complete, the `knowledge.db` will contain these types of statements:
We have implemented this migration as an extension of the existing `KnowledgeRulesProcessor`. This approach allows for batch processing of `mds.db` in the background, making the transition virtually unnoticeable to our users and preventing any disruption to normal Tribler usage. Eventually, all titles present at the time the migration began will be moved to `knowledge.db`. This will provide us, as developers, with an opportunity to test the search quality in the real network going forward.
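The background batch migration can be pictured with the sketch below. It is not the `KnowledgeRulesProcessor` code: the two table schemas are hypothetical simplifications of `mds.db` and `knowledge.db`, and `migrate_titles` is an illustrative name. The point is only that each call moves one small batch and returns a resume point, so the work can be spread over many background ticks.

```python
import sqlite3

BATCH_SIZE = 2  # tiny for the demo; a real batch would be much larger


def migrate_titles(mds: sqlite3.Connection, knowledge: sqlite3.Connection,
                   last_rowid: int = 0) -> int:
    """Copy one batch of titles; return the last processed rowid."""
    rows = mds.execute(
        "SELECT rowid, infohash, title FROM torrents WHERE rowid > ? "
        "ORDER BY rowid LIMIT ?", (last_rowid, BATCH_SIZE)).fetchall()
    # Each title becomes a (subject=infohash, predicate='TITLE', object=title) row.
    knowledge.executemany(
        "INSERT OR IGNORE INTO statements (subject, predicate, object) "
        "VALUES (?, 'TITLE', ?)",
        [(infohash, title) for _, infohash, title in rows])
    knowledge.commit()
    return rows[-1][0] if rows else last_rowid


# Hypothetical, simplified schemas; the real databases differ.
mds = sqlite3.connect(":memory:")
mds.execute("CREATE TABLE torrents (infohash TEXT, title TEXT)")
mds.executemany("INSERT INTO torrents VALUES (?, ?)",
                [("aa", "Ubuntu"), ("bb", "Debian"), ("cc", "Fedora")])
knowledge = sqlite3.connect(":memory:")
knowledge.execute("CREATE TABLE statements (subject TEXT, predicate TEXT, "
                  "object TEXT, UNIQUE (subject, predicate, object))")

# Each loop iteration stands for one background tick.
cursor, previous = 0, -1
while cursor != previous:
    previous, cursor = cursor, migrate_titles(mds, knowledge, cursor)
migrated = knowledge.execute("SELECT COUNT(*) FROM statements").fetchone()[0]
```

Persisting the returned rowid between runs would let the migration continue after a restart instead of starting over.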
For the `8.0` release, we plan to carry out a full metadata migration. In addition to transferring new torrent titles, the health information and the relationships between channel items will also be moved to the `knowledge.db`. By "relationships," I mean the hierarchical structure inherent in channels, as well as the information that can be derived from this organizational principle, such as nestedness.
The full metadata migration will be integrated into the Upgrader class and will form part of the standard upgrade procedure.
There's an ongoing discussion about how we should refine the knowledge gossip protocol.
As it stands, our current algorithm proceeds as follows:
This approach was modeled after the Popularity Community gossip algorithm, assuming its long-standing effectiveness would make it a good starting point for the Knowledge Community.
One immediate suggestion to enhance the protocol would be to shift the type of items gossiped. Rather than gossiping single statements, we could share comprehensive information about a specific torrent: its title, tags, and content items. This approach would maintain the integrity of the torrent's information, which could become crucial as we start to use it in the search. However, this is based on intuition and may warrant further evaluation. For instance, this approach might introduce biases favoring statements for torrents with extensive metadata.
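A rough sketch of what gossiping per torrent instead of per statement could look like is given below. The `Statement` tuple mirrors the subject/predicate/object shape used in this thread; the predicate names, `bundle_by_torrent` and `pick_gossip_bundle` are illustrative assumptions, not the Knowledge Community implementation.

```python
import random
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Statement(NamedTuple):
    subject: str    # infohash of the torrent
    predicate: str  # illustrative values: "TITLE", "TAG", "CONTENT_ITEM"
    object: str


def bundle_by_torrent(statements: List[Statement]) -> Dict[str, List[Statement]]:
    """Group statements so that each torrent forms one gossip unit."""
    bundles: Dict[str, List[Statement]] = defaultdict(list)
    for statement in statements:
        bundles[statement.subject].append(statement)
    return dict(bundles)


def pick_gossip_bundle(statements: List[Statement],
                       rng: random.Random) -> List[Statement]:
    """Choose a torrent uniformly at random and return all of its statements."""
    bundles = bundle_by_torrent(statements)
    return bundles[rng.choice(sorted(bundles))]


known = [Statement("aa", "TITLE", "Ubuntu"),
         Statement("aa", "TAG", "linux"),
         Statement("bb", "TITLE", "Debian")]
bundle = pick_gossip_bundle(known, random.Random(1))
```

Sampling uniformly over torrents keeps each torrent equally likely to be picked, but a torrent with many statements still puts more statements on the wire per message, which is one form the bias mentioned above could take.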
Another significant question is whether a dedicated gossip protocol for knowledge statements is necessary at all. My inclination is that we might be able to leverage the existing gossip mechanisms implemented in the Popularity Community. Here's a brief overview of how the high-level algorithm in the Popularity Community operates:
If a peer receives torrent health info for a torrent whose metadata is missing, the Popularity Community subsequently requests the missing metadata. Given that there is already a mechanism in Tribler for disseminating metadata, it might be sufficient to replace the source of metadata from `mds.db` to `knowledge.db`. This change could potentially eliminate the need for a dedicated gossip mechanism in the Knowledge Community.
Ref: https://github.com/Tribler/tribler/issues/3868#issuecomment-1725328884
This method could be synergized with what intuitively appears to be an efficient way to disseminate metadata—namely, including torrent metadata in the Remote Search Results Response. The logic here is straightforward: the more a specific torrent is searched for by Tribler users, the more widely its metadata will be distributed across the network. This mechanism is currently operational within the Channels.
There's ample opportunity for experimentation here. We could simulate a network with a predefined number of agents, assuming that they have complete and accurate metadata. This would allow us to evaluate the proposed algorithm's impact on search efficiency, network fault tolerance, and the speed of metadata dissemination for both popular and random torrents. Conducting this research rigorously would require the full-time commitment of a dedicated scientist.
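As a starting point for such an experiment, a toy model of metadata spreading through search responses might look like this. It is far from a rigorous simulation: every parameter (50 peers, a fan-out of 20, one initial holder per torrent, no churn, no storage limits) is an assumption chosen for illustration, and `simulate` is a hypothetical helper.

```python
import random
from typing import Dict, List, Set


def simulate(searches: Dict[str, int], peers: int = 50, fanout: int = 20,
             seed: int = 0) -> Dict[str, int]:
    """Return, per torrent, how many peers hold its metadata afterwards."""
    rng = random.Random(seed)
    # Each torrent's metadata starts on exactly one peer (peer 0).
    holders: Dict[str, Set[int]] = {torrent: {0} for torrent in searches}
    # Shuffle the workload so popular and rare queries interleave.
    workload: List[str] = [t for t, n in searches.items() for _ in range(n)]
    rng.shuffle(workload)
    for torrent in workload:
        requester = rng.randrange(peers)
        others = [p for p in range(peers) if p != requester]
        asked = rng.sample(others, fanout)
        # The response carries the metadata if any asked peer already holds it.
        if holders[torrent].intersection(asked):
            holders[torrent].add(requester)
    return {torrent: len(held) for torrent, held in holders.items()}


# 200 searches for a popular torrent versus 5 for a rare one.
coverage = simulate({"popular": 200, "rare": 5})
```

In this model a torrent's coverage can grow by at most one peer per search, so a rarely searched torrent stays confined to a handful of peers no matter how the random draws fall.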
Alternatively, we could take a more pragmatic approach by adopting methods that seem effective in existing communities. This would be quicker and easier, but it offers no guarantees that the methods will perform as expected.
The most recent version of the Channels Search Algorithm was introduced in #7025.
It contains two parts:
Remote search could be readily adapted for use with the Knowledge Community, as it is not dependent on the underlying database structure.
Adapting the existing local search algorithm, on the contrary, is challenging due to differences in database structure.
Here are some potential approaches for implementing search within the Knowledge Graph, along with rough time estimates for each:
These are rough estimates and the actual time required may vary.
The following lines were extracted from https://github.com/Tribler/tribler/pull/7608:
@kozlovsky: To implement a fast full text search, we should have an additional entity for the Torrent that combines all fields that we need during the search. I suggest the following entity definition:
class Torrent(db.Entity):
    id = orm.PrimaryKey(int, auto=True)
    resource = orm.Required(lambda: db.Resource, index=True)
    title = orm.Required(str)  # If several titles are provided, the longest title is used here
    tags = orm.Required(str)  # All tags combined into a single string
    # The most recent health information, duplicated from the last TorrentHealth
    seeders = orm.Required(int, default=0)
    leechers = orm.Required(int, default=0)
    last_check = orm.Required(int, default=0)  # in the format of the UNIX timestamp
Basically, it is a denormalized table that keeps all the necessary information for the full-text search. But it is not necessary to add it right now in this PR; we can do it later in a separate PR.
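How one row of such a denormalized table could be derived from statements is sketched below in plain Python, without Pony ORM. The two rules (longest title wins, all tags joined into one string) come from the comments in the entity definition above; `TorrentRow`, `denormalize` and the `(predicate, value)` input format are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class TorrentRow(NamedTuple):
    infohash: str
    title: str   # the longest of the known titles
    tags: str    # all tags joined into one searchable string
    seeders: int
    leechers: int
    last_check: int  # UNIX timestamp


def denormalize(infohash: str, statements: List[Tuple[str, str]],
                health: Tuple[int, int, int] = (0, 0, 0)) -> TorrentRow:
    """Fold (predicate, value) statements for one torrent into a single row."""
    titles = [value for predicate, value in statements if predicate == "TITLE"]
    # Sort the tags so that the combined string is deterministic.
    tags = sorted({value for predicate, value in statements if predicate == "TAG"})
    seeders, leechers, last_check = health
    return TorrentRow(infohash, max(titles, key=len, default=""),
                      " ".join(tags), seeders, leechers, last_check)


row = denormalize("aa",
                  [("TITLE", "Ubuntu"), ("TITLE", "Ubuntu 22.04 LTS"),
                   ("TAG", "linux"), ("TAG", "iso")],
                  health=(12, 3, 1700000000))
```

The cost of this layout is that the row has to be recomputed whenever a title, tag or health record for the torrent changes.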
The `metadata.db` has the following entities:
They could be divided into 4 groups:
Tracker Info | Health info | Torrent/Channels metadata | Misc data |
---|---|---|---|
TrackerState | TorrentState | ChannelNode, MetadataNode, CollectionNode, TorrentMetadata, ChannelMetadata, JsonNode, ChannelDescription, BinaryNode, ChannelThumbnail, ChannelVote, ChannelPeer, Vsids | MiscData |
The final two categories (`Torrent/Channels metadata` and `Misc data`) can be migrated to `knowledge.db` without necessitating any structural changes within the `knowledge.db` itself.
Regarding `Tracker info` and `Health info`, I envision three potential solutions:
`knowledge.db`.
`knowledge.db`.
From a developmental effort perspective, the first option is the most straightforward. However, this choice comes with two significant disadvantages:
The second option seems like a middle ground. While it may not impact speed as drastically as the first, it does blur the lines of abstraction. One could then question why certain knowledge elements are designated as Statements, while others are treated as distinct entities within the same database. This might also raise questions about the database's name (`knowledge.db`) and whether a more general name like `tribler.db` might be more fitting.
The third option is the most abstractly clear, separating different data types into distinct databases. However, it introduces an additional database to the current roster (knowledge.db, bandwidth.db, mds.db (which will eventually be phased out)). In terms of performance, it might not be as efficient as the second option.
No exact database schema definition, `TorrentTag` was defined here by Martijn in Oct 2021.
Our roadmap (Johan's job) and detailed designs are not documented in detail; they have to be extracted from multiple issues, PR details, or the actual code.
The exact database schema definition can be found here: https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py#L77-L144
The `KnowledgeCommunity` message format:
https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/knowledge/community/knowledge_payload.py#L9-L21
https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/knowledge/community/knowledge_payload.py#L43-L45
Complexity is the biggest risk of Tribler failure, hence my obsession with it. Quick list of open issues and priorities (ongoing editing). I'm struggling with what to do first. We should collectively think more about this. We need a clear roadmap, with a clear division of responsibility for each team member in this roadmap. Conflicting priorities:
(new architecture picture also discussed in this PopularityCommunity issue)
The roadmap from the most recent dev meeting:
`7.13.1`. Awaiting closure of the following issues/PRs:
`7.14` (Rendezvous certificates) before Christmas. The flagship feature: #7630.
`7.15` (Channels Retirement). The flagship feature: #7669
`8.0`, with the new Tribler architecture: https://github.com/Tribler/tribler/issues/7632#issuecomment-1778787961.
As the migration is not going to happen (we decided to retain some parts of channels, including the `metadata.db`), I'm closing the issue.
Related:
One brainstorm idea to complete the migration is to also provide tooling for format compatibility and platform compatibility. YouTube Creative Commons content with metadata in the form of tags could be exported and made compatible with tooling of the BitTorrent swarm world. (TikTok compatibility is still out of scope)
ToDo @drew2a : have a 1-2 page design for this together with @mg98. Then we have our own perfect metadata dataset. Goal is perfect metadata + BitTorrent + permissionless augmenting of Knowledge graph + MeritRank. Goal2: have a seedbox with 1 TByte of BitTorrent swarms + perfect metadata. Use if possible the YouTube-8M dataset by Google, which contains: The (multiple) labels per video are Knowledge Graph entities, organized into 24 top-level verticals. Each entity represents a semantic topic that is visually recognizable in video, and the video labels reflect the main topics of each video.
license: The dataset is made available by Google LLC under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This issue describes the migration plan for Tribler's metadata from the Channels to the KnowledgeDatabase. See more information here: #6214
The recent history
Current state
Currently, we have the 7.13 release candidate that includes a migration from the `TagDatabase` to the `KnowledgeDatabase` (as it's a crucial step towards a global knowledge graph): #7070
After this release is successfully launched, we will be ready for the next steps.
The direction for the next step could involve completing the replacement of Channels with the global Knowledge Graph.
Future
Questions should be answered before the big migration:
Should we use Skip Graph?
`7.14`
`mds.db` -> `knowledge.db`
Outcome: Tribler network is ready for testing search by knowledge.
7.14-experimental
Outcome: Search quality evaluation based on users' feedback and our own experience.
7.15
`MeritRank` for the `KnowledgeCommunity`
Outcome:
8.0
Outcome: Tribler network migrated from the Channels to the KnowledgeDatabase.