The big migration: from the Channels to the Knowledge Graph

Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery

https://www.tribler.org

GNU General Public License v3.0


The big migration: from the Channels to the Knowledge Graph #7398

Closed drew2a closed 9 months ago

drew2a commented 1 year ago

This issue describes the plan for migrating Tribler's metadata from the Channels to the KnowledgeDatabase. See more information here: #6214

The recent history

Release 7.11: that was the first release for #6396
Release 7.12: there we introduced #6708 and #6718

Current state

Currently, we have the 7.13 release candidate, which includes a migration from the TagDatabase to the KnowledgeDatabase (a crucial step towards a global knowledge graph): #7070. After this release is successfully launched, we will be ready for the next steps.

The direction for the next step could involve completing the replacement of Channels with the global Knowledge Graph.

Future

Questions that should be answered before the big migration:

Spam prevention: Can we use MeritRank for this purpose? (The answer is "No")
What is the best way to disseminate knowledge? Is the current mechanism sufficient? Should we use the PopularityCommunity for this purpose, or continue using the KnowledgeCommunity? ~~Should we use Skip Graph?~~

7.14

Metadata migration: write a migration for the database: mds.db -> knowledge.db
Alternative remote search: develop a remote search identical to the search by MDS, disabled by default.

Outcome: The Tribler network is ready for testing search by knowledge.

7.14-experimental

Enable remote search by Knowledge by default (as a full replacement for the current mechanism).

Outcome: Search quality evaluation based on users' feedback and our own experience.

7.15

Implement MeritRank for the KnowledgeCommunity

Outcome:

The Tribler network is ready for testing 'search by knowledge' with spam prevention.
Search quality evaluation based on users' feedback and our own experience

8.0

Remove channels from GUI
Remove channels from core

Outcome: Tribler network migrated from the Channels to the KnowledgeDatabase.

xoriole commented 1 year ago

Channels are a way of organizing content on the Tribler network. Because a channel is created by a human (and not autogenerated), the way it is organized into a hierarchy of directories, and the way torrents are placed in those directories, makes sense at least to the channel's creator, if not to the people who subscribe to it. This organization reflects the creator's human intelligence and likely took some time and effort.

While moving away from channels to the knowledge database, we should try to incorporate the content organization done by users rather than simply erase it.

drew2a commented 1 year ago

Following a discussion with @grimadas, it has become clear that MeritRank cannot be used for spam prevention with the current tags/knowledge.

MeritRank requires peers to interact with each other, whether directly or indirectly. Based on these interactions, peer rank can be calculated, and potential spam can be ignored. As demonstrated by @devos50 in his recent research, Tribler's users are not actively generating tags. Consequently, for the average user, there won't be enough paths in the Knowledge Graph to reach another user and rank their knowledge statements (and, therefore, ignore spam).

Thus, the first problem is that we need to encourage users to interact with each other through metadata or content. It remains an open question as to how we can achieve this.

Another issue is that MeritRank was not designed specifically for spam prevention, and its effectiveness in addressing this particular problem is uncertain. @grimadas has taken the time to contemplate this and suggest the best way for MeritRank to be utilized.

ichorid commented 1 year ago

"Evil fairy" :smiling_imp: :fairy: :laughing: : kindly reminds: "Pull requests" for channels

Also, https://github.com/Tribler/tribler/issues/6043 , https://github.com/Tribler/tribler/issues/6217 etc.

(see https://github.com/Tribler/tribler/issues/6481)

synctext commented 1 year ago

MeritRank requires peers to interact with each other, whether directly or indirectly.

Good point. MeritRank assumes a fully connected interaction graph. We could spend a long time making a good one. Let's do a minimal viable connectivity graph and recycle what we had operational before. Verified neighbors are a great idea, lowest latency peers when bootstrapping, implement this as random sampling during startup, or build an age verification service out of this.

Update, an even more minimal design: :eyes: :eye: Signed record that I have seen you :eye: :eyes:

Simply a cryptographic signed record of who chatted with you. Nothing more. Beyond trivial. ~~After you install Tribler you ask peers to give you a signed record of when you are born.~~ These are your earliest records of interacting with IPv8 communities. Very usable as fully connected graph and foundation for multidimensional reputation mechanism?

update: proof-of-rendezvous. No need to complicate it with birth, just proof that you once existed, according to this signed witness statement.

synctext commented 1 year ago

First draft of division of responsibilities

:boom: 15-day counter is running for next RC4 deployment {Tribler 7.14} of an approved "Pong-certificate PR". Feature: I hereby irrefutably certify that I have gotten a message from this public IPv8 identity/key. No zero-knowledge proof yet of past-pongs :rofl: :boom:

| Who | What? |
| --- | --- |
| @InvictusRMC | sub-DAO deployment and measure [thesis chapter], 2D MeritRank, Tribler PRO [thesis chapter]: donate, MeritMiners, proof-of-donation, social recognition, and preemption |
| @grimadas | `Pong-certificate for permissionless Level 0` proof-of-merit. complete design specification. Deployment of MeritRank, measure, stability, correctness |
| @mg98 | Metadata enrichment, tagging, Meritrank integration, and measurements [thesis chapter(s)] |
| @drew2a | general development and quality control |

| When | What? |
| --- | --- |
| June | overall design sketch complete |
| July | first running IPv8 community with MeritRank |
| Sep | Test deployment of minimal viable MeritRank |
| Dec 2023 | Production: Merit-Rank based tagging and 2D MeritRank |
| Dec 2024 | Tribler PRO |

drew2a commented 1 year ago

What is the best way to disseminate knowledge? Is the current mechanism sufficient?

We discussed these questions with @grimadas, @InvictusRMC and Georgy and came to understand that the current algorithm of knowledge dissemination results in unpredictable search quality (below is the draft solution of how we can fix it).

How the current channels search works

The current full-text search algorithm from the most recent release combines local search results with remote results obtained by sending non-forwarding requests to 20 random neighbors (out of a total of 50 neighbors). It operates quickly and returns relevant results, although these are often related to dead swarms, and it is effective for finding popular content.
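The combination step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Tribler's implementation: `combined_search`, the `local_search` and `remote_search` callables, and the deduplication by `infohash` are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional

REMOTE_FANOUT = 20  # neighbors queried per search, as described above


def combined_search(
    query: str,
    local_search: Callable[[str], List[dict]],
    neighbors: List[str],
    remote_search: Callable[[str, str], List[dict]],
    rng: Optional[random.Random] = None,
) -> List[dict]:
    """Merge local results with results from up to 20 random neighbors."""
    rng = rng or random.Random()
    # Local results come first; deduplication by infohash is an assumption.
    merged: Dict[str, dict] = {r["infohash"]: r for r in local_search(query)}
    # Requests are non-forwarding: each neighbor answers from its own database only.
    for peer in rng.sample(neighbors, min(REMOTE_FANOUT, len(neighbors))):
        for result in remote_search(peer, query):
            merged.setdefault(result["infohash"], result)
    return list(merged.values())
```

Because neighbors never forward the request, the reachable metadata is bounded by what those 20 peers happen to store, which is consistent with the observation that popular content is found more reliably.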

The logic behind the dissemination of channel metadata is intricate. However, the primary methods for acquiring metadata are the following:

1. All results from each remote search are added to a local database.
2. Any metadata associated with entities gossiped by the Popularity Community is added to the local database.
3. Channels, along with their previews, are added to the local database in the background during the channel discovery process.
4. If a peer subscribes to a channel, it downloads all its content, which is then added to the local database.

Docs: https://tribler.readthedocs.io/en/latest/metadata_store/channels_architecture.htm
Source: https://github.com/Tribler/tribler/tree/main/src/tribler/core/components/metadata_store/remote_query_community

Hence, many nodes in the network possess metadata about popular content due to reasons (1) and (2). As a result, searches for popular content often yield quite relevant results (@ichorid correct me if I'm wrong).

Proposed search algorithm for the 7.14 version

For the 7.14 version, we can use the same search algorithm, but with the Knowledge Graph as the information source.

The current algorithm for knowledge dissemination is based on gossip. Information is randomly distributed across the network amongst all peers. This information may not be accessible to a peer when it conducts a search query, as it might not be in its neighborhood or even within the network itself due to the probability of certain peers being offline.

So, nodes with knowledge about the search topic should be in the neighborhood of each node in the network. To reduce disk usage on user machines and network bandwidth, we can restrict this knowledge to only popular torrents. To keep this popular knowledge updated, these nodes could exchange information about search trends. We will refer to these nodes as 'Light Nodes'. By default, all Tribler nodes are 'Light Nodes'.

Light Nodes

All Tribler clients are Light Nodes by default.

Light Nodes store only the minimal information necessary to keep the network responsive. They maintain and compute information about trends, and exchange this data with other nodes.

By 'trends', we refer to popular queries.

API:

search(query: str) -> List[QueryResult]
get_trends() -> List[Trend]
get_metadata(trends: List[Trend]) -> List[Statement]

The results returned will be subject to the MeritRank algorithm for ranking and sybil filtering.

There should be an option for a user to switch the type of their node from Light to Full and vice versa.

Full Nodes

Since Light Nodes store only a limited amount of knowledge, they should have access to larger knowledge storage to be able to update their knowledge based on changing trends.

Full Nodes store a comprehensive set of metadata. They are capable of responding to both search and metadata queries.

API:

search(query: str) -> List[QueryResult]
get_trends() -> List[Trend]
get_metadata(trends: List[Trend]) -> List[Statement]

The results returned will be subject to the MeritRank algorithm for ranking and sybil filtering.
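Since Light and Full Nodes expose the same three calls, the shared API could be written as one interface. The sketch below is only an illustration: the field layouts of `Statement`, `Trend`, and `QueryResult`, the `KnowledgeNode` protocol name, and the `InMemoryNode` class are assumptions, trends are assumed to be plain strings (an open question in this proposal), and MeritRank ranking is not modeled.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    object: str


@dataclass(frozen=True)
class Trend:
    query: str  # assumed format: a popular query as a plain string


@dataclass(frozen=True)
class QueryResult:
    infohash: str
    title: str


class KnowledgeNode(Protocol):
    """Interface shared by Light and Full Nodes."""

    def search(self, query: str) -> List[QueryResult]: ...
    def get_trends(self) -> List[Trend]: ...
    def get_metadata(self, trends: List[Trend]) -> List[Statement]: ...


class InMemoryNode:
    """Toy node; Light and Full Nodes would differ only in how much they store."""

    def __init__(self, statements, trends=()):
        self.statements = list(statements)
        self.trends = list(trends)

    def search(self, query: str) -> List[QueryResult]:
        needle = query.lower()
        return [QueryResult(s.subject, s.object) for s in self.statements
                if s.predicate == "TITLE" and needle in s.object.lower()]

    def get_trends(self) -> List[Trend]:
        return list(self.trends)

    def get_metadata(self, trends: List[Trend]) -> List[Statement]:
        # Return every statement about torrents that match any trending query.
        hits = {r.infohash for t in trends for r in self.search(t.query)}
        return [s for s in self.statements if s.subject in hits]
```

A Light Node could refresh its store by calling `get_trends()` on its neighbors and then `get_metadata()` on a Full Node.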

The discovery of Full Nodes will be conducted in the same manner as Exit Node discovery.

As an option, we could implement a rule requiring that at least one Full Node should be within each node's neighborhood to provide search results on less popular content.

Questions:

What would be the most suitable format for trends? Should we use strings, a list of tokens, or something else?
Why do we use new "trends" instead of the old "popular community"?
What algorithm should we implement for disseminating trends?
How much trend-related data and metadata should we store? Should this be limited by size?
What would be the procedure for transitioning from a Full Node to a Light Node, and vice versa?
What is the ideal number of Full Nodes within each peer neighborhood?
How can we validate the search results of queries (it is necessary for MeritRank to work)?
What strategies can we use to incentivize users to transition to Full Nodes?
If we don't have enough Full Nodes in the network, what's our contingency plan?

synctext commented 1 year ago

Thank you for that solid design sketch for 7.14! Can you make it more minimal and easier please?

Let's leave the long-tail and rare content search problem out. First do a minimal transition from channels to tags. Minimal viable MeritRank spam prevention. Introducing complexities such as full nodes and metadata-servers is best left after we have tags, Meritrank, and existing PopularityCommunity fully working.

For 7.14 we could simply add "rendezvous certificates". Somebody signed that you exist. The temptation is obviously to make this stronger using a CRDT plus grow-only counter: Brief Announcement: The Only Undoable CRDTs are Counters. Is that smart @grimadas ?

drew2a commented 1 year ago

@synctext

Can you make it more minimal and easier, please?

Sure. For 7.14, we can proceed as follows:

Migration: Copy all titles from mds.db to knowledge.db using the statement Statement(subject=infohash, predicate="TITLE", object=title)
Gossip: Use PopularityCommunity as the primary source of knowledge gossip (as a byproduct of TorrentHealth gossip)
Search: Apply the same search algorithm to the knowledge as we currently use for the channels
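The migration step above maps each title to one statement. A minimal sketch, assuming the titles are available as `(infohash, title)` pairs; the `Statement` tuple layout and the `titles_to_statements` helper are hypothetical, and only the `Statement(subject=..., predicate="TITLE", object=...)` form comes from the plan itself:

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Statement(NamedTuple):
    subject: str
    predicate: str
    object: str


def titles_to_statements(rows: Iterable[Tuple[str, str]]) -> Iterator[Statement]:
    """Map (infohash, title) rows from mds.db to TITLE statements."""
    for infohash, title in rows:
        if title:  # skipping empty titles is an assumption
            yield Statement(subject=infohash, predicate="TITLE", object=title)
```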

drew2a commented 1 year ago

Updated Plan for Upcoming Releases

7.14

Implement Rendezvous Certificates as the first step toward MeritRank implementation: #7517
Partial metadata migration: #7595
Enhancements in gossip protocol (will be described and discussed later)
Local search that uses the knowledge database
Remote search that uses the knowledge database
Wireframes for the new GUI

This release may not bring noticeable changes for regular Tribler users compared with 7.13. The primary search engine and metadata source will continue to be Channels. However, with the extension of Tribler's search API, we as developers can test the new search functionality without disrupting the user experience.

8.0

Complete metadata migration
Switch to the new search engine
Transition the GUI from being channels-centric to search-centric

The MeritRank implementation is difficult to plan at this stage as it's not closely integrated with the Knowledge Graph yet. We've decided that spam prevention and reputation features are not priorities for the upcoming releases. As we gather more information on MeritRank, I'll update the plan accordingly.

drew2a commented 1 year ago

Migration Strategy

For the 7.14 release, we plan to implement a partial metadata migration. During this phase, only titles will be transferred from the mds.db to the knowledge.db (the necessity of the health migration is still up for debate). After the migration is complete, the knowledge.db will contain these types of statements:

Title (new)
Tag
Content Item
Torrent

We have implemented this migration as an extension of the existing KnowledgeRulesProcessor. This approach allows for batch processing of mds.db in the background, making the transition virtually unnoticeable to our users and preventing any disruption to normal Tribler usage. Eventually, all titles present at the time the migration began will be moved to knowledge.db. This will provide us, as developers, with an opportunity to test the search quality in the real network going forward.
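The batch-wise background processing described above could look roughly like this. It is a sketch under assumptions, not the KnowledgeRulesProcessor code: the `torrents` and `statements` tables, the batch size, and the function names are hypothetical. Fixing `stop` at the start reflects the point that only titles present when the migration began are in scope.

```python
import sqlite3

BATCH_SIZE = 1000  # assumed value; small batches keep each transaction short


def migrate_batch(src, dst, last_rowid, stop_rowid, batch_size=BATCH_SIZE):
    """Copy one batch of titles and return the rowid to resume from."""
    rows = src.execute(
        "SELECT rowid, infohash, title FROM torrents "
        "WHERE rowid > ? AND rowid <= ? ORDER BY rowid LIMIT ?",
        (last_rowid, stop_rowid, batch_size),
    ).fetchall()
    if not rows:
        return last_rowid  # nothing left up to stop_rowid: migration is complete
    with dst:  # commit the whole batch in one transaction
        dst.executemany(
            "INSERT OR IGNORE INTO statements (subject, predicate, object) "
            "VALUES (?, 'TITLE', ?)",
            [(infohash, title) for _, infohash, title in rows if title],
        )
    return rows[-1][0]


def migrate_all(src, dst, batch_size=BATCH_SIZE):
    """Drive batches to completion; a real client would pause between batches."""
    # Only rows present when the migration starts are in scope.
    stop = src.execute("SELECT COALESCE(MAX(rowid), 0) FROM torrents").fetchone()[0]
    last = 0
    while True:
        new_last = migrate_batch(src, dst, last, stop, batch_size)
        if new_last == last:
            return
        last = new_last
```

Persisting the returned rowid between runs would let the migration resume after a restart, and `INSERT OR IGNORE` (with a uniqueness constraint on the target table) makes a repeated batch harmless.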

For the 8.0 release, we plan to carry out a full metadata migration. In addition to transferring new torrent titles, the health information and the relationships between channel items will also be moved to the knowledge.db. By "relationships," I mean the hierarchical structure inherent in channels, as well as the information that can be derived from this organizational principle, such as nestedness.

The full metadata migration will be integrated into the Upgrader class and will form part of the standard upgrade procedure.

drew2a commented 1 year ago

Enhancements in Gossip Protocol

There's an ongoing discussion about how we should refine the knowledge gossip protocol.

Current Algorithm

As it stands, our current algorithm proceeds as follows:

Every 5 seconds, select a random peer and send a knowledge request to it.
Upon receiving a knowledge request, send 10 random statements to the requesting peer.

This approach was modeled after the Popularity Community gossip algorithm, assuming its long-standing effectiveness would make it a good starting point for the Knowledge Community.
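The two rules above can be simulated in a few lines. This is a toy model rather than the KnowledgeCommunity code: `GossipPeer` and its methods are hypothetical names, statements are plain tuples, and networking and timers are left out.

```python
import random
from typing import List, Optional, Set, Tuple

Statement = Tuple[str, str, str]  # (subject, predicate, object)
REQUEST_INTERVAL = 5.0            # seconds between requests; not used by this simulation
STATEMENTS_PER_RESPONSE = 10


class GossipPeer:
    def __init__(self, name: str, statements=(), rng: Optional[random.Random] = None):
        self.name = name
        self.statements: Set[Statement] = set(statements)
        self.rng = rng or random.Random()

    def on_request(self) -> List[Statement]:
        """Answer a knowledge request with up to 10 random statements."""
        pool = sorted(self.statements)  # random.sample needs a sequence, not a set
        return self.rng.sample(pool, min(STATEMENTS_PER_RESPONSE, len(pool)))

    def tick(self, neighbors: List["GossipPeer"]) -> None:
        """One gossip round; the real client would run this every 5 seconds."""
        if not neighbors:
            return
        peer = self.rng.choice(neighbors)
        self.statements.update(peer.on_request())
```

Each round transfers at most 10 statements, so a peer needs many rounds to collect a large knowledge base, and what it holds at any given moment is a random subset.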

Potential Improvements

One immediate suggestion to enhance the protocol would be to shift the type of items gossiped. Rather than gossiping single statements, we could share comprehensive information about a specific torrent: its title, tags, and content items. This approach would maintain the integrity of the torrent's information, which could become crucial as we start using it in search. However, this is based on intuition and may warrant further evaluation. For instance, this approach might introduce biases favoring statements for torrents with extensive metadata.
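A per-torrent bundle could be assembled as in the sketch below. It assumes that the torrent's infohash is the subject of every statement about it, and both helper names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Statement = Tuple[str, str, str]  # (subject, predicate, object)


def bundle_by_torrent(statements: Iterable[Statement]) -> Dict[str, List[Statement]]:
    """Group statements by subject, assumed here to be the torrent's infohash."""
    bundles: Dict[str, List[Statement]] = defaultdict(list)
    for statement in statements:
        bundles[statement[0]].append(statement)
    return dict(bundles)


def pick_bundle(statements: Iterable[Statement],
                rng: Optional[random.Random] = None) -> List[Statement]:
    """Pick one torrent uniformly at random and return all of its statements."""
    rng = rng or random.Random()
    bundles = bundle_by_torrent(statements)
    if not bundles:
        return []
    return bundles[rng.choice(sorted(bundles))]
```

In this variant every torrent is equally likely to be chosen, but a torrent with many tags contributes more statements per response, which is one way the bias mentioned above could show up.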

Do We Need a Separate Gossip Protocol?

Another significant question is whether a dedicated gossip protocol for knowledge statements is necessary at all. My inclination is that we might be able to leverage the existing gossip mechanisms implemented in the Popularity Community. Here's a brief overview of how the high-level algorithm in the Popularity Community operates:

Send information about popular torrents upon introduction requests.
Every 5 seconds, select a random peer and send a torrent health request to it.

If a peer receives torrent health info for a torrent whose metadata is missing, the Popularity Community subsequently requests the missing metadata. Given that there is already a mechanism in Tribler for disseminating metadata, it might be sufficient to replace the source of metadata from mds.db to knowledge.db. This change could potentially eliminate the need for a dedicated gossip mechanism in the Knowledge Community.

Ref: https://github.com/Tribler/tribler/issues/3868#issuecomment-1725328884

This method could be synergized with what intuitively appears to be an efficient way to disseminate metadata—namely, including torrent metadata in the Remote Search Results Response. The logic here is straightforward: the more a specific torrent is searched for by Tribler users, the more widely its metadata will be distributed across the network. This mechanism is currently operational within the Channels.

The Experimental Approach vs. The Pragmatic Approach

There's ample opportunity for experimentation here. We could simulate a network with a predefined number of agents, assuming that they have complete and accurate metadata. This would allow us to evaluate the proposed algorithm's impact on search efficiency, network fault tolerance, and the speed of metadata dissemination for both popular and random torrents. Conducting this research rigorously would require the full-time commitment of a dedicated scientist.

Alternatively, we could take a more pragmatic approach by adopting methods that seem effective in existing communities. This would be quicker and easier, but it offers no guarantees that the methods will perform as expected.

drew2a commented 1 year ago

The search algorithm

The most recent version of the Channels Search Algorithm was introduced in #7025.

It contains two parts:

Local Search
Remote Search

Remote search could be readily adapted for use with the Knowledge Community, as it is not dependent on the underlying database structure.

Adapting the existing local search algorithm, on the contrary, is challenging due to differences in database structure.

Here are some potential approaches for implementing search within the Knowledge Graph, along with rough time estimates for each:

Implement Full-Text Search (FTS) for titles using SQL (Estimated time: 3 days).
Adapt the latest algorithm from @kozlovsky, excluding health info (Estimated time: 1 week).
Fully adapt the latest algorithm from @kozlovsky, including health info (Estimated time: 1-3 months).
Develop a completely new algorithm tailored to the graph-based structure of the Knowledge Graph. This approach could necessitate switching from an SQL database to a specialized graph database (Estimated time: Unknown, requires further analysis).

These are rough estimates and the actual time required may vary.
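The first option, full-text search over titles in SQL, can be sketched with SQLite's FTS modules. This is an illustration only: the `title_fts` table and both function names are hypothetical, result ranking is omitted, and the fallback to FTS4 exists solely because not every SQLite build ships FTS5.

```python
import sqlite3
from typing import List

# The infohash column is stored but kept out of the full-text index.
_DDL = {
    "fts5": "CREATE VIRTUAL TABLE title_fts USING fts5(infohash UNINDEXED, title)",
    "fts4": "CREATE VIRTUAL TABLE title_fts USING fts4(infohash, title, notindexed=infohash)",
}


def create_title_index(conn: sqlite3.Connection) -> str:
    """Create the FTS table, preferring FTS5 and falling back to FTS4."""
    for module, ddl in _DDL.items():
        try:
            conn.execute(ddl)
            return module
        except sqlite3.OperationalError:
            continue  # this SQLite build lacks the module
    raise RuntimeError("SQLite was built without full-text search support")


def search_titles(conn: sqlite3.Connection, query: str, limit: int = 50) -> List[str]:
    """Return infohashes whose title contains every word of the query."""
    # Quote each word so user input is treated as plain terms, not FTS syntax.
    words = [w.replace('"', "") for w in query.split()]
    terms = " ".join(f'"{w}"' for w in words if w)
    if not terms:
        return []
    rows = conn.execute(
        "SELECT infohash FROM title_fts WHERE title_fts MATCH ? LIMIT ?",
        (terms, limit),
    )
    return [row[0] for row in rows]
```

For example, after inserting a row titled "Ubuntu 22.04 Desktop ISO", the query "ubuntu iso" matches it because both words occur in the title, regardless of case.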

DB scheme

The following lines were extracted from https://github.com/Tribler/tribler/pull/7608:

@kozlovsky: To implement a fast full text search, we should have an additional entity for the Torrent that combines all fields that we need during the search. I suggest the following entity definition:

    from pony import orm

    # `db` is the existing orm.Database instance; db.Resource is defined elsewhere in the schema.
    class Torrent(db.Entity):
        id = orm.PrimaryKey(int, auto=True)
        resource = orm.Required(lambda: db.Resource, index=True)
        title = orm.Required(str)  # If several titles are provided, the longest title is used here
        tags = orm.Required(str)  # All tags combined into a single string

        # The most recent health information, duplicated from the last TorrentHealth
        seeders = orm.Required(int, default=0)
        leechers = orm.Required(int, default=0)
        last_check = orm.Required(int, default=0)  # in the format of the UNIX timestamp

Basically, it is a denormalized table that keeps all the necessary information for the full-text search. But it is not necessary to add it right now in this PR; we can do it later in a separate PR.
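How such a denormalized row might be derived from existing statements is sketched below. Only the two rules in the entity's comments (the longest title wins, all tags are combined into one string) come from the suggestion; the `TorrentRow` type, the `denormalize` helper, the `"TITLE"`/`"TAG"` predicate strings, and the space-separated tag format are assumptions.

```python
from typing import Iterable, NamedTuple, Tuple

Statement = Tuple[str, str, str]  # (subject, predicate, object)


class TorrentRow(NamedTuple):
    infohash: str
    title: str
    tags: str


def denormalize(infohash: str, statements: Iterable[Statement]) -> TorrentRow:
    """Collapse all statements about one torrent into a single searchable row."""
    own = [s for s in statements if s[0] == infohash]
    titles = [obj for _, predicate, obj in own if predicate == "TITLE"]
    tags = sorted({obj for _, predicate, obj in own if predicate == "TAG"})
    # If several titles are provided, the longest one is used.
    title = max(titles, key=len, default="")
    return TorrentRow(infohash, title, " ".join(tags))
```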

drew2a commented 1 year ago

DB scheme

The metadata.db has the following entities:

MiscData
TrackerState
TorrentState
ChannelNode
MetadataNode
CollectionNode
TorrentMetadata
ChannelMetadata
JsonNode
ChannelDescription
BinaryNode
ChannelThumbnail
ChannelVote
ChannelPeer
Vsids

They could be divided into 4 groups:

| Tracker info | Health info | Torrent/Channels metadata | Misc data |
| --- | --- | --- | --- |
| TrackerState | TorrentState | ChannelNode, MetadataNode, CollectionNode, TorrentMetadata, ChannelMetadata, JsonNode, ChannelDescription, BinaryNode, ChannelThumbnail, ChannelVote, ChannelPeer, Vsids | MiscData |

The final two categories (Torrent/Channels metadata and Misc data) can be migrated to knowledge.db without necessitating any structural changes within the knowledge.db itself.

Regarding Tracker info and Health info, I envision three potential solutions:

1. Treat them as knowledge entities within the knowledge.db.
2. Maintain them as distinct tables inside the knowledge.db.
3. Allocate them to a separate database.

From a developmental effort perspective, the first option is the most straightforward. However, this choice comes with two significant disadvantages:

The search speed may considerably decline due to intricate database queries.
Operations such as inserting/retrieving torrent health and tracker information might become more complex and time-consuming.

The second option seems like a middle ground. While it may not impact speed as drastically as the first, it does blur the lines of abstraction. One could then question why certain knowledge elements are designated as Statements, while others are treated as distinct entities within the same database. This might also raise questions about the database's name (knowledge.db) and if a more general name like tribler.db might be more fitting.

The third option is the most abstractly clear, separating different data types into distinct databases. However, it introduces an additional database to the current roster (knowledge.db, bandwidth.db, mds.db (which will eventually be phased out)). In terms of performance, it might not be as efficient as the second option.

synctext commented 1 year ago

There is no exact database schema definition; TorrentTag was defined here by Martijn in Oct 2021. Our roadmap (Johan job) and detailed designs are not documented in detail; they have to be extracted from multiple issues + PR details or the actual code.

drew2a commented 1 year ago

The exact database schema definition can be found here: https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py#L77-L144

The KnowledgeCommunity message format: https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/knowledge/community/knowledge_payload.py#L9-L21 https://github.com/Tribler/tribler/blob/20fb22453c8c5349c6de4b8e679a5b590792b7df/src/tribler/core/components/knowledge/community/knowledge_payload.py#L43-L45

synctext commented 1 year ago

Complexity is the biggest risk of Tribler failure, hence my obsession with it. Quick list of open issues and priorities (ongoing editing). I'm struggling with what to do first. We should collectively think more about this. We need a clear roadmap, with a clear division of responsibility for each team member in this roadmap. Conflicting priorities:

Improve stability. Fix or close the 150+ open Sentry issues. Actual faults reported by actual users.
Learn-to-rank distributed machine learning (@pneague @mg98)
- great disruption of the Tribler code, this changes everything (impact stability)
- make it as simple as possible for integration, leave improvements and tweaks for future.
- gossip learning into production (world-first innovation, bridging the gap between academic research from 2011 and the real-world of 2024 in only 13 years :rofl:)
- Future steps privacy enhancement, security based on meritrank
new socket and rust module (@egbertbouman)
removal of GigaChannel
"soft" migration of knowledge DB ('might take until summer', this is what triggered my reaction. We're a tiny dev team, solid scheduling is needed).
Finally, distributed trust. Reward past effort with priority service and social recognition. Tried many times since 1998 (PageRank patents), always failed. (@InvictusRMC @grimadas)
{update} Really final is a new threading model. Async model of small tasks on single thread conflicts with huge few second rare DB queries. Blocks network socket :boom:

(new architecture picture also discussed in this PopularityCommunity issue)

drew2a commented 11 months ago

The roadmap from the most recent dev meeting:

Shipment of 7.13.1. Awaiting closure of the following issues/PRs:
- #7660
- #7628
- #7647
Shipment of 7.14 (Rendezvous certificates) before Christmas. The flagship feature: #7630.
Shipment of 7.15 (Channels Retirement). The flagship feature: #7669
Shipment of 8.0, with the new Tribler architecture: https://github.com/Tribler/tribler/issues/7632#issuecomment-1778787961.

drew2a commented 9 months ago

As the migration is not going to happen (we decided to retain some parts of channels, including the metadata.db), I'm closing the issue.

#7669

synctext commented 9 months ago

One brainstorm idea to complete the migration is to also provide tooling for format compatibility and platform compatibility. YouTube Creative Commons content with metadata in the form of tags could be exported and made compatible with the tooling of the BitTorrent swarm world. (TikTok compatibility is still out of scope.)

ToDo @drew2a: write a 1-2 page design for this together with @mg98. Then we have our own perfect metadata dataset. Goal: perfect metadata + BitTorrent + permissionless augmenting of the Knowledge Graph + MeritRank. Goal 2: have a seedbox with 1 TByte of BitTorrent swarms + perfect metadata.

If possible, use the YouTube-8M dataset by Google, which contains: "The (multiple) labels per video are Knowledge Graph entities, organized into 24 top-level verticals. Each entity represents a semantic topic that is visually recognizable in video, and the video labels reflect the main topics of each video." License: "The dataset is made available by Google LLC. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license."
