Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

Tag based content discovery and crowdsourcing #6214

Closed ichorid closed 4 weeks ago

ichorid commented 3 years ago

After a thorough discussion, we came to the following architecture for the Tags system:

grimadas commented 3 years ago

Comments regarding the tag-based organisation and why we need this in the first place:

Why tag-centric organisation?

What is tag-centric even?

Tag-centric organization is people, content, metadata.

The vision for Tribler:

synctext commented 3 years ago

This issue redoes work that was already conducted 14 years ago and failed at the time. This is a big task, beyond a single developer. Read the full article about this architecture: image The exact hard-core science details: image The complete PhD thesis (202 pages) has many more details. The key piece that is missing is the post-mortem of why this failed to move beyond prototyping. Sadly, that's only in my head.

Our 'recent' tag work of only 10 years ago. High performance implementation of tag-based discovery: https://dl.acm.org/doi/abs/10.1145/2063576.2063852 image

drew2a commented 3 years ago

Draft Wireframes for tags (updated 10.09.21):

Tags drawio

Source: https://drew2a.notion.site/Tags-c1567365a1c94ce78271257f0aa19b06

synctext commented 3 years ago

Solid progress. The tag suggestion box is indeed the hardest to make. Over the years it needs to evolve to understand/expand synonyms, like Stack Overflow does. https://stackoverflow.com/tags/synonyms?tab=Renames&dir=Descending&filter=Active

devos50 commented 3 years ago

Trying out something new here. When it comes to interaction design, it is common to work with user stories that help to think about the user. Even after writing the simple user stories below, I feel that they are very helpful in getting rid of the bias towards the developer that we most likely all have. Feel free to improve/give feedback on the following user stories.

Personas

I can think of two personas:

Note: each user story should be clear, feasible, and testable; also see this article.

Note 2: these user stories focus exclusively on tag visibility and moderation, and exclude stories for tag navigation and searching.

devos50 commented 3 years ago

After some discussions and mock-ups, here's a GUI preview of the resolution of user story 1:

Schermafbeelding 2021-09-28 om 15 31 01

Note that I use the "GUI test mode" for prototyping so the tags/titles do not make sense yet. I'm hovering over the 'edit' button in the first row. Clicking on the pencil will bring up a dialog where a user can suggest/remove tags (we reached majority consensus on using a dialog), but that dialog is not ready yet. Color scheme/margins/paddings/sizes have not been finalized yet.

devos50 commented 3 years ago

We made a few design decisions:

devos50 commented 3 years ago

To resolve user stories 1 and 2, we made the following dialog (still very subject to change):

Schermafbeelding 2021-09-29 om 16 34 41

Update: added an additional setting to disable tag visibility (checkbox is disabled by default):

Schermafbeelding 2021-09-30 om 15 57 14

synctext commented 3 years ago

Something to think about. Do we want to build a community of "taggers" after launch of 7.11? Or let our users know using TorrentFreak after a few months of more iterations and improvements. The following Creative Commons music tags are available for bootstrap purposes (see other Tribler music issues):

devos50 commented 3 years ago

Do we want to build a community of "taggers" after launch of 7.11? Or let our users know using TorrentFreak after a few months of more iterations and improvements.

Building a community would be great, but would probably require more work beyond a minimal version (e.g., making the contributions of a particular user visible). So let's first iterate and improve the current system.

drew2a commented 2 years ago

Design decisions behind the DB:

  1. We will use a separate database: https://github.com/drew2a/tribler/blob/feature/tag_community/src/tribler-core/tribler_core/components/tag/db/tag_db.py
  2. We will store not only tags but also user operations on tags (only one operation per user per torrent tag):

        class TorrentTagOp(db.Entity):
            id = orm.PrimaryKey(int, auto=True)
    
            torrent_tag = orm.Required(lambda: TorrentTag)
            peer = orm.Required(lambda: Peer)
    
            operation = orm.Required(int)
            time = orm.Required(int)
            signature = orm.Required(bytes)
            updated_at = orm.Required(datetime.datetime, default=datetime.datetime.utcnow)
    
            orm.composite_key(torrent_tag, peer)
  3. We will use two counters to access information about added and removed tags without aggregation:
        class TorrentTag(db.Entity):
            ...
            added_count = orm.Required(int, default=0)
            removed_count = orm.Required(int, default=0)
  4. We will use an indicator to distinguish local user operations and remote user operations:
        class TorrentTag(db.Entity):
            ...
            local_operation = orm.Optional(int) 

cc: @kozlovsky
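
To make the database decisions above concrete, here is a minimal sketch using only the standard `sqlite3` module instead of Pony ORM. The table layout, the `ADD`/`REMOVE` codes, and the helper names `apply_operation` and `counters` are assumptions made for illustration; they are not taken from the linked `tag_db.py`. The sketch shows how a composite unique key keeps one operation per peer per torrent tag, and how the two counters can be read without aggregation.

```python
import sqlite3

ADD, REMOVE = 1, 2  # assumed operation codes for this sketch

SCHEMA = """
CREATE TABLE torrent_tag (
    id INTEGER PRIMARY KEY,
    infohash TEXT NOT NULL,
    tag TEXT NOT NULL,
    added_count INTEGER NOT NULL DEFAULT 0,
    removed_count INTEGER NOT NULL DEFAULT 0,
    UNIQUE (infohash, tag)
);
CREATE TABLE torrent_tag_op (
    id INTEGER PRIMARY KEY,
    torrent_tag INTEGER NOT NULL REFERENCES torrent_tag (id),
    peer TEXT NOT NULL,
    operation INTEGER NOT NULL,
    UNIQUE (torrent_tag, peer)  -- one operation per peer per torrent tag
);
"""


def apply_operation(conn, infohash, tag, peer, operation):
    """Record a peer's latest operation and keep both counters in sync."""
    conn.execute(
        "INSERT OR IGNORE INTO torrent_tag (infohash, tag) VALUES (?, ?)",
        (infohash, tag),
    )
    (tag_id,) = conn.execute(
        "SELECT id FROM torrent_tag WHERE infohash = ? AND tag = ?",
        (infohash, tag),
    ).fetchone()
    row = conn.execute(
        "SELECT operation FROM torrent_tag_op WHERE torrent_tag = ? AND peer = ?",
        (tag_id, peer),
    ).fetchone()
    previous = row[0] if row else None
    if previous == operation:
        return  # nothing changed for this peer
    # The composite unique key lets us replace the peer's previous operation.
    conn.execute(
        "INSERT OR REPLACE INTO torrent_tag_op (torrent_tag, peer, operation) "
        "VALUES (?, ?, ?)",
        (tag_id, peer, operation),
    )
    conn.execute(
        "UPDATE torrent_tag SET added_count = added_count + ?, "
        "removed_count = removed_count + ? WHERE id = ?",
        (
            int(operation == ADD) - int(previous == ADD),
            int(operation == REMOVE) - int(previous == REMOVE),
            tag_id,
        ),
    )


def counters(conn, infohash, tag):
    """Read the counters directly, without aggregating over operations."""
    return conn.execute(
        "SELECT added_count, removed_count FROM torrent_tag "
        "WHERE infohash = ? AND tag = ?",
        (infohash, tag),
    ).fetchone()
```

When a peer changes its mind (for example, from add to remove), the old operation is replaced and both counters are adjusted, so the number of stored operations per tag never exceeds the number of peers.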

ichorid commented 2 years ago

@kozlovsky , wouldn't using a separate DB make it impossible to do complex queries involving both Metadata store data and Tags data?

kozlovsky commented 2 years ago

@ichorid I think that with a separate tag database, the development of an initial version of the tag-based system may be easier. Regarding queries, with our current approach for FTS search, there should be no difference between a single database and two separate databases. If necessary, we can combine the databases later, or even just attach the tag database to the metadata store DB.
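
As an illustration of the "attach" option, the sketch below uses SQLite's `ATTACH DATABASE` statement through Python's standard `sqlite3` module. The table and column names (`torrent`, `torrent_tag`) are hypothetical and much simpler than the real schemas; the point is only that a single query can join data from both files once one database is attached to the other.

```python
import sqlite3


def open_combined(metadata_path=":memory:", tags_path=":memory:"):
    """Open the metadata DB and attach the tag DB under the schema name 'tags'."""
    conn = sqlite3.connect(metadata_path)
    conn.execute("ATTACH DATABASE ? AS tags", (tags_path,))
    return conn


def create_demo_tables(conn):
    # Hypothetical, heavily simplified tables; one per database.
    conn.execute("CREATE TABLE main.torrent (infohash TEXT PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE tags.torrent_tag (infohash TEXT, tag TEXT)")


def titles_with_tag(conn, tag):
    """Join across both databases in a single query."""
    rows = conn.execute(
        "SELECT t.title FROM main.torrent AS t "
        "JOIN tags.torrent_tag AS tt ON tt.infohash = t.infohash "
        "WHERE tt.tag = ? ORDER BY t.title",
        (tag,),
    )
    return [title for (title,) in rows]
```

With real files, the two paths would point at the metadata store and the tag database respectively; the defaults here create two independent in-memory databases for experimentation.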

devos50 commented 2 years ago

Tag Reinforcement

Not sure if the suggestion below is applicable/suitable for the first version, but it is open for discussion.

Problem: To address the most trivial poisoning attacks, we decided that a particular tag will only be displayed when two identities have suggested it (thresholding). However, the chance that two users independently come up with the same tags for the same content is rather low. Even with a threshold of 2, I predict that much content will remain visibly untagged.

Potential solution: We can help the user by showing tags that have been suggested by other users but don't have enough support yet (i.e., haven't reached the threshold). This indication (e.g., "Suggestions: X, Y, Z" or "Suggested by others: A, B, C") should be part of the dialog where a user can add/remove tags, for example, below the input field. To prevent visual clutter, we should limit the number of suggestions shown.
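
A rough sketch of how such a suggestion list could be computed is shown below. The function name, the `support` mapping (tag to number of distinct identities that suggested it), and the default `limit` are assumptions for illustration only; the threshold of 2 comes from the discussion above.

```python
from typing import Dict, Iterable, List


def suggestions_for_dialog(
    support: Dict[str, int],
    own_tags: Iterable[str] = (),
    threshold: int = 2,
    limit: int = 3,
) -> List[str]:
    """Return tags that others suggested but that are still below the threshold."""
    own = set(own_tags)
    candidates = [
        (count, tag)
        for tag, count in support.items()
        if 0 < count < threshold and tag not in own
    ]
    # Most-supported first, then alphabetical for a stable order.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    # Cap the list to avoid visual clutter in the dialog.
    return [tag for _, tag in candidates[:limit]]
```

Tags the local user already suggested are skipped, since repeating them would not add support.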

qstokkink commented 2 years ago

If you need inspiration, there have been some academic works that look at the tag reinforcement of user-generated tags in the Steam Tags system (e.g., http://dx.doi.org/10.1145/3377290.3377300).

devos50 commented 2 years ago

I opened a PR to existing PR #6396 with the implementation of the GUI elements, the REST API endpoints, and additional tests.

drew2a commented 2 years ago

Design decisions behind the Community:

  1. Tag operations are gossiped transitively across the network
  2. Pull-based gossip is used
  3. IPv8 signatures are used to check the integrity of tags

The simplified algorithm is: "Request 10 random tags from a random peer every 5 seconds"
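
The following toy simulation illustrates why pull-based gossip spreads operations transitively. It is not IPv8 code: the `Peer` class, `run_round`, and the representation of operations as plain tuples are all assumptions for this sketch, and the 5-second timer is replaced by explicit rounds.

```python
import random


class Peer:
    """A simulated peer that stores the tag operations it knows about."""

    def __init__(self, name, operations=()):
        self.name = name
        self.operations = set(operations)

    def on_request(self, count, rng):
        """Answer a pull request with up to `count` random known operations."""
        known = sorted(self.operations)
        return rng.sample(known, min(count, len(known)))

    def pull_from(self, other, count, rng):
        """Request operations from another peer and store the response."""
        self.operations.update(other.on_request(count, rng))


def run_round(peers, rng, count=10):
    """One gossip tick: every peer pulls from one random other peer."""
    for peer in peers:
        other = rng.choice([p for p in peers if p is not peer])
        peer.pull_from(other, count, rng)
```

Because every peer also serves what it has previously pulled, an operation created by one peer eventually reaches peers that never contacted its creator directly.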

devos50 commented 2 years ago

While #6396 is being finished, I opened a parallel PR with the GUI elements.

devos50 commented 2 years ago

Weak confirmation of tags

Probably not for the first version of this system but nonetheless an interesting topic for discussion/thought.

Assume that an item contains tags A, B, and C. A user removes tag C and adds tag D. We now create two new messages that will be spread across the network, one for the removal of tag C and one for the addition of tag D. While this is a viable approach, it hides potentially valuable information, namely that the user might believe that tags A and B correctly describe the item. I can think of two reasons why a user would leave particular tags as they are (while editing other ones):

Because of this uncertainty, I refer to this phenomenon as a weak confirmation of tags. In particular, weak confirmation of tags happens when a user leaves particular tags as they are while editing other tags.

At some point, we could consider adding this information to the messages we spread around. If many users are leaving tags A and B, they can get more “finalized” and could be harder to remove.
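
As a small illustration, the information carried by a single edit can be split into three sets. The function and key names are hypothetical; the current messages only carry the first two sets.

```python
def classify_edit(before, after):
    """Split one edit into additions, removals, and weakly confirmed tags."""
    before, after = set(before), set(after)
    return {
        "added": sorted(after - before),
        "removed": sorted(before - after),
        # Tags the user saw and left untouched while editing the others.
        "weakly_confirmed": sorted(before & after),
    }
```

For the example above, `classify_edit({"A", "B", "C"}, {"A", "B", "D"})` reports D as added, C as removed, and A and B as weakly confirmed.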

devos50 commented 2 years ago

To make sure that content with tags that overflow the input field can also be edited in a user-friendly manner, I've implemented multi-line editing. The height of the input field automatically expands/shrinks as tags are being added or removed.

Schermafbeelding 2021-10-16 om 13 24 05

devos50 commented 2 years ago

Stack Overflow is using a separate mechanism to build a database of tag synonyms. Users with a certain reputation can suggest tag synonyms. Other users can vote (+1 or -1) on these synonyms. If a synonym proposal reaches a score of 4, it is accepted. If it reaches a score of -2, it is removed. Also see here.

Schermafbeelding 2021-10-16 om 14 44 03

devos50 commented 2 years ago

With #6396 and #6453 merged, an initial version of the tags system is now ready for deployment 🥂 . The design decisions behind this first iteration can be found below.

Design decisions (v0.1)

Core (database, community and REST API)

Design decisions behind the Community:

Design decisions behind the DB:

GUI

Schermafbeelding 2021-10-15 om 10 18 38

Next steps

Also based on feedback received from others, we have discussed the following:

Update: for the time being, we decided to choose the "alter ego" approach where we use a secondary key to authenticate messages (see #6462).

drew2a commented 2 years ago

Based on today's meeting, we decided to use the following threshold logic:

score = added - removed

if score <= -2:
    hide()

if -2 < score < 2:
    suggest()

if score >= 2:
    show()

We will also use Lamport-like clocks for tag operations, keyed by the triplet {public_key, tag, infohash}.
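
The threshold logic and the per-triplet clocks can be combined as in the sketch below. The `TagOperation` and `OperationStore` names, the `ADD`/`REMOVE` codes, and the rule "ignore an operation whose clock is not newer than the stored one" are assumptions for illustration; the thresholds themselves follow the logic above.

```python
from dataclasses import dataclass

ADD, REMOVE = 1, 2  # assumed operation codes


@dataclass(frozen=True)
class TagOperation:
    public_key: str
    tag: str
    infohash: str
    operation: int
    clock: int


class OperationStore:
    """Keeps only the newest operation per (public_key, tag, infohash)."""

    def __init__(self):
        self._latest = {}

    def next_clock(self, public_key, tag, infohash):
        """Clock value a peer should use for its next operation on this triplet."""
        current = self._latest.get((public_key, tag, infohash))
        return (current.clock if current else 0) + 1

    def receive(self, op):
        """Store `op` unless we already hold a newer or equal clock for its triplet."""
        key = (op.public_key, op.tag, op.infohash)
        current = self._latest.get(key)
        if current is not None and current.clock >= op.clock:
            return False
        self._latest[key] = op
        return True

    def score(self, tag, infohash):
        ops = [o for o in self._latest.values() if o.tag == tag and o.infohash == infohash]
        added = sum(o.operation == ADD for o in ops)
        removed = sum(o.operation == REMOVE for o in ops)
        return added - removed


def visibility(score):
    """Map a score to the GUI behaviour agreed on in the meeting."""
    if score <= -2:
        return "hide"
    if score >= 2:
        return "show"
    return "suggest"
```

The clock makes replays harmless: an old "add" that arrives after the same peer's newer "remove" is simply ignored.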

devos50 commented 2 years ago

Exploring Content Moderation in the Decentralised Web: The Pleroma Case

In this paper, we present the first study of federation policies on the DW, their in-the-wild usage, and their impact on users. We identify how these policies may negatively impact “innocent” users and outline possible solutions to avoid this problem in the future.

Schermafbeelding 2021-10-28 om 10 16 03

drew2a commented 2 years ago

The local search by tags has been implemented in https://github.com/Tribler/tribler/pull/6617

Tags within the search query should start with the # character.

image

It extends the current search endpoint by adding the tags keyword.
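
A minimal sketch of how a query could be split into free text and tags is shown below. The regular expression, the lowercasing, and the function name are assumptions for illustration and may differ from what the PR does.

```python
import re
from typing import List, Tuple

# Assumption: a tag is a '#' followed by word characters (no spaces).
TAG_PATTERN = re.compile(r"#(\w+)")


def split_query(query: str) -> Tuple[str, List[str]]:
    """Separate '#tag' tokens from the remaining free-text part of a query."""
    tags = [tag.lower() for tag in TAG_PATTERN.findall(query)]
    text = " ".join(TAG_PATTERN.sub(" ", query).split())
    return text, tags
```

The text part would go to the regular full-text search, while the list of tags would be passed in the tags keyword of the search endpoint.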

As a further direction, this search can be easily switched to a remote search by extending GigaChannelCommunity (it will require adding just a few lines of code).

drew2a commented 2 years ago

I started to implement remote search by tags by extending GigaChannelCommunity but my original assumption "it will require adding just a few lines of code" was wrong.

The processing of remote queries does not tolerate any extensions (that is, it is not backward compatible with extended queries):

https://github.com/Tribler/tribler/blob/d8cf3925a92b9063b01f238674a6df3b7f0018c8/src/tribler-core/tribler_core/components/metadata_store/remote_query_community/remote_query_community.py#L184-L192

This happens because of the implementation of sanitize_query https://github.com/Tribler/tribler/blob/d8cf3925a92b9063b01f238674a6df3b7f0018c8/src/tribler-core/tribler_core/components/metadata_store/remote_query_community/remote_query_community.py#L25-L41

Therefore I need to find another approach for implementing remote search.

drew2a commented 2 years ago

After the discussion with @ichorid I decided to continue work on remote search by tags as an extension of GigaChannelCommunity.

The reason to continue: search by tags will not be backward compatible with previous versions of Tribler, but it doesn't affect users and it is safe to implement.

The remote search has been implemented and merged in #6708

devos50 commented 2 years ago

The first iteration of the tags mechanism has been running successfully for over two weeks now. I wrote a few scripts to analyse my local tags database (source code can be found here) and to see if there are any interesting results. In total, I analysed 1476 tag operations, created by 52 unique users. Of these operations, 1460 are “add” operations and 16 are “delete” operations. A total of 500 torrents have been tagged. I think user engagement with the tag mechanism is somewhat low even though we made tags a prominent part of the GUI. Maybe the fact that there are no tags visible yet is a main reason? This also suggests that it would be helpful to look into more automated approaches to generate missing tags, since currently only a minimal fraction of our content space has been tagged. A total of 34 tags are currently visible in the GUI, meaning that they have been suggested by at least two unique users. This is merely 2.36% of all added tags.

First, I looked at the frequency of each tag, see the log-log plot below. Note that this plot hints at a power-law distribution of tag frequencies. There are a few popular tags which seem to be generic. Most of the tags, however, are unique and only used once or twice. These results are in line with prior work on social tagging systems.

tag_frequencies

The log-log plot below shows how actively users have been tagging content. For each user, we show the number of tags they have created. Again, observe the power-law distribution. We have one user that is quite active and added over 1000 tags. Most users have only created a few tags.

user_to_tag_frequencies

Finally, let’s have a look at the distribution of content being tagged. The log-log plot attached shows for each content item how many tags it has received. There is one piece of content that is subject to 20 tag operations. Most content has received four tags. It is important to keep in mind that popular content is more likely to be tagged and as such, there might be a correlation between torrent popularity and the number of tags it has received.

torrent_to_tag_frequencies

While we still have a relatively small dataset, it is interesting to look at and interpret the results above.

devos50 commented 2 years ago

Proposed Tags Roadmap for Q1/Q2 2022

Two weeks ago, we successfully deployed the first iteration of our tags mechanism. In this first version, users can add and remove tags to and from individual torrents. To address spam, we have implemented a threshold where a tag is only shown when it is suggested by two distinct users.

Today, @drew2a and I discussed the next steps and upcoming improvements. The main focus points are spam and attack-resilience, and automated approaches to generate accurate tags. These two focus points are not mutually exclusive, but we have devised a roadmap for the coming months that allows us to implement the proposed changes incrementally and while avoiding major, breaking changes (probably).

Step 1: automated generation of tags

This is based on very basic and generic heuristics. This will help to populate the user interface with existing tags, and hopefully increase user engagement. The PR that implements this feature has been filed and is almost ready for review.

Release target: v7.12 (Feb. 1st).

Step 2.1: binary voting on tags

I’m currently working on this mechanism - research notes are private.

We introduce a simple, binary voting mechanism on tags. Users can vote for the accuracy/inaccuracy of tags and share these votes with other users. When estimating the trustworthiness of a tag, a user takes the opinion of other users into consideration. Users that created tags that are downvoted consistently have their tags suppressed in the user interface whereas users that create accurate tags have their tags promoted. This mechanism should be resilient against Sybils, collusion and a whitewashing attack.

Step 2.2: agent-based metadata enrichment

We open up our ecosystem to contributions of “metadata enrichment agents”. Each user can launch their own agent to enrich metadata. Users can vote on the generated tags of each agent. This is a key step towards content organisation in the Web3.

Release target: v7.13 (April 1st).

Step 3: introducing rules to automatically annotate content

We introduce lightweight rules to generate tags. The first iteration of this mechanism will introduce basic regex-based rules. A rule is a short description that is capable of quickly generating tags on content that matches the regex. Agents can craft rules and share them with users in the network. These rules are also assigned a reputation and malicious rules should be quickly suppressed by the voting mechanism. This will build upon the results and findings of step 2.1.

Release target: v7.14 (June 1st).
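
To give an idea of what a regex-based rule from step 3 could look like, here is a minimal sketch. The two example rules (text in square brackets or parentheses), the length limits, and the function name are assumptions for illustration; rule sharing and rule reputation are left out.

```python
import re
from typing import Iterable, Pattern, Set

# Hypothetical rules: text inside [square brackets] or (parentheses) becomes a tag.
RULES = [
    re.compile(r"\[([^\[\]]+)\]"),
    re.compile(r"\(([^()]+)\)"),
]


def generate_tags(
    title: str,
    rules: Iterable[Pattern[str]] = RULES,
    min_length: int = 3,
    max_length: int = 50,
) -> Set[str]:
    """Apply every rule to a torrent title and collect the captured tags."""
    tags = set()
    for rule in rules:
        for match in rule.finditer(title):
            tag = match.group(1).strip().lower()
            # Assumed sanity limits on tag length.
            if min_length <= len(tag) <= max_length:
                tags.add(tag)
    return tags
```

Since a rule is just a pattern with one capture group, it is cheap to evaluate on every title and small enough to be shared between peers.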

The combined effect of these components will hopefully be a thriving, decentralized and self-organising ecosystem of tags, secure against various forms of manipulation. In parallel, we can also start working on search and tag navigation but there are no concrete plans for these features as of yet.

synctext commented 2 years ago

Semantic overlay related work. Useful for 'keyword search' on tags. After we passed the magical one million users we can enhance our overlay, perhaps similar to Ethereum. We could design something like what Ethereum has operational: "topic advertisement subsystem". "Topic table" is easier to understand than a "tagging database". Too fancy for us: "Sound and Music Recommendation with Knowledge Graphs"

synctext commented 2 years ago

Dataset for tagging: https://github.com/MTG/mtg-jamendo-dataset 55,525 tracks annotated by 87 genre tags, 40 instrument tags, and 56 mood/theme tags

devos50 commented 2 years ago

The current version of tags has been running successfully for a few months now, and we have seen several tags that have been created by different users. As the next step, we want to use these tags and our existing infrastructure to improve the search experience. Concretely, our first goal is to identify and bundle torrents that describe similar content (for example: ubuntu-22.04-desktop-amd64.iso and ubuntu-22.04.1-desktop-amd64.iso). To describe relations between entities, we might want to add an additional 'relation' field to each tag. This requires modifications to the tags networking protocol and database scheme.
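
As a sketch of the bundling idea, the example below maps torrent titles to a content item and groups them. The regular expression, the "Ubuntu <version>" naming, and the function names are assumptions made for this illustration only.

```python
import re
from collections import defaultdict

# Hypothetical rule: ignore the point release, so 22.04 and 22.04.1 end up together.
UBUNTU = re.compile(r"ubuntu-(\d+\.\d+)(?:\.\d+)?-", re.IGNORECASE)


def content_item(title):
    """Return the content item a title belongs to, or None if no rule matches."""
    match = UBUNTU.search(title)
    return f"Ubuntu {match.group(1)}" if match else None


def bundle(titles):
    """Group titles that describe the same content item."""
    groups = defaultdict(list)
    for title in titles:
        item = content_item(title)
        if item is not None:
            groups[item].append(title)
    return dict(groups)
```

With a 'relation' field, each group member could be stored as a relation between the torrent and its content item, next to the plain tags.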

Our upcoming improvements are also a key step towards readying our infrastructure to build and maintain a global knowledge graph. This knowledge graph can act as a fundamental primitive for upcoming science in the domain of content search, content navigation, and eventually content recommendation.

devos50 commented 1 year ago

A preview of the 'edit metadata' screen that can be used to edit the metadata of particular torrent files. For this first iteration, users can edit the title, description, year and language fields.

196442102-16ef039f-49b6-42b2-a4a0-8a6061cdedcb

synctext commented 1 year ago

library science knowledge - related work. The manifestation versus item abstraction plus tagging.

synctext commented 1 year ago

"Justin Bieber is gay" scientific problem - tag spam

MeritRank is needed to fix this spam issue in Tribler's future. The fans and fame of artists also attract Internet trolls. In the past, we cofounded the MusicBrainz music metadata library. This crowdsourcing library has a unique dataset of votes on tags with explicit spam. See the "gay" and "black metal" spam on the artist Justin Bieber:

Genres              Votes
pop                 16
teen pop            8
dance-pop           0
hip hop             0
tropical house      0
black metal         -1
christmas music     -1
contemporary r&b    -1
electropop          -1

Other tags          Votes
2010s               1
2020s               1
awful               0
english             0
terrible            0
gay                 -2
amazing             -3

Bieber has a profile page. The next step in our semantic search roadmap is modelling the split between concept and materialisation. The knowledge graph should contain both types of entries. See the 1994 early scientific beginnings of semantic data models with the 1) logic perspective and 2) object perspective

solution: gossip, signals, and reputation. Simple central reputation system of central profiles gihtub_reputation_synctext_A

Publication venue: https://www.frontiersin.org/research-topics/19868/human-centered-ai-crowd-computing or https://www.frontiersin.org/research-topics/19653/user-modeling-and-recommendations

synctext commented 1 year ago

Dev meeting brainstorm outcome: Martijn has/had a crawler running with Tag-crowdsourcing. Check status @drew2a and 1-day dataset analysis with live "remove tag" within Tribler 7.13 release?

drew2a commented 1 year ago

Crawler status is:

   Active: active (running) since Thu 2022-09-29 08:44:03 UTC; 7 months 25 days ago

Tags DB analysis

Dataset

select count(*)
from Peer; --408

select count(*)
from Torrent; --12842

select count(*)
from Tag; --4862

select count(*)
from TorrentTag; --17312

select count(*)
from TorrentTagOp; --17670

Add vs Remove operations count

select count(*), TorrentTagOp.operation 
from TorrentTagOp 
group by operation

Count   Operation
17298   Add
372     Remove

image

Operations' distribution among peers

select count(*), TorrentTagOp.operation, TorrentTagOp.peer
from TorrentTagOp
group by operation, peer
image

Normalized:

image

Peers grouped by operation:

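# data: rows of (count, operation, peer) returned by the query above
# operation codes: 1 = add, 2 = remove
peers_add = {peer for _, operation, peer in data if operation == 1}
peers_remove = {peer for _, operation, peer in data if operation == 2}
add_and_remove = peers_add.intersection(peers_remove)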

print(f'Add and Remove: {len(add_and_remove)}')  # 77
print(f'Only Add: {len(peers_add.difference(add_and_remove))}')  # 305
print(f'Only Remove: {len(peers_remove.difference(add_and_remove))}')  # 26
image

drew2a commented 8 months ago

To describe the current state of the KnowledgeCommunity, I will start with a description of the database.

Database

The full schema is available at https://github.com/Tribler/tribler/blob/main/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py

It describes Statements that are linked to the Peer who created them (through a digital signature). To anonymize the peer, a secondary key is used instead of their main public key, as described here: https://github.com/Tribler/tribler/blob/26b0be8c4010a6546f9204bb2f0026597dea53a7/src/tribler/core/components/key/key_component.py#L24-L26

In the KnowledgeGraph, a Statement is an edge with the following (simplified) structure: https://github.com/Tribler/tribler/blob/76de5620ed7f1e991e9d1adc4cb905ae8a5db8b3/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py#L60-L65

Where ResourceType is https://github.com/Tribler/tribler/blob/26b0be8c4010a6546f9204bb2f0026597dea53a7/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py#L32-L57

Statement examples:

SimpleStatement(subject_type=ResourceType.TORRENT, subject='infohash1', predicate=ResourceType.TAG, object='tag1')
SimpleStatement(subject_type=ResourceType.TORRENT, subject='infohash2', predicate=ResourceType.TAG, object='tag2')
SimpleStatement(subject_type=ResourceType.TORRENT, subject='infohash3', predicate=ResourceType.CONTENT_ITEM, object='content item')
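
For readers who do not want to follow the links, the sketch below shows a self-contained approximation of the structures used in the examples above. The field names are taken from those examples; the numeric enum values and the `objects_of` helper are placeholders invented for this sketch, and the real `ResourceType` contains more members.

```python
from dataclasses import dataclass
from enum import IntEnum


class ResourceType(IntEnum):
    # Numeric values are placeholders chosen for this sketch.
    TAG = 1
    TORRENT = 2
    CONTENT_ITEM = 3


@dataclass(frozen=True)
class SimpleStatement:
    """An edge in the knowledge graph: subject --predicate--> object."""

    subject_type: ResourceType
    subject: str
    predicate: ResourceType
    object: str


def objects_of(statements, subject, predicate):
    """Hypothetical helper: follow all edges of one kind from a subject."""
    return [s.object for s in statements if s.subject == subject and s.predicate == predicate]
```

Note that the predicate doubles as the type of the object: a statement with predicate TAG points at a tag, and one with predicate CONTENT_ITEM points at a content item.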

Due to the inherent lack of trust in peers, we cannot simply replace an existing statement with a newly received one. Instead, we store all statement-peer pairs. This approach allows for the potential to downvote or upvote all statements associated with a particular peer, depending on whether that peer becomes trusted or not. Currently, we lack a reliable trust function, and this design was implemented with the anticipation that such a function would be developed in the future. This assumption has proven to be correct, as the design is compatible with MeritRank. Furthermore, it appears that all necessary data is already present within the KnowledgeGraph, making it well-suited for future integration with trust evaluation mechanisms.

There are two operations available for peers:

https://github.com/Tribler/tribler/blob/26b0be8c4010a6546f9204bb2f0026597dea53a7/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py#L26-L29

All operations are recorded in the database, allowing for the calculation of the final score of a specific operation based on the cumulative actions taken by all peers. This approach enables a comprehensive assessment of each operation's overall impact within the network.

Currently, a simplistic approach is employed, which involves merely summing all the 'add' operations (+1) and subtracting the 'remove' operations (-1) across all peers. This method is intended to be replaced by a more sophisticated mechanism, the MeritRank merit function.

https://github.com/Tribler/tribler/blob/26b0be8c4010a6546f9204bb2f0026597dea53a7/src/tribler/core/components/database/db/layers/knowledge_data_access_layer.py#L98-L100

ER diagram

erDiagram
    Peer {
        int id PK "auto=True"
        bytes public_key "unique=True"
        datetime added_at "Optional, default=utcnow()"
    }

    Statement {
        int id PK "auto=True"
        int subject_id FK
        int object_id FK
        int added_count "default=0"
        int removed_count "default=0"
        int local_operation "Optional"
    }

    Resource {
        int id PK "auto=True"
        string name
        int type "ResourceType enum"
    }

    StatementOp {
        int id PK "auto=True"
        int statement_id FK
        int peer_id FK
        int operation
        int clock
        bytes signature
        datetime updated_at "default=utcnow()"
        bool auto_generated "default=False"
    }

    Misc {
        string name PK
        string value "Optional"
    }

    Statement }|--|| Resource : "subject_id"
    Statement }|--|| Resource : "object_id"
    StatementOp }|--|| Statement : "statement_id"
    StatementOp }|--|| Peer : "peer_id"

drew2a commented 8 months ago

The next chapter is dedicated to the community itself.

Community

https://github.com/Tribler/tribler/blob/main/src/tribler/core/components/knowledge/community/knowledge_community.py

The algorithm of the community's operation:

  1. Every 5 seconds, we request 10 StatementOperations from a random peer.

https://github.com/Tribler/tribler/blob/44e2235e0b3bcdc12ae2fcd874bc058474973e5b/src/tribler/core/components/knowledge/community/knowledge_payload.py#L8-L18

  2. Upon receiving a response, we verify the signatures of the operations and their validity.

https://github.com/Tribler/tribler/blob/44e2235e0b3bcdc12ae2fcd874bc058474973e5b/src/tribler/core/components/knowledge/community/knowledge_community.py#L126-L128

https://github.com/Tribler/tribler/blob/44e2235e0b3bcdc12ae2fcd874bc058474973e5b/src/tribler/core/components/knowledge/community/knowledge_community.py#L119-L124

  3. When StatementOperations are requested from us, we select N random operations (the number of operations is specified in the request) and return them.
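
The request and response handling described in the steps above can be sketched as follows. This is not the linked community code: the `StatementOperation` fields, the `max_count` cap, the set of valid operation codes, and the choice to skip (rather than abort on) invalid operations are assumptions. Signature checking is passed in as a callable, so no particular cryptographic API is implied.

```python
import random
from typing import Callable, Iterable, List, NamedTuple, Sequence


class StatementOperation(NamedTuple):
    subject: str
    object: str
    operation: int
    clock: int
    creator_public_key: bytes
    signature: bytes


def select_for_response(
    known: Sequence[StatementOperation],
    requested_count: int,
    rng: random.Random,
    max_count: int = 10,
) -> List[StatementOperation]:
    """Pick random operations to answer a request; the cap is an assumed safeguard."""
    count = max(0, min(requested_count, max_count, len(known)))
    return rng.sample(list(known), count)


def accept_response(
    operations: Iterable[StatementOperation],
    verify_signature: Callable[[StatementOperation], bool],
) -> List[StatementOperation]:
    """Keep only operations that are well-formed and correctly signed."""
    accepted = []
    for op in operations:
        if op.operation not in (1, 2):  # assumed codes: 1 = add, 2 = remove
            continue
        if not verify_signature(op):
            continue
        accepted.append(op)
    return accepted
```

Verifying before storing matters because operations travel transitively: the peer that answers a request is usually not the peer that signed the operation.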

Autogenerated Knowledge

In addition to user-added knowledge statements, there are also auto-generated statements. The KnowledgeRulesProcessor was developed for the automatic generation of knowledge: it analyzes the records in the database and generates knowledge based on predefined regex patterns found in them.

For example, here is the definition of autogenerated tags:

https://github.com/Tribler/tribler/blob/44e2235e0b3bcdc12ae2fcd874bc058474973e5b/src/tribler/core/components/knowledge/rules/rules_general_tags.py#L5-L27

This is a definition of Ubuntu, Debian and Linux Mint content items.

https://github.com/Tribler/tribler/blob/44e2235e0b3bcdc12ae2fcd874bc058474973e5b/src/tribler/core/components/knowledge/rules/rules_content_items.py#L6-L21

Auto-generation of knowledge occurs through two mechanisms:

  1. Traversing all database records in the background.
  2. Generating knowledge for newly created records (this includes records that came to us through remote search).

Auto-generated knowledge is not gossiped across the network.

drew2a commented 8 months ago

The third paragraph is dedicated to the UI.

UI

Three changes have been made to the UI:

  1. Elements for displaying tags have been added.

image

  2. A dialog for editing metadata has been introduced.

image

  3. A snippet has been added that groups elements by the CONTENT_ITEM field.

image

Also, a feature for searching by tags was added, but this feature hasn't been introduced to the users yet.

image

qstokkink commented 4 weeks ago

Tags have now been implemented and even documented (above). With that, this issue is complete.